Success stories

Elhuyar’s neural TTS system: voice cloning

Overview

Elhuyar’s neural text-to-speech (TTS) system converts text into speech using AI-based technology developed by Orai.

In addition to offering its in-house TTS voices, Elhuyar can generate custom TTS voices. The system works in six languages, which gives users more options for automatically turning specific content or standalone texts into speech. A custom voice can generate speech in any of the six languages, even if no recordings of that voice are available in the required language. Even where the TTS system has not been trained for a specific language, it can generate speech that imitates the voice in a small sample. The TTS voices can be used in three ways: through the text box on the ttsneuronala.elhuyar.eus website, through a reading bar embedded in web pages, or through integration via a REST API.
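As a rough illustration of the REST API option, the following Python sketch builds a request that an application might send to a TTS service. It is not Elhuyar’s documented API: the endpoint URL, the JSON field names (text, language, voice), the bearer-token authentication, the voice identifier and the language code "eu" are all assumptions made for the example. The request is only built, not sent.

```python
import json
import urllib.request

# Hypothetical placeholder endpoint: the real service URL and request
# schema are not described in this text and must come from the provider.
PLACEHOLDER_ENDPOINT = "https://tts.example.invalid/v1/synthesize"


def build_tts_request(text, language, voice, api_key,
                      endpoint=PLACEHOLDER_ENDPOINT):
    """Build (but do not send) a JSON POST request for a TTS service.

    The field names and the bearer-token header are assumptions.
    """
    if not text.strip():
        raise ValueError("text must not be empty")
    payload = {"text": text, "language": language, "voice": voice}
    # ensure_ascii=False keeps non-ASCII characters readable in the body.
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


# Illustrative values only: the language code and voice name are made up.
request = build_tts_request("Kaixo, mundua!", "eu", "custom-voice-1",
                            "YOUR_API_KEY")
```

With a real endpoint and credentials, the request could be sent with urllib.request.urlopen(request) and the audio read from the response body. Keeping request construction separate from sending makes the payload easy to inspect before any network call is made.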

Challenge

Modern text-to-speech (TTS) systems face major challenges as they grow in complexity and capability. Voice cloning requires high fidelity and expressiveness from a minimum of training data. Zero-shot TTS aims to synthesize a new voice without explicit training on that voice, while preserving natural prosody and speaker similarity. Cross-lingual TTS adds another level of complexity because it requires accurate pronunciation and intonation across a wide variety of languages, often with limited multilingual data per speaker. All of these capabilities must be balanced against computational efficiency and real-time performance.

Collaboration

Elhuyar’s AI and neural network resources use state-of-the-art technology that Orai develops and constantly updates.

Result

Neural TTS makes it possible to generate authentic-sounding custom voices in six languages, to integrate a reading bar into websites, and to integrate synthetic voices into applications via APIs.
