Microsoft’s VALL-E Generates Speech From Just 3 Seconds of Audio

But it could lead to a proliferation of deepfake voices.

Ben Wodecki, AI Business

January 11, 2023


Microsoft has unveiled VALL-E: an AI model that can generate speech audio from just three-second samples.

VALL-E can perform text-to-speech (TTS) synthesis from very little prior data and could be used for tasks such as speech editing and content creation when combined with other generative AI models like GPT-3.

Trained on 60,000 hours of English-language speech from Meta’s LibriLight audio library, VALL-E essentially mimics the target speaker, predicting what they would sound like when speaking a desired text input. It can also preserve the emotion of the speaker in the sample audio.

Audio samples demonstrating VALL-E are available on GitHub. According to the Microsoft researchers behind it, the model “significantly outperforms” other zero-shot TTS systems in terms of speech naturalness and speaker similarity.

One possible use for VALL-E could be to narrate audiobooks. Just last week, Apple published a series of audiobooks narrated by an AI voice via its Books app.

For Microsoft, VALL-E represents its latest foray into generative AI. The tech giant is already exploring ways to incorporate OpenAI’s ChatGPT into its Bing search engine and Office line of products.

VALL-E: How Does It Work?

Microsoft describes VALL-E as a neural codec language model. The model was trained on discrete codes derived from the LibriLight library.
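
In broad strokes, that pipeline has three stages: a neural audio codec turns the three-second enrollment clip into discrete acoustic codes, a Transformer language model predicts the codes for the new utterance from the target text plus that prompt, and the codec’s decoder turns the predicted codes back into audio. The Python sketch below is only a minimal illustration of that flow; every class and function name in it is a hypothetical placeholder (the toy quantization stands in for a learned codec), not Microsoft’s code or API.

# Conceptual sketch of a neural-codec-language-model TTS pipeline.
# All names are hypothetical placeholders, not Microsoft's implementation.
from dataclasses import dataclass
from typing import List

@dataclass
class CodecTokens:
    codes: List[int]  # discrete acoustic codes (the "language" the model predicts)

def encode_audio(waveform: List[float]) -> CodecTokens:
    # Stand-in for a neural codec encoder that maps raw audio to discrete codes;
    # a real system uses a learned codec, not this toy quantization.
    return CodecTokens(codes=[int((x + 1.0) * 511) % 1024 for x in waveform])

def phonemize(text: str) -> List[str]:
    # Stand-in for a grapheme-to-phoneme front end.
    return list(text.lower())

def generate_codes(phonemes: List[str], prompt: CodecTokens) -> CodecTokens:
    # Stand-in for the autoregressive Transformer: conditioned on the target
    # text's phonemes and the enrollment prompt's codes, it predicts acoustic
    # codes for the new utterance in the prompt speaker's voice. This toy
    # version just recycles prompt codes instead of sampling new ones.
    return CodecTokens(codes=prompt.codes[: 75 * len(phonemes)])

def decode_audio(tokens: CodecTokens) -> List[float]:
    # Stand-in for the codec decoder that maps codes back to a waveform.
    return [c / 1024.0 for c in tokens.codes]

# Usage: a 3-second enrollment clip plus target text yields a synthesized waveform.
enrollment_clip = [0.0] * (24_000 * 3)                      # placeholder 3 s of audio
prompt_codes = encode_audio(enrollment_clip)                # audio -> discrete codes
target_phonemes = phonemize("Hello from a cloned voice")
new_codes = generate_codes(target_phonemes, prompt_codes)   # language model over codes
waveform = decode_audio(new_codes)                          # codes -> audio

In VALL-E itself, the discrete codes come from a learned audio codec and the code-predicting Transformer is trained on the 60,000-hour corpus, which is what lets it clone a voice it has never seen from only a three-second prompt.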


During the pre-training stage, the training data used to build VALL-E was scaled up to make it “hundreds of times larger than existing [TTS] systems” like CereProc’s CereVoice or ReadSpeaker, according to the research team behind the model.

“While advanced TTS systems can synthesize high-quality speech from single or multiple speakers, it still requires high-quality clean data from the recording studio. Large-scale data crawled from the Internet cannot meet the requirement, and always lead to performance degradation,” according to the paper’s authors.

“Because the training data is relatively small, current TTS systems still suffer from poor generalization. Speaker similarity and speech naturalness decline dramatically for unseen speakers in the zero-shot scenario.”


