Microsoft VALL-E will imitate our voice after just 3 seconds of speaking

An AI that has never heard you speak can imitate your voice after listening to just 3 seconds of it. This is the latest achievement of Microsoft's artificial intelligence research: the VALL-E text-to-speech model, which can copy anyone's voice from a 3-second speech sample.

Its name recalls DALL-E, but the model specializes in audio, and its text-to-speech demos became popular as soon as they were released online.

Some users said that combining VALL-E with ChatGPT would give amazing results. For others, the day when it will be possible to make video calls with an AI does not seem far away. There are even those who joke that, now that AI has taken care of the writers and painters, the voice actors are next.

But how does VALL-E imitate a voice it has never heard before from just 3 seconds of audio?

VALL-E treats audio the way a language model treats text. It synthesizes speech in voices that the AI never heard during training, i.e. zero-shot learning.

Traditional text-to-speech solutions basically rely on pre-training followed by fine-tuning. Used in a zero-shot scenario, they produce speech with poor speaker similarity and naturalness.

Against this background, VALL-E proposes a different idea from traditional speech synthesis models.

Whereas traditional models use the mel spectrogram as an intermediate feature, VALL-E treats speech synthesis directly as a language modeling task. The mel spectrogram is a continuous representation, while the tokens of a language model are discrete.

In particular, the traditional speech synthesis process usually follows the path “phoneme → mel-spectrogram → waveform”.

VALL-E turns this process into “phoneme → discrete audio coding → waveform”.
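
To make the idea of a "discrete audio coding" concrete, here is a minimal sketch that turns a continuous waveform into integer tokens and back. It uses 8-bit mu-law companding, a classic and much simpler scheme than the learned codec VALL-E relies on, so it only illustrates the principle. The sine wave, the sample rate and all function names are assumptions made for this example.

```python
import math

MU = 255  # 8-bit mu-law: 256 possible token values

def encode(samples):
    """Map continuous samples in [-1, 1] to integer tokens in [0, 255]."""
    tokens = []
    for x in samples:
        x = max(-1.0, min(1.0, x))  # clip to the valid range
        y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
        tokens.append(int(round((y + 1) / 2 * MU)))
    return tokens

def decode(tokens):
    """Map integer tokens back to approximate continuous samples."""
    samples = []
    for t in tokens:
        y = 2 * t / MU - 1
        samples.append(math.copysign(((1 + MU) ** abs(y) - 1) / MU, y))
    return samples

# A short 440 Hz sine wave sampled at 16 kHz stands in for real speech.
waveform = [0.8 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(64)]
tokens = encode(waveform)       # discrete: a list of small integers
restored = decode(tokens)       # continuous again, with a small error
max_error = max(abs(a - b) for a, b in zip(waveform, restored))
```

Once audio is a sequence of integers like `tokens`, it can be modeled with the same tools used for text, which is what makes the language-model view of speech synthesis possible.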

In terms of model design, VALL-E resembles VQ-VAE in that it quantizes audio into a series of discrete tokens. The first quantizer is responsible for capturing the audio content and the identity characteristics of the speaker, while the subsequent quantizers refine the signal so that it sounds more natural.
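
The division of labor between a first quantizer and later refining quantizers can be sketched with a toy residual quantizer. This is a simplified scalar version with hand-picked codebooks, not the learned vector codebooks of a real neural codec. The codebook sizes, step values and function names are assumptions for illustration only.

```python
def make_codebook(step, levels=9):
    """Evenly spaced scalar codebook centred on zero (hypothetical values)."""
    half = levels // 2
    return [step * (i - half) for i in range(levels)]

# Each stage uses a finer codebook than the one before it.
CODEBOOKS = [make_codebook(0.25), make_codebook(0.05), make_codebook(0.01)]

def rvq_encode(value, codebooks=CODEBOOKS):
    """Return one token per stage; each stage quantizes the previous residual."""
    tokens, residual = [], value
    for codebook in codebooks:
        index = min(range(len(codebook)), key=lambda i: abs(codebook[i] - residual))
        tokens.append(index)
        residual -= codebook[index]  # what is left for the next stage to refine
    return tokens

def rvq_decode(tokens, codebooks=CODEBOOKS):
    """Sum the selected entries; using fewer tokens gives a coarser result."""
    return sum(codebook[t] for codebook, t in zip(codebooks, tokens))

value = 0.637
tokens = rvq_encode(value)
# Reconstruction error when decoding with only the first 1, 2, then 3 tokens.
errors = [abs(value - rvq_decode(tokens[:k])) for k in range(1, len(tokens) + 1)]
```

In this toy setup the first token already gives a rough approximation, and each additional token shrinks the error, which mirrors the "coarse first, refinement after" split described above.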

Then, conditioned on the text and the 3-second audio prompt, it autoregressively generates the discrete audio codes.
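
The shape of such an autoregressive loop can be sketched as follows. The "model" here is a deterministic toy function, not a trained network, and every name, token ID and vocabulary size is a hypothetical placeholder. The sketch only shows how each generated token is appended to the context that conditions the next one.

```python
EOS = 0            # hypothetical end-of-sequence token
VOCAB_SIZE = 1024  # hypothetical number of distinct audio codes

def toy_next_token(context):
    """Stand-in for a trained language model.

    A real model would return a probability distribution over audio codes;
    this toy simply derives one code deterministically from the context.
    """
    return (sum(context) * 31 + len(context)) % VOCAB_SIZE

def generate(phoneme_tokens, prompt_audio_tokens, max_new_tokens=20):
    """Autoregressively extend the audio token sequence one token at a time."""
    context = list(phoneme_tokens) + list(prompt_audio_tokens)
    generated = []
    for _ in range(max_new_tokens):
        token = toy_next_token(context)
        if token == EOS:
            break
        generated.append(token)
        context.append(token)  # the new token conditions the next prediction
    return generated

phonemes = [12, 7, 33, 5]   # hypothetical phoneme IDs for the target text
prompt = [101, 640, 87]     # hypothetical codes from the 3-second recording
audio_tokens = generate(phonemes, prompt)
```

Changing either the phonemes or the prompt codes changes the whole generated sequence, which is the sense in which the output is conditioned on both the text and the speaker sample.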

But that is not all: in addition to zero-shot speech synthesis, VALL-E also supports voice editing and, combined with GPT-3, voice content creation.

The ambient background sound can also be restored

Judging by the synthesized samples, VALL-E can reproduce more than just the speaker's timbre.

Not only does it closely imitate the speaker's intonation, it can also produce a variety of speaking rates. For example, when the same sentence is synthesized twice, VALL-E may deliver it at two different speeds while the similarity to the speaker's voice remains high.

At the same time, the ambient background sound of the speaker's recording can also be accurately reproduced.

Additionally, VALL-E can mimic a variety of the speaker's emotions, including anger, sleepiness, neutrality, amusement, and disgust.

It is worth mentioning that the dataset used to train VALL-E is not particularly large.

Compared with OpenAI's Whisper, which required 680,000 hours of audio for training, VALL-E used only about 60,000 hours from more than 7,000 speakers, yet it surpassed the pre-trained text-to-speech model YourTTS in terms of speaker similarity.

Furthermore, YourTTS had already heard the voices of 97 of the 108 test speakers during training, but it still falls short of VALL-E in the actual test.

As for the fields in which it can be applied:

It can be used to reproduce your own voice, for example to help people with disabilities hold a conversation with others, or to speak on your behalf when you would rather not. Of course, it can also be used to record audiobooks.

However, VALL-E is not open source yet and you may need to wait a little longer to try it out.

Pierpaolo Figuccia

Nerd, passionate about technology, photography and video maker. And of course I love Xiaomi products!

