Meta announces Voicebox, its generative AI model for audio

Today, Meta has announced Voicebox, its latest generative AI model and the follow-up to ImageBind. Voicebox is designed to help creators with speech generation tasks such as audio editing, sampling and stylising, and through in-context learning it can perform tasks it was not specifically trained to do.

Meta says the new model could benefit many people around the world, citing examples such as letting visually impaired people hear written messages from friends in those friends' voices, and allowing people to speak foreign languages in their own voices.

The model can both generate high-quality audio clips and edit pre-recorded audio, removing unwanted noise such as car horns while preserving the content and style of the recording. It is also multilingual, producing speech in six languages. Meta suggests that future uses could include giving natural-sounding voices to virtual assistants or to non-player characters in metaverse games.

Meta has also compared Voicebox with other audio AI models, specifically naming Vall-E and YourTTS as competitors, and reports that Voicebox outperforms both on word error rate and style similarity.
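
For readers unfamiliar with the metric, word error rate is the number of word-level substitutions, insertions and deletions needed to turn a transcript of the generated audio into the reference text, divided by the number of reference words. The minimal Python sketch below illustrates the general definition only; the function name and the example sentences are made up for illustration and this is not Meta's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences: one substituted word out of six.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

A lower score is better: 0 means the transcript matches the reference exactly.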

Voicebox is built on Flow Matching, Meta's latest non-autoregressive generative modelling approach, which can learn a highly non-deterministic mapping between text and speech. This lets Voicebox learn from varied speech data that does not have to be carefully labelled, so the training data can be both more diverse and larger in scale.
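
As a rough illustration of the idea, flow matching trains a network to predict the velocity that carries a noise sample towards a data sample. The sketch below assumes the straight-line path commonly used in flow matching work; the article does not say which variant Voicebox uses, and the function names, the `sigma_min` value and the toy vectors are all hypothetical.

```python
SIGMA_MIN = 1e-5  # assumed small constant; not taken from the article


def flow_matching_pair(noise, data, t, sigma_min=SIGMA_MIN):
    """Return the point on the noise-to-data path at time t and the
    velocity a model would be trained to predict at that point."""
    scale = 1.0 - (1.0 - sigma_min) * t
    x_t = [scale * n + t * d for n, d in zip(noise, data)]
    target_velocity = [d - (1.0 - sigma_min) * n for n, d in zip(noise, data)]
    return x_t, target_velocity


def mean_squared_error(predicted, target):
    """Training loss between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


# Toy two-dimensional example halfway along the path.
x_t, velocity = flow_matching_pair([1.0, 2.0], [3.0, 5.0], t=0.5)
print(x_t, velocity)
```

Because the target depends only on a noise sample, a data sample and a time value, the whole output can be generated in parallel rather than one step at a time, which is what non-autoregressive refers to.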

So far, Voicebox has been trained on more than 50,000 hours of recorded speech and transcripts from public domain audiobooks in English, French, Spanish, German, Polish and Portuguese. It can also predict a speech segment when given the surrounding speech and the transcript of the segment.
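
The segment-prediction task can be pictured as hiding a stretch of audio and asking the model to fill it in from what surrounds it. The sketch below shows only that masking step on a toy list of numbers standing in for audio frames; the function name, the frame values and the use of zero as a placeholder are assumptions made for illustration, not details from Meta.

```python
def mask_span(frames, start, length, mask_value=0.0):
    """Hide a contiguous span of frames.

    Returns the surrounding context (with the span blanked out), a
    boolean mask marking the hidden positions, and the hidden frames
    that a model would be asked to predict.
    """
    end = min(start + length, len(frames))
    mask = [start <= i < end for i in range(len(frames))]
    context = [mask_value if hidden else f for f, hidden in zip(frames, mask)]
    target = [f for f, hidden in zip(frames, mask) if hidden]
    return context, mask, target


# Toy "audio": six frames, with frames 2 to 4 hidden.
context, mask, target = mask_span([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], start=2, length=3)
print(context)
print(target)
```

In this picture, the model would receive the context, the mask and the transcript, and would be scored on how well it reconstructs the hidden frames.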

Lastly, Meta notes that while the technology could usher in a new era of generative AI for speech, it also carries the potential for misuse and unintended harm.

Meta's research paper on Voicebox will detail how the company built a highly effective classifier that can distinguish authentic speech from Voicebox-generated speech.

Meta will not be making the AI program itself available for public use, nor will it be releasing the source code.