The Azure AI Speech Personal Voice feature has been upgraded to a new zero-shot text-to-speech (TTS) model called DragonV2.1Neural. Because the model is zero-shot, voices can be created from minimal data. The new model promises “more natural-sounding and expressive voice” with “improved pronunciation accuracy and greater controllability.”

The new model can synthesize speech in over 100 languages from just a few seconds of a voice sample. The previous DragonV1 model struggled with pronunciation, especially of named entities.

The new model can be used for a range of applications, including customizing chatbot voices and dubbing video content in an actor's original voice across multiple languages.

According to Microsoft, DragonV2.1 makes voices sound more natural, “offering more realistic and stable prosody while maintaining better pronunciation accuracy.” The model also shows an average 12.8% relative reduction in word error rate (WER) compared to DragonV1. When using this model, you'll have fine-grained control over pronunciation and accent through SSML phoneme tags and custom lexicons.
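
To make the phoneme-tag idea concrete, here is a minimal Python sketch that builds an SSML document in which one word carries an explicit IPA pronunciation. Several details are assumptions rather than anything confirmed in Microsoft's announcement: the voice name simply reuses the model name mentioned above (the name the service actually expects may differ), the speaker profile ID is a hypothetical placeholder, and the helper `build_ssml` and the sample IPA string are purely illustrative.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Assumption: the SSML voice name matches the model name used in this article.
VOICE_NAME = "DragonV2.1Neural"
# Hypothetical placeholder; a real ID comes from creating a personal voice.
SPEAKER_PROFILE_ID = "YOUR-SPEAKER-PROFILE-ID"


def build_ssml(before, word, ipa, after, lang="en-US"):
    """Return SSML in which `word` is pronounced according to `ipa`."""
    # Standard SSML phoneme element: the text content is the written word,
    # the `ph` attribute is the pronunciation to use instead.
    phoneme = f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(word)}</phoneme>'
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="http://www.w3.org/2001/mstts" '
        f"xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(VOICE_NAME)}>"
        f"<mstts:ttsembedding speakerProfileId={quoteattr(SPEAKER_PROFILE_ID)}>"
        f"{escape(before)}{phoneme}{escape(after)}"
        "</mstts:ttsembedding></voice></speak>"
    )


ssml = build_ssml("I say ", "tomato", "təˈmeɪtoʊ", ".")
ET.fromstring(ssml)  # raises ParseError if the SSML is not well-formed XML
print(ssml)
```

The resulting string could then be handed to a synthesizer, for example through the Speech SDK's `speak_ssml_async` method. A custom lexicon would serve the same purpose for words that recur across many requests, since it avoids tagging each occurrence by hand.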

With this model, Microsoft gives you control over the accent, which is crucial for speech and video translation as well as for mimicking specific individuals. To help users get started, it has built several voice profiles, such as Andrew, Ava, and Brian, for testing.

Microsoft’s new model increases the risk of deepfakes produced by malicious actors. To try to prevent misuse, the company is asking users to agree to usage policies that require explicit consent from the original speaker and disclosure of synthetic content, and that prohibit impersonation or deception.

The Redmond giant will also automatically add watermarks to speech output. This technology reaches 99.7% detection accuracy in various audio editing scenarios, which could help reduce misuse of the AI voices.

You can try the personal voice feature on Speech Studio as a test, or apply for full access to the API for business use.

