Google's DeepMind learns to reproduce human speech, tricks us into starting robot apocalypse

Google's DeepMind division is famous for its artificial intelligence work. Now the team has developed a new way of creating speech that's better and more realistic-sounding than everything else.

Vlad Dudau News Editor Neowin @avladd · Sep 9, 2016 12:30 EDT

Google’s DeepMind AI division is famous for defeating the Go world champion and for performing medical research with the UK’s NHS. Now the team has developed a very impressive new technology that allows the AI, or deep neural network, to mimic human speech.

Talking to our robots, and having them answer us back, has long been a sort of dream for the field of artificial intelligence. In recent years the technology has gotten better, as most of you know from Siri, Cortana or Google Now voice interactions. But even our powerful digital assistants still rely on pre-recorded human voices, or they quickly turn cold and robotic.

However, the work done by the DeepMind engineers may help change that forever. The team decided to put speech synthesis through a deep neural network, something that wasn’t really expected to work. The result was an algorithm that could understand the way sounds follow each other on different timescales during speech in English and Mandarin. And what’s even more surprising is that the resulting program can seemingly outperform current state-of-the-art systems.

After producing each output, WaveNet feeds it back into the system for the next prediction.
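The feedback loop described above can be sketched in a few lines of Python. This is only a toy illustration, not DeepMind's model: the function `toy_next_sample_distribution` is a hypothetical stand-in for the trained neural network, and the four amplitude levels, the seed, and the smoothing rule are all assumptions made for the example.

```python
import random


def toy_next_sample_distribution(history, levels=4):
    """Hypothetical stand-in for the trained network (an assumption).

    Returns one probability per amplitude level, given the samples
    generated so far.
    """
    last = history[-1] if history else 0
    # Favor levels close to the previous sample so the output changes smoothly.
    weights = [1.0 / (1 + abs(level - last)) for level in range(levels)]
    total = sum(weights)
    return [w / total for w in weights]


def generate(num_samples, seed=0, levels=4):
    """Generate samples one at a time, feeding each one back as input."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        # Predict the next sample from everything produced so far.
        probs = toy_next_sample_distribution(samples, levels)
        next_sample = rng.choices(range(levels), weights=probs, k=1)[0]
        # The new sample becomes part of the input for the next prediction.
        samples.append(next_sample)
    return samples


waveform = generate(16)
print(waveform)
```

Because every new sample depends on the ones before it, generation has to proceed one step at a time, which goes some way toward explaining why this kind of approach is heavy on processing.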

As mentioned above, existing text-to-speech (TTS) implementations usually rely on so-called concatenative algorithms. These work by employing a huge library of pre-recorded human sounds and phonemes with varying emphasis and emotions. Siri, Cortana, and even the voice of LCARS (will) work this way. It’s the reason why text-to-speech sometimes sounds choppy and emphasis may be wrongly placed on some syllables.

Parametric TTS, meanwhile, generates the sounds directly by relying on the information contained in the algorithm, without the need for a human recording. In principle, this approach should work better and be easier on our ears, but in practice it produces the quintessential robotic voice. You’re probably familiar with the likes of Microsoft Sam from Windows XP, but even state-of-the-art parametric TTS systems perform worse than concatenative versions.

However, DeepMind’s deep neural network model, dubbed WaveNet, performs better than either of those two previous implementations. You can listen for yourselves to the samples above: the first one is parametric, the second is concatenative (think Siri), and the last is the new WaveNet.

Even more impressive is that using this approach, the speech can be molded and altered to fit different voices, tones or emotions, without altering any of the underlying algorithm or databases. Also interesting is that the algorithm can create speech out of nothing, without the need for inputted text sequences during the training process. It mixes and matches sounds for a spooky, alien-like rendition of something that sounds close to, but is not quite, English. What’s really fascinating here is that the system even creates breathing and mouth movement sounds, for an even more realistic-sounding voice.

Finally, as a fun experiment, the researchers allowed the system to create music after it was given a classical piano dataset. Note that the system wasn’t instructed how to create music; it was just left to its own devices. The result? Well, listen for yourselves.

Unfortunately, there's little chance of us seeing these features on our devices in the very near future, because WaveNet still relies on heavy processing. Given the advances in bandwidth and processors, though, our digital assistants may soon sound more real than ever. You can check out the full research paper here, or hit the source link below for extra samples and a few more technical details from the DeepMind team.

Source: DeepMind