Character.AI unveils TalkingMachines for real-time, "FaceTime-style" video generation

Character.AI has shared some research it has been doing on generative video. The company has developed a new autoregressive diffusion model called TalkingMachines that’s able to generate real-time, audio-driven video of AI characters from just an image and a voice signal. With this, the company is getting closer to FaceTime-style visual interactions with AI characters.

It’s important to understand that at this point, it’s still research. There is a research paper and video demos, but you cannot use it in the Character.AI app yet.

If this ever does filter down to the Character.AI app, it will let users engage in more immersive roleplay with AI, take part in interactive storytelling, and do visual world-building.

The new TalkingMachines model is built on Diffusion Transformer (DiT) technology, which works like an artist creating a detailed image from random noise, refining it step by step until it looks right. What Character.AI has done is make this process fast enough to feel real-time.
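To make the "refining noise" idea concrete, here is a toy sketch of iterative denoising. It is purely illustrative and not Character.AI's actual model: a real diffusion model learns the noise predictor from data, whereas this stand-in simply points toward a known target.

```python
import numpy as np

rng = np.random.default_rng(0)
target = rng.random((8, 8))          # stand-in for a "clean" image
x = rng.standard_normal((8, 8))      # start from pure random noise

def predict_noise(sample, clean):
    # A trained model estimates this; here we cheat by using the target.
    return sample - clean

for step in range(50):
    # Each step removes a little of the predicted noise,
    # gradually refining the sample toward a clean image.
    x = x - 0.1 * predict_noise(x, target)

print(np.abs(x - target).mean())  # the remaining error is tiny after refinement
```

The point of the loop is the key DiT intuition: generation is many small refinement steps, and the speed of the overall system depends on how few of those steps you can get away with.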

To achieve its breakthroughs, TalkingMachines leverages several key techniques, including: Flow-Matched Diffusion, Audio-Driven Cross Attention, Sparse Causal Attention, and Asymmetric Distillation.

The Flow-Matched Diffusion component is trained on a wide range of motion, from subtle facial expressions to more dramatic gestures, which helps AI characters move more naturally. The most exciting work is done by the Audio-Driven Cross Attention, which lets the AI not just hear words but also understand the rhythm, pauses, and inflections in the audio, and then translate these into precise mouth movements, head nods, and eye blinks.
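The general shape of cross-attention can be sketched in a few lines. This is a generic illustration of the mechanism, not Character.AI's implementation: video-frame queries attend over audio-feature keys and values, so each frame can weight the parts of the audio that should drive its motion.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 16
frames = rng.standard_normal((4, d))   # 4 video-frame embeddings (queries)
audio = rng.standard_normal((10, d))   # 10 audio-feature embeddings (keys/values)

scores = frames @ audio.T / np.sqrt(d)   # how relevant each audio slice is to each frame
weights = softmax(scores, axis=-1)       # each frame's attention distribution over the audio
out = weights @ audio                    # audio-conditioned frame features

print(weights.shape, out.shape)  # (4, 10) (4, 16)
```

Because the weights form a distribution over the audio timeline, a frame can latch onto a pause or an inflection and animate the mouth, head, and eyes accordingly.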

With Sparse Causal Attention, Character.AI can process the video frames much more cost-efficiently, and with Asymmetric Distillation, videos can be generated in real time, making it feel like a FaceTime call.
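The cost saving from sparse causal attention comes from restricting what each frame can look at. The sketch below shows the general idea as we understand it (Character.AI's exact scheme is not public): frame t attends only to itself and a small window of past frames, never to future frames, instead of the full history.

```python
import numpy as np

def sparse_causal_mask(n_frames, window):
    """Boolean mask: mask[t, s] is True if frame t may attend to frame s."""
    mask = np.zeros((n_frames, n_frames), dtype=bool)
    for t in range(n_frames):
        lo = max(0, t - window)
        mask[t, lo:t + 1] = True   # only self and the previous `window` frames
    return mask

mask = sparse_causal_mask(6, window=2)
print(mask.astype(int))
```

Each row has at most `window + 1` allowed positions instead of `n_frames`, so attention cost grows linearly with the video length rather than quadratically, which is what makes streaming generation affordable.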

Character.AI insists that this research breakthrough isn’t just about facial animation. It says that it’s a step towards interactive audiovisual AI characters that you can interact with in real time. The model supports a wide range of styles, including photorealistic humans, anime, and 3D avatars, and it enables streaming with natural listening and speaking phases.

This feature is not ready for the app just yet, with Character.AI saying it is still in research. If the company does launch it, it will be among the first, if not the first, to achieve this, a notable milestone in the AI race.
