How to Create Talking Virtual Characters with Microsoft GAIA
GAIA is a Microsoft research framework that synthesizes natural-looking talking avatar videos from speech and a single portrait image.
It accepts voice, video, or text input to drive the avatar, and even supports text prompts such as “sad,” “open mouth,” or “surprised” to guide video generation.
It also lets you precisely control the avatar’s facial movements, such as a smile or an expression of surprise.
The main features:
1. Generate a talking virtual character from voice: Give GAIA a voice recording and it creates a video of an avatar whose lips and facial expressions move in sync with the audio.
2. Generate a talking virtual character from video: GAIA can observe the movements of a real person in a video and create a virtual character that imitates them.
3. Control the avatar’s head pose: You can direct the avatar to perform specific head movements, such as nodding or shaking its head.
4. Fully control the avatar’s expressions: GAIA lets you precisely control individual facial movements, such as a smile or an expression of surprise.
5. Generate virtual character actions from text instructions: You can give GAIA a text instruction such as “Please smile”, and it will create a virtual character video that follows it. (The sketch after this list summarizes these driving inputs.)
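GAIA is a research project and, as far as I know, does not ship a public SDK, so the snippet below is only a sketch of how the driving inputs listed above could be bundled into a single request. Every name in it (the `GenerationRequest` dataclass and its fields) is a hypothetical placeholder for illustration, not an actual GAIA interface.

```python
# Hypothetical illustration only: GAIA does not expose a public API,
# so this dataclass simply summarizes the inputs the feature list describes.
from dataclasses import dataclass
from typing import Optional

@dataclass
class GenerationRequest:
    portrait_image: str                     # single reference portrait (features 1-5)
    speech_audio: Optional[str] = None      # voice recording driving lips/expressions (feature 1)
    driving_video: Optional[str] = None     # real-person video to imitate (feature 2)
    head_pose: Optional[str] = None         # e.g. "nod" or "shake" (feature 3)
    text_instruction: Optional[str] = None  # e.g. "Please smile" (features 4-5)

# Example: speech-driven generation with an optional text instruction.
request = GenerationRequest(
    portrait_image="portrait.png",
    speech_audio="speech.wav",
    text_instruction="Please smile",
)
print(request)
```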
How GAIA works:
1. Separate motion and appearance representations:
• GAIA first separates each video frame into a motion representation and an appearance representation. This lets it distinguish the parts that move while speaking (such as the lips) from the parts that stay the same (such as hair or eye position).
2. Use a variational autoencoder (VAE):
• A VAE encodes these separated representations of each video frame and reconstructs the original frame from them. This step teaches the model to accurately capture and reproduce a character’s facial features and expressions (a minimal sketch of this encode/decode step follows this list).
3. Speech-conditioned motion sequence generation:
• A diffusion model is optimized to generate motion sequences conditioned on a speech sequence and a reference portrait image. Given a speech input (such as a spoken sentence), the model produces the corresponding facial motion (see the sampling sketch after this list).
4. Inference:
• At inference time, the diffusion model takes the input speech sequence and the reference portrait as conditions and generates a motion sequence, which is then decoded into a video showing the avatar’s speech and facial expressions.
5. Control and text instructions:
• GAIA also lets you control arbitrary facial attributes by editing facial landmarks during generation, or generate avatar video clips from text instructions (a toy landmark-editing example also follows this list).
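To make steps 1–2 concrete, here is a minimal PyTorch sketch of a VAE with separate appearance and motion encoders and a shared decoder. It is an illustration of the disentanglement idea described above, not GAIA’s actual architecture: the layer sizes, the use of 68 facial landmarks as the motion input, and all module names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DisentangledVAE(nn.Module):
    """Toy VAE: encodes a frame into separate appearance and motion latents
    and reconstructs the frame from both (illustrative only)."""

    def __init__(self, frame_dim=64 * 64 * 3, motion_dim=68 * 2, latent_dim=128):
        super().__init__()
        # Appearance encoder: sees the raw frame (identity, hair, texture).
        self.app_enc = nn.Sequential(nn.Linear(frame_dim, 512), nn.ReLU(),
                                     nn.Linear(512, 2 * latent_dim))
        # Motion encoder: sees per-frame facial landmarks (lips, head pose).
        self.mot_enc = nn.Sequential(nn.Linear(motion_dim, 256), nn.ReLU(),
                                     nn.Linear(256, 2 * latent_dim))
        # Decoder reconstructs the frame from both latents together.
        self.dec = nn.Sequential(nn.Linear(2 * latent_dim, 512), nn.ReLU(),
                                 nn.Linear(512, frame_dim))

    @staticmethod
    def reparameterize(stats):
        mu, logvar = stats.chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return z, mu, logvar

    def forward(self, frame, landmarks):
        z_app, mu_a, lv_a = self.reparameterize(self.app_enc(frame))
        z_mot, mu_m, lv_m = self.reparameterize(self.mot_enc(landmarks))
        recon = self.dec(torch.cat([z_app, z_mot], dim=-1))
        # Standard VAE objective: reconstruction error plus KL terms for both latents.
        kl = lambda mu, lv: -0.5 * torch.mean(1 + lv - mu.pow(2) - lv.exp())
        loss = F.mse_loss(recon, frame) + kl(mu_a, lv_a) + kl(mu_m, lv_m)
        return recon, loss

# Smoke test with random tensors standing in for a frame and its landmarks.
vae = DisentangledVAE()
frame = torch.rand(4, 64 * 64 * 3)
landmarks = torch.rand(4, 68 * 2)
recon, loss = vae(frame, landmarks)
loss.backward()
```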
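Steps 3–4 describe a diffusion model that generates the motion sequence conditioned on speech features and the reference portrait. The sketch below shows that conditioning-and-sampling pattern in heavily simplified form: the MLP denoiser, the 50-step schedule, the crude update rule, and the feature dimensions are all assumptions for illustration, not GAIA’s published configuration.

```python
import torch
import torch.nn as nn

T = 50  # number of diffusion steps (assumed; real models typically use many more)

class MotionDenoiser(nn.Module):
    """Predicts the noise in a motion sequence, conditioned on speech
    features and an appearance latent from the reference portrait."""

    def __init__(self, motion_dim=136, speech_dim=80, app_dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(motion_dim + speech_dim + app_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, motion_dim))

    def forward(self, noisy_motion, speech, appearance, t):
        # Broadcast the per-clip appearance latent and timestep over all frames.
        frames = noisy_motion.shape[1]
        app = appearance.unsqueeze(1).expand(-1, frames, -1)
        step = torch.full(noisy_motion.shape[:2] + (1,), float(t) / T)
        return self.net(torch.cat([noisy_motion, speech, app, step], dim=-1))

@torch.no_grad()
def sample_motion(denoiser, speech, appearance):
    """Very simplified DDPM-style sampling: start from noise and denoise
    step by step, always conditioned on the speech and the portrait latent."""
    batch, frames = speech.shape[:2]
    motion = torch.randn(batch, frames, 136)
    for t in reversed(range(T)):
        pred_noise = denoiser(motion, speech, appearance, t)
        motion = motion - pred_noise / T  # crude update, not a real DDPM schedule
    return motion

# Example: 1 clip, 100 frames of 80-dim speech features, 128-dim portrait latent.
denoiser = MotionDenoiser()
speech = torch.rand(1, 100, 80)
portrait_latent = torch.rand(1, 128)
motion_seq = sample_motion(denoiser, speech, portrait_latent)  # shape (1, 100, 136)
```

The sampled motion sequence would then be handed to the decoder from step 2, together with the appearance latent of the portrait, to render the final video frames.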
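Step 5 mentions controlling facial attributes by editing facial landmarks during generation. Below is a toy example of such an edit, assuming the motion representation is a sequence of 68 two-dimensional landmarks per frame (a common convention I am assuming here, not something confirmed for GAIA); indices 48 and 54 are the mouth corners in the standard 68-point layout.

```python
import torch

def add_smile(motion_seq: torch.Tensor, strength: float = 0.02) -> torch.Tensor:
    """Nudge the mouth-corner landmarks upward in every frame.

    motion_seq: (frames, 68, 2) tensor of normalized (x, y) landmark
    coordinates. In image coordinates a smaller y means "higher", so
    moving the corners up reads as a slight smile.
    """
    edited = motion_seq.clone()
    edited[:, [48, 54], 1] -= strength  # 48/54 = left/right mouth corner
    return edited

# Example: edit a generated motion sequence before decoding it into frames.
motion_seq = torch.rand(100, 68, 2)   # stand-in for the diffusion model's output
smiling_seq = add_smile(motion_seq)
```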
GAIA not only generates head movements automatically from the voice, but also lets users customize them. For example, if a user wants the avatar to nod or shake its head while speaking, they can specify that action, and GAIA applies it without disturbing the synchronization between lip movements and speech.
This increases the flexibility and controllability of avatar video generation, making it suitable for a wider range of application scenarios.