ReSyncer: A versatile unified model for audio-video lip synchronization, speech style transfer, and face swapping

Aug 28, 2024

ReSyncer is a new framework developed by Tsinghua University, Baidu, and S-Lab at Nanyang Technological University. It generates high-fidelity lip-synced videos driven by audio, and the same model also supports personalized fine-tuning, video-driven lip sync, speaking style transfer, and face swapping.

High-fidelity audio-driven lip sync: ReSyncer can create highly realistic videos in which the mouth movements are accurately synchronized with the audio.
Personalized fine-tuning: Users can personalize the generated content to meet different needs.
Video-driven lip sync: In addition to audio, it can also drive synchronization based on the mouth movements of other videos, allowing the characters in the new video to imitate the speaking movements in the existing video.
Speaking style transfer: ReSyncer can transfer one person’s speaking style (such as tone and rhythm) to another person.
Face Swap: It can also replace the face of the speaker in the video while keeping the lip sync with the audio.

ReSyncer solves the following issues:

1. Multifunctional audio and video synchronization

Problem: Existing lip-syncing techniques are usually focused on specific tasks, such as generating lip-sync videos or performing facial editing. These techniques usually require specific training on long video clips, are inefficient, and may have visible defects in the quality of the generated videos.
Solution: The ReSyncer framework trains a single unified model efficiently by reconfiguring the style-based generator and incorporating 3D facial dynamics. The framework not only generates high-fidelity lip-sync videos, but also supports fast personalized fine-tuning, video-driven lip sync, speaking style transfer, and even face swapping.

2. High-quality lip-sync generation

Problem: Many existing methods rely on low-dimensional audio information to directly modify high-dimensional visual data, which may cause unstable mouth movements or other visual defects in the video. In addition, traditional methods are prone to leaving visible artifacts when processing high-quality videos.
Solution: ReSyncer uses 3D facial meshes as an intermediate representation and combines them with a style-injected Transformer (Style-SyncFormer) to generate high-quality, stable lip-sync videos through unified training. Because the audio first drives a mesh rather than the pixels directly, the framework addresses the problem of injecting information across the audio and image domains, and improves the stability and visual quality of the generated results.
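The two-stage idea described above can be pictured with a toy sketch. Everything here is a hypothetical stand-in and not ReSyncer's actual code or API: `audio_to_mesh` plays the role of the audio-to-mesh predictor with a made-up rule (louder audio lowers the mouth vertices), and `mesh_to_frame` plays the role of the generator by simply pairing the mesh with a reference appearance. The only point is the data flow: audio changes the mesh, and the mesh, not the audio, is what reaches the image stage.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vertex = Tuple[float, float, float]


@dataclass
class Mesh:
    vertices: List[Vertex]
    mouth_ids: List[int]  # indices of vertices in the mouth region (assumed)


def audio_to_mesh(audio_frame: List[float], template: Mesh) -> Mesh:
    """Stage 1 stand-in: turn one window of audio samples into a 3D mesh."""
    # Toy rule, not the real model: mean absolute amplitude opens the mouth.
    energy = sum(abs(s) for s in audio_frame) / max(len(audio_frame), 1)
    moved = [
        (x, y - energy, z) if i in template.mouth_ids else (x, y, z)
        for i, (x, y, z) in enumerate(template.vertices)
    ]
    return Mesh(moved, template.mouth_ids)


def mesh_to_frame(mesh: Mesh, reference_frame: str) -> dict:
    """Stage 2 stand-in: combine mesh geometry with a reference appearance."""
    return {"appearance": reference_frame, "geometry": mesh.vertices}


def lip_sync(audio: List[List[float]], template: Mesh, reference_frame: str) -> List[dict]:
    """Audio never touches the image directly; it only passes through the mesh."""
    return [mesh_to_frame(audio_to_mesh(a, template), reference_frame) for a in audio]


template = Mesh(vertices=[(0.0, 0.0, 0.0), (0.0, -1.0, 0.0)], mouth_ids=[1])
frames = lip_sync([[0.0, 0.0], [0.5, -0.5]], template, "reference.png")
```

In this toy run the silent window leaves the mesh unchanged, while the louder window moves only the mouth vertex. A real system would replace both stand-ins with learned networks.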

3. Unified face swapping and lip syncing

Problem: Face swapping and lip syncing are traditionally handled separately. The two tasks require different models and training methods, which is inefficient.
Solution: The ReSyncer framework implements face swapping and lip syncing in a unified model by leveraging 3D facial meshes and style space information. This enables the framework to achieve high-fidelity face swapping while maintaining high-quality lip sync generation, meeting the diverse needs of creating virtual performers.

Main Features

High-fidelity audio-synced lip-sync video generation

ReSyncer can generate lip-synced video from audio, ensuring that the mouth movements accurately match the input sound.
For example, if you have an audio clip and need a video of someone “saying” it, ReSyncer matches the subtle movements of the person’s mouth to the audio, so the resulting video looks as if the person is really speaking the clip.

Personalized fine-tuning

ReSyncer can quickly learn a specific person’s mouth shape and facial movement patterns from only a few seconds of video, so you can use it to create personalized videos for different people. Users can also fine-tune the generated content to suit specific needs.
Suppose you want the tool to learn your mouth shape and facial expressions. It needs only a few seconds of video of you, after which it can generate mouth animation that looks tailor-made for you.

Video-driven lip-sync

In addition to driving lip sync through audio, ReSyncer can also drive synchronization based on mouth movements in other videos, allowing the generated character to imitate the speaking movements in existing videos. This means you can use the movements in one video to control the mouth in another video.
For example, if you have two videos, one of which is someone speaking and the other is another person’s face, ReSyncer can make the person in the second video “speak” according to the mouth movements of the person in the first video, so that the two videos can be seamlessly combined.

Speaking style transfer

ReSyncer can not only match a person’s mouth to audio but also “transfer” one person’s speaking style (such as tone, rhythm, and expression) to another person, so that the generated video presents a specific speaking style.
For example, if a speaker always talks slowly and methodically, ReSyncer can make another person imitate that style, and the resulting video feels as if the second person has learned how the first one speaks.

Face Swap

The framework also supports high-quality face swapping: it can replace the speaker’s face in a video while keeping the mouth movements, expressions, and audio in sync. Users can therefore replace faces in a video seamlessly, which suits a variety of creative scenarios.
Beyond swapping the face, ReSyncer ensures that the mouth shape of the new face still accurately matches the audio, so the “new face” looks as if it originally belonged to this body.

Versatile unified model

A notable feature of ReSyncer is that it implements all of the above functions in a single unified model. Users do not need different tools for different tasks (such as lip sync and face swapping); ReSyncer can complete all of them with one model.
You only need one tool to complete all these complex tasks, saving time and energy.
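One way to picture "one model, many tasks" is that the tasks differ only in which input supplies the motion, the speaking style, and the identity. The sketch below is purely illustrative: the `Request` fields and the `plan` function are hypothetical names invented for this example, and the defaults (style and identity fall back to the target video) are assumptions, not details taken from ReSyncer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Request:
    target_video: str                      # video whose speaker is re-animated
    audio: Optional[str] = None            # audio-driven lip sync
    driving_video: Optional[str] = None    # video-driven lip sync
    style_reference: Optional[str] = None  # speaking style transfer
    source_face: Optional[str] = None      # face swapping


def plan(req: Request) -> dict:
    """Decide which input feeds each role of a single (hypothetical) model."""
    # Exactly one motion source is allowed in this sketch.
    if (req.audio is None) == (req.driving_video is None):
        raise ValueError("provide exactly one of audio or driving_video")
    return {
        "motion_from": req.audio if req.audio is not None else req.driving_video,
        "style_from": req.style_reference or req.target_video,
        "identity_from": req.source_face or req.target_video,
        "background_from": req.target_video,
    }


lip_sync_plan = plan(Request("talk.mp4", audio="speech.wav"))
face_swap_plan = plan(Request("talk.mp4", audio="speech.wav", source_face="actor.png"))
```

Here plain lip sync and face swapping go through the same function; only the `identity_from` entry changes between the two plans.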

Real-time processing and application

ReSyncer can be used in live broadcasts, where it generates video output synchronized with the sound in real time. This means you can make a virtual character “speak” during a live broadcast, with the character’s mouth shape following the sound as it arrives.
If you use an avatar in your live stream, ReSyncer can synchronize the character’s mouth movements with your speech, so that it looks as if the character is speaking live. This is very helpful for live streams that need a virtual host or digital avatar.
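As a rough rule of thumb, "real time" means each frame has to be produced within the time one frame stays on screen. The short sketch below works through that arithmetic. The frame rate and the per-frame processing times are illustrative assumptions; the article does not report ReSyncer's actual speed.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available to produce one frame, in milliseconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def keeps_up(per_frame_ms: float, fps: float) -> bool:
    """True if a model taking per_frame_ms per frame can sustain the stream."""
    return per_frame_ms <= frame_budget_ms(fps)


budget = frame_budget_ms(25)      # a 25 fps stream, assumed for illustration
fast_enough = keeps_up(30.0, 25)  # hypothetical 30 ms per frame
too_slow = keeps_up(45.0, 25)     # hypothetical 45 ms per frame
```

At 25 frames per second the budget is 40 ms per frame, so the hypothetical 30 ms model keeps up and the 45 ms one falls behind.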

……

For more info ↓


More about AI: https://kcgod.com