Emu Video & Emu Edit: new generative AI models by Meta
Meta AI releases two new generative AI models: Emu Video and Emu Edit
Emu Video: a text-to-video generation model based on diffusion. It first generates an image from a text prompt, then generates a video conditioned on both the image and the text.
Emu Edit: an instruction-based image editing model that performs free-form edits from natural-language instructions, including local and global editing, background removal and addition, and color and geometry transformations.
Unlike prior approaches that chained together many models, Emu Video uses just two diffusion models to generate 512x512, four-second videos at 16 frames per second. In human evaluations, Emu Video scores highly on both video quality and faithfulness to the text prompt compared with previous methods.
The core idea behind Emu Edit is to modify only the pixels relevant to the edit request, leaving all other pixels unchanged. To train the model, Meta built a dataset of 10 million synthetic samples, each consisting of an input image, a task description, and a target output image.
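For concreteness, each training sample can be pictured as a simple record like the one below. This is an illustrative sketch only; the field names and task labels are assumptions, not Meta's actual (unreleased) schema.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class EditSample:
    """One synthetic training example for an instruction-based editor.

    Field names are illustrative assumptions; Meta's actual dataset
    schema is not public.
    """
    input_image: np.ndarray   # source image, e.g. (H, W, 3) uint8
    instruction: str          # natural-language task description
    target_image: np.ndarray  # ground-truth edited output
    task_name: str            # e.g. "local_edit", "background_removal"
```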
Emu Video in Detail
Emu Video: a generative model that converts text into video, improving video quality and resolution by first generating an image and then generating the video from it.
1. Text-to-video generation: the model's core function is to generate video content that matches a given text description.
2. Step-by-step generation: rather than generating a video directly from text, Emu Video first generates a still image from the text, then generates a video conditioned on both that image and the original text (see the first sketch after this list).
3. Quality improvement: this factorized approach yields higher-quality, higher-resolution videos, as confirmed by human evaluation.
4. Image animation: because the video stage is conditioned on an image, the model is also well suited to animating a user-supplied still image according to a text prompt.
5. High-resolution video generation: Emu Video uses a specially adjusted noise schedule and multi-stage training to generate high-resolution videos directly (a representative schedule adjustment appears in the second sketch after this list).
6. Quality and text fidelity: compared with previous models, Emu Video performs better on both video quality and fidelity to the original text description.
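To make the factorized pipeline concrete, here is a minimal sketch. The two model classes are hypothetical stand-ins (Emu Video's weights and interfaces are not publicly released); only the resolution, frame rate, and duration come from the figures quoted above.

```python
import numpy as np

RESOLUTION = 512               # 512x512 output
FPS = 16                       # frames per second
DURATION_S = 4                 # four-second clip
NUM_FRAMES = FPS * DURATION_S  # 64 frames total


class TextToImageDiffusion:
    """Hypothetical stand-in for the first diffusion model (text -> image)."""

    def sample(self, prompt: str) -> np.ndarray:
        # Placeholder: a real model would run the reverse diffusion loop here.
        return np.zeros((RESOLUTION, RESOLUTION, 3), dtype=np.float32)


class ImageToVideoDiffusion:
    """Hypothetical stand-in for the second model ((image, text) -> video)."""

    def sample(self, image: np.ndarray, prompt: str) -> np.ndarray:
        # Placeholder: conditioned on BOTH the keyframe and the prompt.
        return np.zeros((NUM_FRAMES, RESOLUTION, RESOLUTION, 3), dtype=np.float32)


def generate_video(prompt: str) -> np.ndarray:
    """Factorized text-to-video: text -> image, then (image, text) -> video."""
    keyframe = TextToImageDiffusion().sample(prompt)         # stage 1
    return ImageToVideoDiffusion().sample(keyframe, prompt)  # stage 2
```

Conditioning the second stage on both signals is also what makes item 4 possible: swap the generated keyframe for a user's photo and the same model animates it.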
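The summary above does not spell out what "adjusted noise schedule" means. One representative adjustment from the diffusion literature, rescaling the schedule to zero terminal SNR (Lin et al., 2023) so that sampling truly starts from pure noise, is sketched below; treat it as an example of the kind of fix involved, not as Emu Video's exact recipe.

```python
import torch


def rescale_zero_terminal_snr(betas: torch.Tensor) -> torch.Tensor:
    """Rescale a diffusion beta schedule so the final timestep has zero
    signal-to-noise ratio (Lin et al., 2023)."""
    # Square roots of the cumulative signal-retention products.
    alphas_bar_sqrt = torch.cumprod(1.0 - betas, dim=0).sqrt()

    first, last = alphas_bar_sqrt[0].clone(), alphas_bar_sqrt[-1].clone()

    # Shift so the final value is exactly 0, then rescale so the
    # first value is unchanged.
    alphas_bar_sqrt = (alphas_bar_sqrt - last) * first / (first - last)

    # Convert the rescaled cumulative products back into betas.
    alphas_bar = alphas_bar_sqrt ** 2
    alphas = torch.cat([alphas_bar[:1], alphas_bar[1:] / alphas_bar[:-1]])
    return 1.0 - alphas
```

The intuition: without such a correction, the last timestep still retains a sliver of signal, so a model trained on it never quite matches the pure-noise input it receives at sampling time, a mismatch that becomes more visible at high resolution.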
Emu Edit in Detail
Emu Edit is an instruction-based image editing model that can understand and execute a wide range of complex editing instructions while preserving image quality.
1. Multi-task editing: it handles many kinds of image editing tasks, from region-level and free-form edits to computer vision tasks such as object detection and segmentation.
2. Task-aware processing: beyond handling many tasks, Emu Edit steers its own editing process using a learned task embedding, a per-task vector that helps the model interpret and execute the user's instruction more accurately (sketched in the first example after this list).
3. Fast adaptation to new tasks: even for tasks Emu Edit was never directly trained on, such as super-resolution or contour detection, it can adapt quickly from a small number of examples.
4. Sequential editing without quality drift: to keep image quality stable across multiple rounds of editing, Emu Edit applies a per-pixel thresholding step after each edit, which stops reconstruction and numerical errors from accumulating (see the second example after this list).
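Here is a minimal sketch of what a learned task embedding can look like: one trainable vector per supported task, added to the model's conditioning signal. All class and variable names are assumptions for illustration; in the real system this would sit inside a diffusion editing model, not a toy module.

```python
import torch
import torch.nn as nn


class TaskConditioning(nn.Module):
    """Illustrative task-embedding module (names are assumptions)."""

    def __init__(self, num_tasks: int, dim: int = 768):
        super().__init__()
        # One learned vector per editing task type.
        self.task_embed = nn.Embedding(num_tasks, dim)

    def forward(self, cond: torch.Tensor, task_id: torch.Tensor) -> torch.Tensor:
        # cond: (batch, dim) conditioning features (e.g. pooled text encoding)
        # task_id: (batch,) integer task labels
        return cond + self.task_embed(task_id)


# Item 3 above in this picture: to adapt to an unseen task, freeze the
# editor and optimize only a fresh embedding vector on a few examples.
new_task_vec = nn.Parameter(torch.zeros(768))
optimizer = torch.optim.Adam([new_task_vec], lr=1e-3)
```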
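And a sketch of the per-pixel thresholding idea from item 4: after each round of editing, pixels that barely changed are copied back from the source image, so tiny reconstruction errors cannot snowball over successive edits. The distance metric and the threshold value are illustrative assumptions, not Emu Edit's published settings.

```python
import numpy as np


def copy_back_unchanged(source: np.ndarray,
                        edited: np.ndarray,
                        tau: float = 0.03) -> np.ndarray:
    """Keep only meaningfully edited pixels (illustrative threshold).

    source, edited: float arrays in [0, 1] with shape (H, W, 3).
    """
    # Per-pixel change magnitude, averaged over the color channels.
    change = np.abs(edited - source).mean(axis=-1)
    barely_touched = change < tau

    out = edited.copy()
    out[barely_touched] = source[barely_touched]
    return out
```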