PIXART-α: a Transformer-based text-to-image generation model
PIXART-α is a Transformer-based text-to-image (T2I) generation model.
Its image generation quality is comparable to current state-of-the-art image generators such as Imagen, SDXL, and Midjourney, reaching a standard close to commercial applications, and it supports high-resolution image synthesis up to 1024px.
The model aims to solve a series of common problems in the T2I field, such as misalignment between text and images, low training efficiency, and high training costs.
Compared with other large-scale T2I models, PIXART-α's training speed and cost are dramatically lower: it requires only 10.8% of the training time of Stable Diffusion v1.5 (675 vs. 6,250 A100 GPU days, per the paper), saving nearly $300,000 in costs and reducing CO2 emissions by 90%. Compared with the larger SOTA model RAPHAEL, the training cost is only about 1%.
Experiments show that PIXART-α performs well in terms of image quality, artistry, and semantic control.
Projects and demos: https://pixart-alpha.github.io
Paper: https://arxiv.org/abs/2310.00426
PDF: https://arxiv.org/pdf/2310.00426.pdf
Working principle:
1. Training strategy decomposition: PIXART-α splits training into three independent stages that optimize pixel dependency, text-image alignment, and image aesthetic quality respectively. This decomposition helps the model learn each aspect more effectively.
2. Efficient T2I Transformer: based on the Diffusion Transformer (DiT), the model adds cross-attention modules that inject textual conditions, allowing it to generate images conditioned on the given text (see the first sketch after this list).
3. High-information-content data: for better text-image alignment, PIXART-α emphasizes the importance of concept density in text-image pairs. It uses a large vision-language model to automatically generate dense pseudo-captions that assist text-image alignment learning (see the second sketch after this list).
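To make item 2 concrete, here is a minimal PyTorch sketch of a DiT-style block extended with a cross-attention layer for injecting text tokens. The layer sizes and names are illustrative assumptions, and timestep conditioning (DiT's adaLN) is omitted for brevity; this is not PIXART-α's actual configuration.

```python
import torch
import torch.nn as nn

class CrossAttnDiTBlock(nn.Module):
    """Sketch of a DiT block with added cross-attention over text tokens.

    Self-attention mixes the latent image (patch) tokens; cross-attention
    then injects the text condition. Sizes are illustrative only, and
    timestep conditioning (adaLN) is omitted for brevity.
    """

    def __init__(self, hidden_dim=1152, text_dim=1152, num_heads=16):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.self_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(hidden_dim)
        # Cross-attention: queries from image tokens, keys/values from text.
        self.cross_attn = nn.MultiheadAttention(
            hidden_dim, num_heads, kdim=text_dim, vdim=text_dim, batch_first=True
        )
        self.norm3 = nn.LayerNorm(hidden_dim)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, 4 * hidden_dim),
            nn.GELU(),
            nn.Linear(4 * hidden_dim, hidden_dim),
        )

    def forward(self, x, text_tokens):
        # x: (batch, num_patches, hidden_dim) latent image tokens
        # text_tokens: (batch, num_text_tokens, text_dim) encoded caption
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        x = x + self.cross_attn(h, text_tokens, text_tokens, need_weights=False)[0]
        return x + self.mlp(self.norm3(x))

# Toy shapes: 2 images as 256 latent patches, captions of 120 text tokens.
block = CrossAttnDiTBlock()
out = block(torch.randn(2, 256, 1152), torch.randn(2, 120, 1152))
print(out.shape)  # torch.Size([2, 256, 1152])
```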
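And for item 3, a sketch of the pseudo-captioning idea. The paper uses a large vision-language model for the dense relabeling; the caption_image helper below is a hypothetical stand-in for whatever VLM is plugged in.

```python
import json
from pathlib import Path

def caption_image(image_path: Path) -> str:
    """Hypothetical stand-in for a large vision-language model prompted
    to describe the image in dense, concept-rich detail."""
    raise NotImplementedError("plug in your vision-language model here")

def build_pseudo_caption_dataset(image_dir: str, out_file: str) -> None:
    # Relabel every image with a dense pseudo-caption, raising the
    # "concept density" of the resulting text-image pairs.
    records = [
        {"image": str(p), "caption": caption_image(p)}
        for p in sorted(Path(image_dir).glob("*.jpg"))
    ]
    Path(out_file).write_text(json.dumps(records, indent=2))
```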
Training strategy:
1. Pixel dependency learning: This is the first stage of training and focuses on how to generate pixels that are semantically coherent and reasonable in a single image.
2. Text-image alignment learning: This is the second stage of training, focusing on how to achieve accurate alignment between text concepts and images.
3. Aesthetic quality optimization: after the basic training is complete, the model is further fine-tuned to improve the aesthetic quality of the generated images (a staged-training sketch follows this list).
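To show how the three stages could be chained, here is a deliberately simple driver loop; the dataset names, step counts, and the train_steps placeholder are illustrative assumptions, not the paper's actual recipe.

```python
# Illustrative staged-training driver; everything named here is a placeholder.
STAGES = [
    # (stage name, training data, what the stage optimizes)
    ("pixel_dependency", "class_conditional_images", "coherent pixel statistics"),
    ("text_image_alignment", "dense_pseudo_caption_pairs", "prompt faithfulness"),
    ("aesthetic_quality", "high_aesthetic_subset", "visual appeal"),
]

def train_steps(model, dataset, num_steps):
    """Placeholder for an ordinary diffusion training loop."""
    ...

def train_pixart_style(model, steps_per_stage=10_000):
    # Each stage reuses the weights from the previous one, so later stages
    # only have to learn their own objective.
    for name, dataset, goal in STAGES:
        print(f"stage {name}: training on {dataset} to improve {goal}")
        train_steps(model, dataset, steps_per_stage)
```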
Let me explain how PIXART-α works through a simple example.
Example: Generating an image of “a dog playing Frisbee in the park”
Step 1: Pixel dependency learning
At this stage, the model draws on pixel dependency learning to understand how the elements "dog," "park," and "Frisbee" should look in an image, generating a basic image layout containing the underlying pixel representation of these elements.
Step 2: Text-image alignment learning
Next, the model refines this base image to ensure it aligns closely with the input text "a dog playing Frisbee in the park." For example, it adjusts the dog's position so that it actually looks like it is "in the park" and "playing Frisbee."
Step 3: Aesthetic quality optimization
Finally, the model further refines the image, optimizing color, texture, and other visual elements to improve the overall aesthetic quality of the image.
Output:
Ultimately, the model produces a high-quality, high-resolution image of a dog playing Frisbee in a park that closely matches the input text.
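If you want to reproduce this example yourself, PIXART-α checkpoints are available through Hugging Face diffusers. The snippet below is a minimal sketch assuming a recent diffusers release (which ships PixArtAlphaPipeline) and a CUDA GPU:

```python
import torch
from diffusers import PixArtAlphaPipeline

# fp16 keeps the 1024px model's memory footprint manageable on one GPU.
pipe = PixArtAlphaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-XL-2-1024-MS", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

image = pipe(prompt="a dog playing Frisbee in the park").images[0]
image.save("dog_frisbee.png")
```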
Advantages:
Efficiency: because training is decomposed into focused stages, each stage converges faster, greatly reducing training cost and time compared with other models.
Customizability: because the model understands and aligns the multiple elements described in the input text, it can produce highly customized images.