Kuaishou releases “KeLing” video model, which follows a technical route similar to Sora’s and can generate more than 120 seconds of 1080p video

Brain Titan
Jun 9, 2024

--

Kuaishou’s latest domestically developed video generation model, “KeLing”, follows a technical route similar to Sora’s and combines it with several self-developed innovations. It can generate videos more than 120 seconds long at resolutions up to 1080p, and can accurately model complex motion and physical properties.

Main features

1. High-quality video generation

  • Duration and frame rate : KeLing supports generating long videos of up to 2 minutes at 30 fps.
  • Resolution : The resolution of the generated video is up to 1080p, with clear and delicate picture quality.
  • Aspect ratio : Supports video generation with multiple aspect ratios, including vertical videos, to suit different usage scenarios and platforms.
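
To put these figures in perspective, the arithmetic can be sketched in a few lines of Python. The helper names are illustrative only, and reading “1080p” as a 1080-pixel short side (1920×1080 for 16:9, 1080×1920 for vertical 9:16) is an assumption, not something the announcement specifies.

```python
def frame_count(duration_s: int, fps: int) -> int:
    """Total frames in a clip of the given duration and frame rate."""
    return duration_s * fps


def frame_size(aspect_w: int, aspect_h: int, short_side: int = 1080) -> tuple[int, int]:
    """Return (width, height) for an aspect ratio with a fixed short side.

    Assumption: "1080p" means the shorter edge is 1080 pixels.
    """
    if aspect_w >= aspect_h:  # landscape or square
        return short_side * aspect_w // aspect_h, short_side
    return short_side, short_side * aspect_h // aspect_w  # vertical


frames = frame_count(120, 30)   # a 2-minute clip at 30 fps
landscape = frame_size(16, 9)   # widescreen output
vertical = frame_size(9, 16)    # vertical output for short-video feeds
print(frames, landscape, vertical)
```

A 2-minute clip at 30 fps is 3,600 frames, so the model has to keep subjects and scenes consistent across thousands of frames.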

2. Physical world simulation

  • Realistic physical properties : The KeLing model can simulate real-world physical properties such as gravity, light and shadow, reflections, and liquid flow.
  • Detailed depiction : The depiction of details such as object movement, surface reflection, shadow changes, etc. is very accurate, providing a realistic visual experience.

3. Complex motion characterization

  • Precise Motion Modeling : Ability to accurately model complex and large-scale motion scenes, such as animals running at high speed, astronauts walking on the moon, etc.
  • Continuity : The generated video images are coherent, the movements are smooth, and the subtle changes during the movement can be realistically reproduced.

4. Various control information input

  • Control information input : Users can supply control information such as camera movement, frame rate, and edge, keypoint, or depth information, giving them rich control over the generated content.
  • Text prompt optimization : A dedicated language model expands and refines the prompts that users enter, improving the quality of the generated result.

Technical realization

1. Model design

  • Sora-like architecture : It adopts a Sora-like DiT (Diffusion Transformer) structure, using a Transformer in place of the convolutional network found in traditional diffusion models to improve generation capability and scalability.
  • 3D VAE network : A self-developed 3D VAE compresses video jointly in space and time and improves video reconstruction quality.
  • Full Attention Mechanism : A 3D Attention mechanism is designed for spatiotemporal modeling, which can accurately model complex spatiotemporal motion while taking into account computational efficiency.
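
The trade-off between full spatiotemporal attention and computational cost can be illustrated with a rough token count. This is a minimal sketch, not KeLing's actual configuration: the compression factors (4x temporal, 8x spatial), the 2×2 patch size, and all names below are hypothetical values chosen for illustration. It compares the number of query-key pairs under full 3D attention with a factorized scheme that attends over space and time separately.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Grid:
    frames: int
    height: int
    width: int

    @property
    def size(self) -> int:
        return self.frames * self.height * self.width


def vae_compress(video: Grid, t_factor: int, s_factor: int) -> Grid:
    """Shape of the latent after joint temporal and spatial downsampling."""
    return Grid(
        math.ceil(video.frames / t_factor),
        video.height // s_factor,
        video.width // s_factor,
    )


def patchify(latent: Grid, patch: int) -> Grid:
    """Token grid after grouping patch x patch latent pixels into one token."""
    return Grid(latent.frames, latent.height // patch, latent.width // patch)


def full_3d_pairs(tokens: Grid) -> int:
    """Query-key pairs when every token attends to every other token."""
    return tokens.size ** 2


def factorized_pairs(tokens: Grid) -> int:
    """Query-key pairs for separate spatial and temporal attention passes."""
    per_frame = tokens.height * tokens.width
    spatial = tokens.frames * per_frame ** 2   # each frame attends within itself
    temporal = per_frame * tokens.frames ** 2  # each position attends across frames
    return spatial + temporal


# Hypothetical settings: 4x temporal, 8x spatial compression, 2x2 patches.
video = Grid(frames=16, height=64, width=64)
tokens = patchify(vae_compress(video, t_factor=4, s_factor=8), patch=2)
print(tokens.size, full_3d_pairs(tokens), factorized_pairs(tokens))
```

Full 3D attention lets any token interact directly with any other across space and time, but its cost grows with the square of the total token count. That is why compressing the video first and keeping the attention implementation efficient both matter.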

2. Data assurance

  • Labeling system : A complete labeling system has been built to select and adjust the training data at a fine-grained level, ensuring high-quality video data.
  • Video description model : A video description model was developed to generate accurate, detailed, and structured video descriptions and improve the responsiveness to text commands.
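
Label-based data selection of the kind described above can be sketched as follows. Everything here is hypothetical: the tag names, the quality score, and the thresholds are illustrative assumptions, since the article does not describe the actual labeling schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    clip_id: str
    tags: frozenset
    quality: float  # hypothetical quality score in [0, 1]


def select(clips, required_tags, min_quality):
    """Keep clips that carry all required tags and meet the quality bar."""
    required = frozenset(required_tags)
    return [c for c in clips if required <= c.tags and c.quality >= min_quality]


def tag_distribution(clips):
    """Count how often each tag appears, to inspect the data mix."""
    return Counter(tag for c in clips for tag in c.tags)


clips = [
    Clip("a", frozenset({"animal", "fast-motion"}), 0.9),
    Clip("b", frozenset({"animal"}), 0.4),
    Clip("c", frozenset({"liquid", "fast-motion"}), 0.8),
]
kept = select(clips, {"fast-motion"}, min_quality=0.5)
print([c.clip_id for c in kept], tag_distribution(kept))
```

Inspecting the tag distribution after filtering is one simple way to see whether a category is over- or under-represented before training.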

3. Computational efficiency

  • Distributed training cluster : Training runs on a distributed cluster, and hardware utilization is significantly improved through operator optimization, recomputation strategy optimization, and other means.
  • Phased training strategy : A phased training strategy is adopted, first enhancing the model capabilities through a large amount of data in the low-resolution stage, and then improving the detail performance in the high-resolution stage.
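
A phased schedule of this kind can be expressed as a list of stages looked up by training step. This is a minimal sketch under stated assumptions: the stage names, resolutions, and step counts are invented for illustration and are not KeLing's real settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    short_side: int  # training resolution (pixels on the short edge)
    steps: int       # number of optimizer steps spent in this stage


# Hypothetical two-phase schedule: most steps at low resolution,
# followed by a shorter high-resolution phase for detail.
SCHEDULE = [
    Stage("low-res pretraining", 256, 8000),
    Stage("high-res refinement", 1080, 2000),
]


def stage_at(step: int, schedule=SCHEDULE) -> Stage:
    """Return the stage that a given global training step falls into."""
    if step < 0:
        raise ValueError("step must be non-negative")
    boundary = 0
    for stage in schedule:
        boundary += stage.steps
        if step < boundary:
            return stage
    raise ValueError("step is beyond the end of the schedule")


print(stage_at(0).name, stage_at(8000).name)
```

Spending most steps at low resolution keeps each step cheap while the model learns broad content and motion, and the high-resolution phase is reserved for fine detail.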

Some examples

Large-amplitude, physically plausible motion

Video generation up to 2 minutes long

Simulating physical world properties

Strong concept combination ability

Movie-quality image generation

Supports flexible output video aspect ratios

Expression and body driving

Based on self-developed 3D face and body reconstruction technology, combined with background stabilization and retargeting modules, KeLing can drive both facial expressions and full-body motion. With only a single full-body photo, users can try the vivid “singing and dancing” feature.
