Meta AI releases Sapiens, a family of vision models designed to analyze and understand people and their actions in images and videos
Meta Reality Labs has developed a family of artificial intelligence models called “Sapiens”: high-resolution models for human-centric vision tasks, designed to analyze and understand people and their actions in images and videos. These tasks include estimating human pose, segmenting body parts, estimating depth, and predicting surface normals (the orientation of surfaces). The models were pretrained on more than 300 million human images and perform well across a wide range of complex environments.
- 2D Pose Estimation: Detecting human body keypoints and estimating pose in 2D images.
- Body Part Segmentation: Precisely segmenting the human body in an image into its constituent parts, such as hands, feet, and head.
- Depth Estimation: Predicting the depth of each point in an image, which helps in understanding distances and layout in 3D space.
- Surface Normal Prediction: Inferring the orientation of surfaces in an image, which is important for understanding an object’s shape and material.
These models can handle very high-resolution images and perform well with very little labeled data, or even with entirely synthetic data, which makes them practical in real-world applications where labeled data is scarce.
In addition, the Sapiens models are simple in design and scale well: as the parameter count grows, performance improves consistently across tasks. On multiple human-centric vision benchmarks, Sapiens surpasses existing baseline models.
Application Scenarios
The Sapiens models are mainly used in several key human-centric vision areas. Their application scenarios and uses include:
1. 2D Pose Estimation
- Application scenarios: 2D pose estimation is a key technology in video surveillance, virtual reality, motion capture, medical rehabilitation, and other fields, where it recognizes human posture, movement, and gestures.
- Function: Sapiens can accurately detect and predict human body keypoints (such as joints and facial features) and works well even in multi-person scenes, giving it broad potential in motion analysis and human-computer interaction.
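As a sketch of the kind of post-processing a pose estimator’s output typically goes through, the snippet below decodes keypoint coordinates from per-joint confidence heatmaps by taking each map’s peak. This is a generic illustration of heatmap-based decoding, not Sapiens’ exact decoder:

```python
# Illustrative sketch: decode 2D keypoints from per-joint heatmaps by
# taking each heatmap's highest-confidence cell (a common pose-estimation
# post-processing step; not necessarily Sapiens' exact method).

def decode_keypoints(heatmaps):
    """heatmaps: list of 2D grids (one per joint), values = confidence.
    Returns one (x, y, score) triple per joint: the peak of each map."""
    keypoints = []
    for hm in heatmaps:
        best = (0, 0, hm[0][0])
        for y, row in enumerate(hm):
            for x, v in enumerate(row):
                if v > best[2]:
                    best = (x, y, v)
        keypoints.append(best)
    return keypoints

# Toy example: two 3x3 heatmaps with obvious peaks.
hms = [
    [[0.0, 0.1, 0.0], [0.1, 0.9, 0.1], [0.0, 0.1, 0.0]],  # peak at (1, 1)
    [[0.8, 0.1, 0.0], [0.1, 0.0, 0.0], [0.0, 0.0, 0.0]],  # peak at (0, 0)
]
print(decode_keypoints(hms))  # → [(1, 1, 0.9), (0, 0, 0.8)]
```

Real decoders usually refine the integer peak with sub-pixel interpolation, omitted here for brevity.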
2. Body Part Segmentation
- Application scenarios: Accurate human body part segmentation is a basic technology in fields such as medical image analysis, virtual fitting, animation production, and augmented reality (AR).
- Function: The Sapiens model can accurately classify each pixel in an image into different body parts (such as upper body, lower body, and facial details). This supports more sophisticated virtual clothing fitting, medical diagnostic tools, and more natural virtual character animation.
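At inference time, part segmentation amounts to assigning each pixel its highest-scoring class. A minimal sketch of that per-pixel argmax, using three toy classes in place of Sapiens’ 28 part categories:

```python
# Illustrative sketch: per-pixel segmentation as an argmax over class
# scores. Toy classes: 0 = background, 1 = "head", 2 = "hand".

def segment(logits):
    """logits[c][y][x] = score for class c at pixel (y, x).
    Returns a label map with the highest-scoring class per pixel."""
    n_classes = len(logits)
    h, w = len(logits[0]), len(logits[0][0])
    labels = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            labels[y][x] = max(range(n_classes), key=lambda c: logits[c][y][x])
    return labels

# Toy 2x2 image with three class-score maps.
logits = [
    [[0.9, 0.2], [0.1, 0.8]],   # background scores
    [[0.05, 0.7], [0.2, 0.1]],  # "head" scores
    [[0.05, 0.1], [0.7, 0.1]],  # "hand" scores
]
print(segment(logits))  # → [[0, 1], [2, 0]]
```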
3. Depth Estimation
- Application scenarios: Depth estimation is crucial in autonomous driving, robot navigation, 3D modeling and virtual reality, helping to understand the three-dimensional structure in the scene.
- Function: The Sapiens model can infer depth information from a single image and is particularly strong in human-centric scenes. By generating high-quality depth maps, it supports applications that require an understanding of spatial relationships, such as obstacle detection in autonomous driving and robot path planning.
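Monocular depth models often predict relative rather than metric depth, so a common step before visualizing or comparing depth maps is min-max normalization into [0, 1]. This is generic practice, not a documented Sapiens-specific step:

```python
# Illustrative sketch: min-max normalize a relative depth map to [0, 1],
# a common step before visualization or downstream comparison.

def normalize_depth(depth):
    """depth: 2D grid of relative depth values. Returns the map rescaled
    so the nearest point is 0.0 and the farthest is 1.0."""
    flat = [v for row in depth for v in row]
    lo, hi = min(flat), max(flat)
    scale = (hi - lo) or 1.0  # guard against a constant-depth map
    return [[(v - lo) / scale for v in row] for row in depth]

depth = [[2.0, 4.0], [6.0, 10.0]]
print(normalize_depth(depth))  # → [[0.0, 0.25], [0.5, 1.0]]
```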
4. Surface Normal Prediction
- Application scenarios: Surface normal prediction is widely used in 3D rendering, physical simulation, reverse engineering, and lighting processing.
- Function: The Sapiens model can infer the surface normal direction of each pixel in the image, which is essential for generating high-quality 3D models and achieving more realistic lighting effects. This function is particularly important in applications that require precise surface features, such as virtual reality and digital content creation.
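For intuition about what a surface normal is, the sketch below approximates normals from a depth map by finite differences, using the standard construction n ∝ (-∂z/∂x, -∂z/∂y, 1). This is a generic geometric illustration; Sapiens predicts normals directly with a learned head rather than deriving them from depth:

```python
import math

# Illustrative sketch: approximate the surface normal at pixel (y, x) of a
# depth map via forward finite differences, then normalize to unit length.

def normal_from_depth(depth, y, x):
    """depth: 2D grid of depth values. Returns the unit normal (nx, ny, nz)
    at (y, x), assuming unit pixel spacing."""
    dzdx = depth[y][x + 1] - depth[y][x]
    dzdy = depth[y + 1][x] - depth[y][x]
    n = (-dzdx, -dzdy, 1.0)
    mag = math.sqrt(sum(c * c for c in n))
    return tuple(c / mag for c in n)

# A flat, fronto-parallel surface: every normal points straight at the
# camera, i.e. along +z.
flat = [[1.0, 1.0], [1.0, 1.0]]
print(normal_from_depth(flat, 0, 0))  # unit normal along +z
```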
5. Common Human Vision Tasks
- Application scenarios: The Sapiens model can be applied to any scenario that requires understanding and analyzing human images, including social media content analysis, security monitoring, sports science research, and digital human generation.
- Function: Due to its strong performance on multiple tasks, Sapiens can be used as a general base model to support various human-centric vision tasks, thereby accelerating the development of related applications.
6. Virtual Reality and Augmented Reality
- Application scenarios: Virtual reality (VR) and augmented reality (AR) applications require highly accurate understanding of human posture and structure to achieve an immersive experience.
- Function: Sapiens supports the creation of realistic human images in virtual environments by providing high-resolution, accurate human pose and part segmentation, and can dynamically adapt to changes in user movements.
7. Medical and Health
- Application scenarios: In medical imaging and rehabilitation training, accurate posture detection and human segmentation can be used for patient monitoring, treatment tracking and rehabilitation guidance.
- Function: The Sapiens model helps medical professionals analyze patients’ posture and movement to provide more personalized and effective treatment plans.
Technical Methods
1. Dataset and preprocessing
- Humans-300M dataset: Sapiens is pretrained on Humans-300M, a large-scale dataset of 300 million “in-the-wild” human images, carefully curated to remove watermarks, text, artistic depictions, and other unnatural elements.
- Data filtering: A pretrained person bounding-box detector filters the images, keeping only those with a detection score above 0.9 and a bounding box larger than 300 pixels, to ensure data quality.
- Multi-view capture and annotation: To accurately capture human poses and body parts, images are acquired with multi-view capture setups, and 308 keypoints and 28 body-part categories are manually annotated to produce high-quality labeled data.
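The data-filtering rule described above can be sketched as follows; the detection layout (a score plus an (x1, y1, x2, y2) box) and the reading of “larger than 300 pixels” as a minimum on each box side are assumptions for illustration:

```python
# Sketch of the curation rule: keep a person detection only if its score
# exceeds 0.9 and its bounding box exceeds 300 px on each side.
# The detection format used here is an assumption, not Sapiens' actual one.

MIN_SCORE = 0.9
MIN_SIDE = 300

def keep(detection):
    """detection: (score, (x1, y1, x2, y2)). True if it passes the filter."""
    score, (x1, y1, x2, y2) = detection
    return score > MIN_SCORE and (x2 - x1) > MIN_SIDE and (y2 - y1) > MIN_SIDE

detections = [
    (0.95, (0, 0, 400, 600)),  # confident, large box: kept
    (0.95, (0, 0, 200, 600)),  # box too narrow: dropped
    (0.50, (0, 0, 400, 600)),  # low confidence: dropped
]
print([keep(d) for d in detections])  # → [True, False, False]
```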
2. Model Architecture
- Vision Transformers (ViT): The Sapiens models use the Vision Transformer (ViT) architecture, which performs strongly in image classification and understanding tasks. By dividing an image into fixed-size, non-overlapping patches, the model can handle high-resolution inputs and perform fine-grained reasoning.
- Encoder-Decoder Architecture: The model follows an encoder-decoder design. The encoder extracts features from the image and is initialized with the pretrained weights, while the decoder is a lightweight, task-specific module that is randomly initialized and fine-tuned together with the encoder.
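To make the ViT patching above concrete: an H × W image split into P × P patches yields (H/P) × (W/P) tokens, which is why token count, and therefore compute, grows quickly with resolution. The resolution and patch size below are illustrative values, not confirmed Sapiens hyperparameters:

```python
# Illustrative sketch: how many tokens a ViT produces for a given image
# and patch size. 1024 px and patch size 16 are example values only.

def num_patches(h, w, p):
    """Number of non-overlapping p x p patches tiling an h x w image."""
    assert h % p == 0 and w % p == 0, "image must tile evenly into patches"
    return (h // p) * (w // p)

print(num_patches(1024, 1024, 16))  # → 4096
```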
……
For more info ↓
More about AI: https://kcgod.com