Pegasus-1: A large model that can read videos
Twelve Labs launches Pegasus-1, an advanced video-language foundation model with approximately 80 billion parameters.
It can process video content ranging from 10 seconds to several hours long, understanding and parsing the video to generate more comprehensive and accurate text descriptions.
It can comprehensively understand the people, objects, scenes, background music, dialogue, and more that appear in the video.
Main features:
1. Multimodal understanding:
Pegasus-1 processes not only the visual information in a video but also its audio and speech. This lets it understand the content more comprehensively, including the people, objects, scenes, background music, dialogue, and more that appear in the video.
2. Efficient long video processing:
The model is optimized for managing and processing videos of varying lengths, from as short as 10 seconds to hours of content.
3. Video-text generation:
With a single API call, developers can have Pegasus-1 generate specific text output from their video data, including but not limited to video summaries, key-point extraction, and automatically generated tags and titles (a rough sketch of such a call appears after this list).
4. Advanced performance metrics:
Pegasus-1 shows relative improvements of 61% and 47% over existing state-of-the-art models on the MSR-VTT dataset and the Video-ChatGPT video description dataset, respectively.
5. API access:
Pegasus-1 provides a set of flexible video-to-text APIs that can be used for a variety of downstream tasks.
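As a rough illustration of the "single API call" workflow mentioned above, the sketch below shows what such a request could look like over plain HTTP. The base URL, header, request fields, and response key are all assumptions made for illustration only; the real interface is defined in Twelve Labs' API documentation.

```python
# Minimal sketch of a single video-to-text call. Endpoint, fields, and
# response shape are illustrative assumptions, not the documented API.
import requests

API_KEY = "your-api-key"        # hypothetical credential
VIDEO_ID = "indexed-video-id"   # assumes the video has already been uploaded and indexed

resp = requests.post(
    "https://api.twelvelabs.example/v1/summarize",  # placeholder URL
    headers={"x-api-key": API_KEY},
    json={"video_id": VIDEO_ID, "type": "summary"},
    timeout=60,
)
resp.raise_for_status()
print(resp.json().get("summary", ""))               # assumed response field
```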
Detailed introduction: https://t.co/xnm0EtqpYW
How Pegasus-1 works:
Unlike many approaches that frame video understanding as an image or speech understanding problem, Twelve Labs adopts a “video first” strategy.
This strategy rests on four core principles: efficient long-form video processing, multimodal understanding, video-native embeddings, and deep alignment between video and language embeddings.
The model consists of three main components: a video encoder, a video-language alignment model, and a language decoder.
1. Video encoder: Extracts visual, audio, and speech information from the video and produces video embeddings. It evaluates video frames and their temporal relationships to capture the relevant visual information while also processing the audio signal and speech.
2. Video-language alignment model: This component bridges the video embeddings and the language model, ensuring that the language model can interpret video embeddings in much the same way it interprets text tokens.
3. Language decoder: Leveraging its extensive knowledge base, the decoder interprets the aligned embeddings according to the user's prompt and decodes this information into coherent, easy-to-read text.
These three components are trained together to enable the model to more accurately understand and generate text related to video content.
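The report describes the pipeline only at this level of abstraction; the following is a minimal PyTorch-style sketch of how the three components could fit together. All module names, layer sizes, and the use of learned latent queries for alignment are illustrative assumptions, not Pegasus-1's actual design.

```python
# Sketch of the three-component pipeline: video encoder -> alignment -> language decoder.
# Everything here (dimensions, layers, latent-query alignment) is assumed for illustration.
import torch
import torch.nn as nn


class VideoEncoder(nn.Module):
    """Turns fused frame/audio/speech features into a sequence of video embeddings."""

    def __init__(self, feat_dim=1024, embed_dim=768):
        super().__init__()
        self.proj = nn.Linear(feat_dim, embed_dim)
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, multimodal_feats):          # (batch, time, feat_dim)
        return self.temporal(self.proj(multimodal_feats))


class VideoLanguageAlignment(nn.Module):
    """Maps video embeddings into the language model's token-embedding space."""

    def __init__(self, embed_dim=768, lm_dim=4096, num_latents=32):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, embed_dim))
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)
        self.to_lm = nn.Linear(embed_dim, lm_dim)

    def forward(self, video_embeds):              # (batch, time, embed_dim)
        batch = video_embeds.size(0)
        queries = self.latents.unsqueeze(0).expand(batch, -1, -1)
        aligned, _ = self.attn(queries, video_embeds, video_embeds)
        return self.to_lm(aligned)                # (batch, num_latents, lm_dim)


if __name__ == "__main__":
    feats = torch.randn(1, 120, 1024)             # e.g. 120 time steps of fused features
    aligned = VideoLanguageAlignment()(VideoEncoder()(feats))
    print(aligned.shape)                          # torch.Size([1, 32, 4096])
```

In the full system, these aligned embeddings would be prepended to the prompt tokens and fed to a pretrained language decoder, which generates the output text.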
Pegasus-1's dataset:
Twelve Labs has a collection of over 300 million diverse, curated video-text pairs, making it one of the largest video-text corpora used to train video-language foundation models.
Initial training subset: The technical report is based on an initial training run containing 35 million video-text pairs and over 1 billion image-text pairs. This subset accounts for approximately 10% of the total data set.
This dataset is not only large but also high-quality and diverse, which helps Pegasus-1 achieve advanced performance on multiple evaluation metrics.
Introduction to MSR-VTT data set: https://t.co/UCpPKsMqAB
Video-ChatGPT video description dataset:
Pegasus-1 is not just a single model, but a complete solution that provides a series of APIs to meet different video-to-text needs.
Gist API: This API is designed to generate concise text output, such as the title of the video, the topic or related tags (hashtags). It comes preloaded with relevant prompts, so it’s plug-and-play and requires no additional input from the user.
Summary API: This API is designed to generate summaries, chapters, and highlights of videos. Like the Gist API, it comes preloaded with relevant prompts, so it's plug-and-play. This is useful for scenarios where you need to quickly understand the main content of a video.
Generate API: This is a more flexible API that allows users to provide specific formats and styles as prompts. Whether you're generating a simple bulleted list, a more complex report, or even creative lyrics based on video content, the Generate API can handle it.
Working together, these APIs enable Pegasus-1 to understand not only the visual information in videos but also their audio and speech, and to generate more comprehensive and accurate text descriptions. Together they form a flexible solution for scenarios ranging from simple video tag generation to complex video summarization and description.
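To make the division of labor between the three APIs concrete, here is a hedged sketch of how a caller might use them. The endpoint names, request fields, and response shapes below are assumptions chosen to mirror the descriptions above, not the documented interface; the point is only that Gist and Summary need no prompt, while Generate takes a free-form one.

```python
# Illustrative comparison of the three (assumed) video-to-text endpoints.
import requests

BASE_URL = "https://api.twelvelabs.example/v1"   # placeholder base URL
HEADERS = {"x-api-key": "your-api-key"}          # hypothetical credential
VIDEO_ID = "indexed-video-id"


def call_pegasus(endpoint: str, payload: dict) -> dict:
    """POST a request to one of the assumed video-to-text endpoints."""
    resp = requests.post(f"{BASE_URL}/{endpoint}", headers=HEADERS, json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()


# Gist: preloaded prompts; caller only picks which short outputs to return.
gist = call_pegasus("gist", {"video_id": VIDEO_ID, "types": ["title", "topic", "hashtag"]})

# Summary: preloaded prompts; caller picks summary, chapter, or highlight.
chapters = call_pegasus("summarize", {"video_id": VIDEO_ID, "type": "chapter"})

# Generate: caller supplies a free-form prompt describing format and style.
report = call_pegasus("generate", {
    "video_id": VIDEO_ID,
    "prompt": "Write a bulleted report of the key decisions discussed in this video.",
})

print(gist, chapters, report)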