Alibaba’s Qwen2-VL: Long-Form Video Understanding

Brain Titan
Aug 31, 2024


Alibaba Cloud has released Qwen2-VL, the latest version of its vision-language model, which is significantly improved over its predecessor, Qwen-VL.

Qwen2-VL offers advanced understanding of images across a wide range of resolutions and aspect ratios, and performs well on multiple visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, and MTVQA.

In addition, Qwen2-VL can understand video content longer than 20 minutes and supports complex reasoning and decision-making, enabling it to perform automated operations on mobile devices, robots, and similar agents.

The model also adds multilingual support and can understand text in images in many languages, including most European languages, Japanese, Korean, and Arabic.

Model Sizes of Qwen2-VL

The models released this time include the open-source Qwen2-VL-2B and Qwen2-VL-7B, as well as an API for Qwen2-VL-72B.

  • Qwen2-VL-72B: The largest model in the family, it performs well on most evaluation metrics and is especially strong in document understanding.
  • Qwen2-VL-7B: Provides cost-effective, competitive performance while retaining support for image, multi-image, and video inputs. The model performs strongly on document understanding tasks such as DocVQA and on understanding multilingual text in images (evaluated by MTVQA), establishing state-of-the-art performance.
  • Qwen2-VL-2B: A smaller 2B model optimized for potential mobile deployment. Despite its small size, it delivers strong performance in image, video, and multilingual understanding. Compared with other models of similar size, it performs particularly well on video-related tasks, document understanding, and general scenario question answering.

Key Features and Highlights of Qwen2-VL

Enhanced recognition capabilities

  • Object Recognition: Qwen2-VL improves recognition of multiple objects in complex scenes, going beyond plants and landmarks to understanding the complex relationships among multiple objects.
  • Text Recognition: Recognition of handwritten and multilingual text is significantly enhanced; the model can read text in images in many languages, including most European languages, Japanese, Korean, and Arabic.

Visual Reasoning

  • Problem-Solving Skills: Qwen2-VL has significantly improved its math and coding skills; it can solve complex mathematical problems through chart and figure analysis and can correctly interpret images even under extreme scale distortion.
  • Information Extraction: The model can extract information from real-world images and charts, follows instructions better, and solves practical problems by connecting abstract concepts to concrete solutions.

Visual Agent Capabilities

  • Function Invocation: Qwen2-VL demonstrates strong potential as a visual agent, capable of invoking external tools by interpreting visual cues to obtain real-time data such as flight status, weather forecasts, or package tracking (a sketch of this pattern follows the list below).
  • User Interface Interaction: By letting the model act on what it sees, Qwen2-VL pushes AI’s perception capabilities to a new level, making it not just an observer but an active participant in the visual experience.
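The snippet below is a minimal sketch of the function-invocation pattern described above: the model inspects an image together with a list of tool descriptions, replies with a JSON tool call, and the application executes it. The tool name query_weather, the helper vlm_generate, and the JSON convention are hypothetical placeholders for illustration, not part of Qwen2-VL’s API.

```python
# Hedged sketch of a visual-agent loop. `query_weather` and `vlm_generate` are
# hypothetical placeholders, not part of Qwen2-VL's API; the JSON convention is
# purely illustrative.
import json

TOOLS = {
    "query_weather": lambda city: f"Sunny, 27°C in {city}",   # stub external tool
}

SYSTEM = (
    "You can call tools. Available: query_weather(city). "
    'Reply with JSON like {"tool": "query_weather", "args": {"city": "..."}} '
    "when a tool is needed, otherwise answer directly."
)

def vlm_generate(system: str, image_path: str, question: str) -> str:
    """Placeholder for a real Qwen2-VL inference call (see the loading sketch later in the article)."""
    return '{"tool": "query_weather", "args": {"city": "Hangzhou"}}'   # canned reply for the demo

def run_agent(image_path: str, question: str) -> str:
    reply = vlm_generate(SYSTEM, image_path, question)
    try:
        call = json.loads(reply)                  # did the model ask for a tool?
    except json.JSONDecodeError:
        return reply                              # plain answer, no tool needed
    result = TOOLS[call["tool"]](**call["args"])  # dispatch to the external tool
    # A second model call would normally weave `result` into a final answer.
    return result

print(run_agent("screenshot_of_weather_app.png", "What's the weather in this city?"))
```

In a real agent, vlm_generate would be an actual Qwen2-VL inference call and the tool result would be fed back to the model to produce the final natural-language answer.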

Performance of Qwen2-VL

Qwen2-VL is evaluated on multiple key dimensions of visual capabilities, demonstrating superior performance, especially in the following aspects:

Complex University-Level Problem Solving

Qwen2-VL demonstrates strong ability in solving complex mathematical problems and logical reasoning, and can handle high-level academic and practical problems.

Document and Table Comprehension

In document understanding tasks such as DocVQA (Document Visual Question Answering), the Qwen2-VL-72B model performs particularly well, surpassing closed-source models such as GPT-4o and Claude 3.5 Sonnet and demonstrating top performance.

Multilingual Text-Image Understanding

Qwen2-VL performs well in multilingual text-image understanding tasks, especially in the MTVQA (Multilingual Text Visual Question Answering) task, achieving industry-leading performance levels.

General Scenario Question Answering

In general scenario question answering, Qwen2-VL demonstrates strong understanding and answering capabilities, adapting to a variety of complex scenarios.

Video Understanding

Qwen2-VL has a very strong ability to understand video content, can process videos longer than 20 minutes, and demonstrates excellent performance in video-related tasks.
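As an illustration, here is a minimal sketch of feeding a long local video to the open 7B instruct checkpoint through Hugging Face Transformers and the qwen-vl-utils helper, following the usage pattern published with the model. The video path and the frame-sampling rate are placeholders, and memory use grows with video length.

```python
# Hedged sketch: long-video question answering with Qwen2-VL-7B-Instruct via
# Hugging Face Transformers and qwen-vl-utils. The video path and the fps value
# (how many frames per second are sampled) are placeholders.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "path/to/long_lecture.mp4", "fps": 1.0},  # placeholder video
        {"type": "text", "text": "Summarize the main events in this video."},
    ],
}]

# Build the chat prompt; process_vision_info decodes the video and samples frames.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate and strip the prompt tokens before decoding the answer.
output_ids = model.generate(**inputs, max_new_tokens=256)
answer_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(answer_ids, skip_special_tokens=True)[0])
```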

Agent Interaction Capabilities

Qwen2-VL has the ability to perform complex interactions with devices (e.g., mobile devices, robots), supports automated operations, and performs well in a variety of interactive tasks.

Model Architecture of Qwen2-VL

Qwen2-VL inherits the architectural design of Qwen-VL and makes several key improvements on this basis to enhance its visual and language processing capabilities, especially in the processing of image and video input. The following are the main architectural features of Qwen2-VL:

Visual Transformer (ViT) Model

Qwen2-VL uses a Visual Transformer (ViT) model with approximately 600M parameters, which is specifically designed to process image and video inputs. The use of the ViT model enables Qwen2-VL to effectively perceive and understand visual information and adapt to various input types, including static images and dynamic videos.

Naive Dynamic Resolution Support

Qwen2-VL introduces Naive Dynamic Resolution, which allows the model to process images of any resolution. By mapping an image to a dynamic number of visual tokens, the model's input stays consistent with the information inherent in the image. This approach is closer to human visual perception and can handle images of any clarity or size.
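To make the idea concrete, the sketch below estimates how many visual tokens an image of a given size would produce. The 14-pixel patch size and the 2x2 patch-merge factor follow the published description of Qwen2-VL's vision encoder; the helper name, the default pixel budget, and the rounding logic are illustrative assumptions rather than the library's actual code.

```python
# Hedged sketch of Naive Dynamic Resolution: the number of visual tokens grows
# with image area instead of being fixed. Patch size (14 px) and the 2x2 merge
# factor follow the model description; everything else here is illustrative.
import math

PATCH = 14          # ViT patch size in pixels
MERGE = 2           # 2x2 adjacent patches are merged into one visual token

def estimate_visual_tokens(height: int, width: int,
                           min_pixels: int = 256 * 28 * 28,
                           max_pixels: int = 1280 * 28 * 28) -> int:
    """Roughly estimate how many visual tokens an image of (height, width) produces."""
    # Rescale so total pixels stay within [min_pixels, max_pixels], keeping aspect ratio.
    area = height * width
    if area > max_pixels:
        scale = math.sqrt(max_pixels / area)
    elif area < min_pixels:
        scale = math.sqrt(min_pixels / area)
    else:
        scale = 1.0
    # Snap each side to a multiple of the merged-patch size (28 px).
    h = max(PATCH * MERGE, round(height * scale / (PATCH * MERGE)) * PATCH * MERGE)
    w = max(PATCH * MERGE, round(width * scale / (PATCH * MERGE)) * PATCH * MERGE)
    return (h // PATCH) * (w // PATCH) // (MERGE * MERGE)

# A mid-sized photo and a tall, narrow document scan map to different token counts:
print(estimate_visual_tokens(768, 1024))
print(estimate_visual_tokens(2000, 300))
```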

Multimodal Rotary Position Embedding (M-RoPE)

The architecture innovatively introduces Multimodal Rotary Position Embedding (M-RoPE), which decomposes the original rotary position embedding into three components representing temporal, height, and width information. M-RoPE enables Qwen2-VL to simultaneously capture and integrate positional information for one-dimensional text, two-dimensional images, and three-dimensional video, significantly enhancing the model's multimodal processing capabilities.
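A rough sketch of the idea, not the model's actual implementation: each token carries a (temporal, height, width) position triple, and the rotary embedding's frequency channels are split into three groups, one per component. The even three-way split and the helper names below are illustrative assumptions.

```python
# Hedged sketch of M-RoPE: rotary angles are computed separately from the
# temporal, height, and width position indices and concatenated across the
# frequency channels. The split ratio and names are illustrative only.
import torch

def rotary_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE angles for 1-D positions over `dim` frequency channels."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return torch.outer(positions.float(), inv_freq)         # (seq_len, dim // 2)

def mrope_angles(t_pos, h_pos, w_pos, head_dim: int) -> torch.Tensor:
    """Concatenate angles computed from temporal, height, and width positions."""
    d = head_dim // 3 // 2 * 2                               # even chunk size per component
    chunks = [
        rotary_angles(t_pos, d),
        rotary_angles(h_pos, d),
        rotary_angles(w_pos, head_dim - 2 * d),
    ]
    return torch.cat(chunks, dim=-1)                         # (seq_len, head_dim // 2)

# Text tokens: all three components advance together (reduces to ordinary 1-D RoPE).
# Image patches: temporal is constant, height/width index the patch grid.
# Video patches: the temporal index also advances per frame.
t = torch.arange(8); h = torch.arange(8); w = torch.arange(8)
print(mrope_angles(t, h, w, head_dim=96).shape)              # torch.Size([8, 48])
```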

Multimodal fusion and reasoning

Qwen2-VL achieves efficient cross-modal reasoning by combining the capabilities of visual transformers and language models when processing multimodal data (such as text, images, and videos). This fusion enables the model to perform multi-level understanding and analysis in complex scenarios.

Open Source and API Integration

The Qwen2-VL-2B and Qwen2-VL-7B models are released under the Apache 2.0 open-source license and are integrated into third-party frameworks such as Hugging Face Transformers and vLLM, making it easier for developers to call and deploy the models. The Qwen2-VL-72B model is available through an API and is suited to application scenarios that require greater model capability.
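As a concrete starting point, the sketch below loads the open 7B instruct checkpoint with Hugging Face Transformers and asks a single question about a local image, following the usage pattern published alongside the model (it requires a recent transformers release and the qwen-vl-utils package); the image path and prompt are placeholders.

```python
# Hedged sketch: single-image inference with the open Qwen2-VL-7B-Instruct checkpoint.
# The image path and question are placeholders.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "path/to/invoice.jpg"},       # placeholder image
        {"type": "text", "text": "What is the total amount on this document?"},
    ],
}]

# Build the chat prompt and collect the vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate and strip the prompt tokens from the output before decoding.
output_ids = model.generate(**inputs, max_new_tokens=128)
answer_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(answer_ids, skip_special_tokens=True)[0])
```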

Official introduction: https://qwenlm.github.io/blog/qwen2-vl/

GitHub: https://github.com/QwenLM/Qwen2-VL

Model download: https://huggingface.co/collections/Qwen/qwen2-vl-66cee7455501d7126940800d

Online demo: https://huggingface.co/spaces/Qwen/Qwen2-VL

API: https://help.aliyun.com/zh/model-studio/developer-reference/qwen-vl-api

