Zhipu AI Unveils GLM-4-Plus: Speech-Vision Integration

Brain Titan
3 min read · Aug 31, 2024


https://kcgod.com/glm-4-plus-by-zhipu-ai

Zhipu AI has released its latest foundation model, GLM-4-Plus, demonstrating capabilities similar to OpenAI's GPT-4o, including real-time voice calls and visual reasoning. The company announced that these features will open to users on August 30.

Major Updates of GLM-4-Plus

  • Language base model GLM-4-Plus: performance in language understanding, instruction following, long-text processing, and more has been comprehensively improved, keeping the model at an internationally leading level.
  • Text-to-image model CogView-3-Plus: performance close to that of the current best models, such as MJ-V6 and FLUX.
  • Image/video understanding model GLM-4V-Plus: offers strong image understanding together with time-aware video understanding. It will be launched on the open platform (bigmodel.cn), becoming the first general-purpose video understanding model API in China.
  • Video generation model CogVideoX: following the release and open-sourcing of the 2B version, the 5B version has now been officially open-sourced as well, with further improved performance, making it a top choice among current open-source video generation models.
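For developers, access to these models is through the bigmodel.cn open platform. As a rough sketch (the endpoint path and OpenAI-style payload shape are assumptions based on the platform's documented conventions, not confirmed by this article), a chat request to GLM-4-Plus might be assembled like this:

```python
import json

# Assumed endpoint path on the bigmodel.cn open platform.
API_URL = "https://open.bigmodel.cn/api/paas/v4/chat/completions"

def build_chat_request(prompt: str, model: str = "glm-4-plus") -> dict:
    """Assemble an OpenAI-style chat-completion payload for GLM-4-Plus."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

payload = build_chat_request("Summarize the GLM-4-Plus release in one sentence.")
print(json.dumps(payload, ensure_ascii=False, indent=2))
```

In practice the payload would be POSTed to `API_URL` with an `Authorization` header carrying a bigmodel.cn API key.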

GLM-4-Plus performs strongly across the board, with significant gains in language understanding, instruction following, long-text processing, and many other areas.

Functions and Features of GLM-4-Plus

Language comprehension and processing ability

  • Enhanced language understanding: GLM-4-Plus has improved in language understanding, instruction following, and long-text processing, and can better handle complex text tasks.
  • Long-text processing: through a more precise strategy for mixing long and short text data, GLM-4-Plus's long-text reasoning has improved significantly, reaching a level comparable to the international state of the art.
  • In language and text capabilities, GLM-4-Plus is comparable to GPT-4o and Llama 3.1 405B.

Model construction and data synthesis

  • High-quality synthetic data: GLM-4-Plus was trained with a large amount of model-assisted, high-quality synthetic data, improving performance especially on reasoning tasks (such as mathematics and code/algorithm problems) and better reflecting human preferences.

Multimodal capabilities

  • Image and video understanding: GLM-4V-Plus, an extension of GLM-4-Plus, has strong image understanding and adds time-aware video understanding, so it can comprehend complex video content and perform temporal reasoning.
  • Image and video generation: working with companion models such as CogView-3-Plus and CogVideoX, GLM-4-Plus delivers strong performance on tasks such as image editing and video generation.
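Since GLM-4V-Plus accepts images alongside text, a vision request would pair an image reference with a question in a single user message. The multimodal content format below (a list of typed parts) is an assumption modeled on OpenAI-style vision APIs, and the model name `glm-4v-plus` is taken from the announcement:

```python
import json

def build_vision_request(image_url: str, question: str,
                         model: str = "glm-4v-plus") -> dict:
    """Assemble an assumed multimodal chat payload: one user message
    containing an image part followed by a text part."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": question},
            ],
        }],
    }

payload = build_vision_request(
    "https://example.com/frame.jpg",  # placeholder image URL
    "What is happening in this image?",
)
print(json.dumps(payload, ensure_ascii=False, indent=2))
```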

The voice mode handles calls smoothly and responds quickly even when frequently interrupted. With the camera turned on, the Qingyan assistant can also see what the user sees, understand instructions, and execute them accurately.

The video call function launches on August 30, first for a subset of Qingyan users, and will also be opened to external applications.

For more info ↓

More about AI: https://kcgod.com

