China's large-model sector has moved past its initial land-grab phase, the so-called 'hundred-model war', and officially entered a period of industrial deepening driven by core technology breakthroughs and the verification of commercial value. At the level of industry structure, the general-purpose foundation-model track has completed a first round of consolidation, cutting the number of core competitors from more than a hundred to roughly twenty. This has left a three-way competitive landscape of internet giants, cloud service providers, and innovative vertical-domain companies, marking the industry's strategic shift from capital-driven expansion to the building of technical moats. Compared with 2023, the across-the-board improvement in multimodal capability in 2024 has markedly expanded application boundaries. This has not only prompted incumbent vendors to increase investment but also drawn in companies from vertical fields such as image and video, so that technical innovation and market competition now accelerate each other.
At the application level, large-model use cases have broken out of the traditional confines of dialogue assistants and basic content creation. They have penetrated deeply into professional fields such as autonomous driving, medical image analysis, and 3D character generation, demonstrating commercial value across industries. To comprehensively evaluate the technical strength and application progress of large models, Frost & Sullivan and the LeadLeo Research Institute have added an assessment of multimodal understanding and generation on top of the language-model evaluation, aiming to assess large models along two dimensions: language ability and multimodal ability.
Frost & Sullivan and the LeadLeo Research Institute will continue to monitor the latest developments in the field of large language models (LLMs) in China, providing objective and professional guidance and reference for the industry. The '2025 China Large Language Model Annual Evaluation Market Research Report', jointly released by the two organizations, delves into the key changes and achievements in the LLM field in 2024, offering insightful analysis and perspectives for the industry.
01
Core Insights of the Evaluation
Large Language Model Evaluation
The gap between Chinese and international large models is closing rapidly: the 2025 annual evaluation shows that the overall scores of leading Chinese large models now approach the international average, with the top eight Chinese models scoring almost on par with the top overseas models. Chinese large models have entered the global front rank on core capabilities, and the technology gap is narrowing quickly.
Large models have become 'knowledge encyclopedia experts': in this evaluation, all participating large models scored at or near full marks on knowledge questions ranging from basic common sense to advanced science. This indicates that current large models no longer show obvious weaknesses in knowledge recall and can credibly take on the role of 'knowledge encyclopedia experts'.
Deep reasoning and mathematics are the key differentiators of model strength: the evaluation data show that the gaps between large models are widest in logical reasoning and mathematics; on a 0-100 scale, the maximum score difference reaches as much as 50 points. Reasoning and mathematical ability have thus become the most important yardsticks of a model's strength.
The cost-effectiveness of Chinese large models far exceeds that of international ones: the evaluation data show that while Chinese first-tier models outscore their international counterparts overall, their inference and generation costs are far lower. The average price per 1 million tokens for leading Chinese models is only 38.2 yuan, against an average of 158.3 yuan for international models, a more than four-fold cost advantage that underlines the significant competitiveness of Chinese large models in efficiency and cost.
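The price gap cited above can be checked with a one-line computation (the two averages are taken from the evaluation data as reported; the variable names are illustrative):

```python
# Average price per 1 million tokens, as reported in the evaluation (CNY).
CN_AVG_PRICE = 38.2    # leading Chinese models
INTL_AVG_PRICE = 158.3 # leading international models

# Ratio of international to Chinese pricing: roughly a four-fold gap.
price_ratio = INTL_AVG_PRICE / CN_AVG_PRICE
print(f"International models cost {price_ratio:.1f}x more per 1M tokens")
```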
Multimodal Evaluation
Multimodal understanding is still in a developmental stage, with recognition accuracy below 80%: in the multimodal-understanding evaluation, the participating models' overall recognition accuracy across the various image types did not exceed 77%, and even the best model stayed below 85%, indicating considerable room for improvement in the practical application of multimodal understanding.
The core challenge in multimodal understanding is object localization: among the nine sub-dimensions of multimodal understanding, object localization has the lowest recognition accuracy, averaging only 44.3%. Precise object localization remains a key bottleneck of current multimodal-understanding technology.
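Object-localization tasks are commonly scored by comparing predicted bounding boxes against reference boxes via intersection-over-union (IoU). A minimal sketch of such scoring follows; the box format, the 0.5 threshold, and the function names are illustrative assumptions, not details published in the report:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def localization_accuracy(preds, refs, threshold=0.5):
    """Percentage of predictions whose IoU with the reference meets the threshold."""
    hits = sum(iou(p, r) >= threshold for p, r in zip(preds, refs))
    return 100.0 * hits / len(refs)
```

Under a scheme like this, an average accuracy of 44.3% would mean fewer than half of the predicted boxes overlap their targets closely enough to count as correct.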
Models' artistic creation is significantly stronger than their commercial creation: in the multimodal-generation tasks, models averaged 74.3 out of 100 on artistic creation versus 69.5 on commercial creation, indicating that they handle aesthetic and creative briefs well but still need optimization in accuracy and fit for commercial application scenarios.
The main shortcomings of multimodal generation are instruction following and text rendering: models frequently deviate from the given instructions, producing images that do not match the requirements, and most models cannot render text within images accurately. Both issues significantly limit the applicability and growth potential of multimodal technology across broader scenarios.
02
Evaluation Background and Methodology
Overview of Participants
This evaluation comprises two parts: a large-language evaluation and a multimodal evaluation. The language tasks and the multimodal-understanding tasks were run by calling each model's API, while multimodal generation was evaluated through each model's web interface.
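The API-driven portion of such an evaluation can be sketched as follows. The endpoint URL, request shape, and scoring function here are illustrative assumptions (modeled on the widely used OpenAI-compatible chat-completions format), not the evaluators' actual harness:

```python
import json
from urllib import request

API_URL = "https://example.com/v1/chat/completions"  # hypothetical endpoint

def build_payload(model: str, question: str) -> dict:
    """Build one chat-completion request for a single evaluation item."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "temperature": 0,  # deterministic output makes scoring reproducible
    }

def send(payload: dict, api_key: str) -> str:
    """POST one request and return the raw JSON response body."""
    req = request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
    )
    with request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

def score_exact(answers: list[str], references: list[str]) -> float:
    """Percentage of model answers that exactly match the reference answers."""
    hits = sum(a.strip() == r.strip() for a, r in zip(answers, references))
    return 100.0 * hits / len(references)
```

Real evaluations typically combine exact-match scoring like this for objective questions with human or model-based grading for open-ended ones.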
In the large-language-model section, the Chinese models evaluated include Doubao, Wenxin Yiyan, Zidong Taichu, Baichuan Intelligence, iFlytek Spark, Tencent Hunyuan, Kimi.ai, 360 Brain, Zhipu AI, Zero-One Everything, MiniMax, DeepSeek, Tongyi Qianwen, SenseTime SenseNova, Jieyue Xingchen, and Shusheng, representing the mainstream large language models in the current Chinese market. Internationally, OpenAI's GPT-4o, GPT-4o-mini, and o1, Google's Gemini 2.0, and Anthropic's Claude 3.5 were selected; they represent the world's top level and provide an important benchmark for Chinese large language models.
In the multimodal evaluation section, due to differences between models and teams, the evaluation process has been further subdivided into two parts: multimodal understanding and multimodal generation. In the multimodal understanding category, the shortlisted companies include SenseTime Technology, Alibaba Cloud, Tencent Cloud, LeapSecond, Zhipu AI, iFlytek, ByteDance, FaceID Intelligence, Minimax, Zero-One Everything, and DeepSeek. In the multimodal generation category, the shortlisted companies include SenseTime Technology, Alibaba Cloud, Tencent Cloud, LeapSecond, Zhipu AI, iFlytek, ByteDance, Douyin, Kuaishou, 360, and TianGong AI. These represent the leading models in the field of multimodal AI in China at present.

Source: Frost & Sullivan, LeadLeo Research Institute
Dimension Selection
This large model evaluation is centered around the actual application experience and value of users. By conducting in-depth analysis of various real-world usage scenarios and establishing a scientific and systematic evaluation framework, it comprehensively and objectively measures the performance of each model in terminal applications.
In the evaluation of large language models, the overall evaluation system includes five core primary dimensions: mathematical and scientific abilities, language proficiency, moral responsibility, industry-specific capabilities, and comprehensive abilities. These are further broken down into multiple detailed secondary dimensions such as risk information recognition, logical reasoning, analogical transfer, and role-playing, to more accurately reveal the model's ability performance and limitations in different task scenarios.

Source: Frost & Sullivan, LeadLeo Research Institute
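A framework of primary and secondary dimensions like the one described above is usually rolled up into an overall score via a weighted average. The sketch below illustrates the mechanism; the weights and dimension keys are hypothetical placeholders, since the report does not publish its weighting:

```python
# Hypothetical weights for the five primary dimensions; the report does not
# disclose the actual weighting used in the evaluation.
WEIGHTS = {
    "math_science": 0.25,
    "language": 0.25,
    "ethics": 0.15,
    "industry": 0.15,
    "comprehensive": 0.20,
}

def overall_score(dimension_scores: dict) -> float:
    """Weighted average of per-dimension scores on a 0-100 scale."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(WEIGHTS[d] * s for d, s in dimension_scores.items())
```

Secondary dimensions (risk-information recognition, logical reasoning, analogical transfer, role-playing, and so on) would be averaged into their parent primary dimension the same way before this final roll-up.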
The multimodal evaluation is divided into two core primary dimensions: multimodal understanding and multimodal generation. Multimodal understanding is refined into nine sub-dimensions, including object recognition, image logic, image emotion, and object localization; multimodal generation is divided by application characteristics into commercial creation and artistic creation. Through this detailed dimension design and its evaluation criteria, the aim is to demonstrate comprehensively and in depth each multimodal model's capabilities, strengths, and areas for improvement in understanding and generation.

Source: Frost & Sullivan, LeadLeo Research Institute
03
Results of the Chinese Large Model Capability Evaluation

● Tongyi Qianwen
Alibaba Cloud released its flagship model Qwen 2.5-Max in January 2025. It adopts a Mixture-of-Experts (MoE) architecture and was pre-trained on a dataset of more than 20 trillion tokens. In multiple benchmark tests its performance comprehensively surpassed DeepSeek-V3, GPT-4o, and Llama-3.1-405B. The model is particularly strong at mathematical reasoning and code generation, with mathematical ability exceeding that of GPT-4o, and the open-source Qwen1.5-110B has topped the Hugging Face leaderboard. Users can call its API through Alibaba Cloud's Bailian platform, and the enterprise edition is highly cost-effective, with post-discount costs 84% below the industry average.
● SenseTime SenseNova
SenseTime's SenseNova 5.5 Pro unified large model adopts a natively fused multimodal approach, unifying the large language model and the multimodal large model. In the pre-training phase, fused multimodal data are synthesized through methods such as massive interleaved image-text data and inverse rendering, building an interactive bridge between the text and image modalities. In the post-training phase, cross-modal task-enhancement training, including video interaction and multimodal document analysis, stimulates the model's ability to integrate and analyze multimodal information, paving the way for deep reasoning and multimodal information integration.
● Tencent Hunyuan
Tencent Cloud released hunyuan-turbo-latest in October 2024; it is continuously optimized through a dynamic update mechanism (an iteration every two weeks). Its core advantages are a 30%-50% reduction in hallucination rate via the 'TrueView' algorithm and stronger trap-question recognition via reinforcement learning, ensuring output safety. In addition, the model supports ultra-long text generation (through optimized position encoding) and chain-of-thought reasoning that simulates human step-by-step decision-making, making it suitable for complex task processing.

