Frost & Sullivan releases the '2025 White Paper on the Development of World Models in China'

Published: 2025/09/23


World models are entering a critical transition toward generating complex intelligent behavior, becoming key infrastructure for integrating physical AI with virtual worlds and helping China compete for a leading position in global AI. In autonomous driving, world models are moving from R&D testing toward mass-production deployment: by generating massive volumes of high-fidelity scenarios, they drive continuous learning, autonomous verification, and rapid iterative optimization of driving systems, supporting the rollout of L3/L4 systems while significantly cutting the cost and time of on-road testing. In embodied intelligence, world models serve as synthetic-data engines that break the bottleneck of scarce physical-interaction data, providing robots with efficient, safe virtual training environments and accelerating their adaptation to real-world tasks. Both applications highlight the core value of world models: using simulation and generation to drive AI's closed-loop evolution from perception to action.

 

This white paper examines the cutting-edge AI technology of world models, analyzing their current development status, technical paths, market landscape, and future trends. World models are generative AI models that understand the dynamics of the real world, including its physical and spatial attributes. Taking text, images, videos, and motion data as input, they generate video; through learning, they come to understand the physical characteristics of real environments and can therefore represent and predict dynamics such as motion, forces, and spatial relationships in sensory data. This accelerates virtual-world generation for physical AI and produces scalable augmented data, removing data bottlenecks and enabling more efficient foundation-model training. The white paper aims to comprehensively review the development history, current status, and core technologies of world models and their applications in intelligent driving and embodied intelligence, and, by comparing the capabilities of different vendors, to explore their future development trends.

PART.01

The World Model 2025: The AI Leap from Perceiving Reality to Deciding the Future

 

World models are generative AI models centered on understanding the dynamic laws of the real world (including physical characteristics and spatial attributes). They build internal representations from multimodal inputs (text, images, videos, motion data, etc.) and generate video content, grasping the physical attributes of real environments and using generated environments and actions to simulate, guide, and ground decisions. Fei-Fei Li, founder of World Labs and professor at Stanford University, has noted: "World models should not only perceive and model the real world but also foresee possible future states, thereby providing guidance for decision-making." In their current state, however, world models remain at an early stage, mostly focused on perception-level simulation and compression, and have not yet achieved a stable closed loop of perception, prediction, and decision-making. Although pilot applications exist in autonomous driving, they rely heavily on specific environments and strong priors, lacking generality and long-horizon generalization. Future development will concentrate on three directions: first, enhancing understanding of world states through multimodal inputs; second, introducing causal modeling and controllable generation mechanisms to improve prediction accuracy and behavioral planning; and third, deeply integrating world models with embodied-intelligence systems to achieve the leap from "observing the world" to "understanding and participating in the world."

Data source: Frost & Sullivan analysis, LeadLeo research institute

 

 

PART.02

Different world-model vendors, drawing on their own strategies and technical strengths across different dimensions, have developed distinctive world-model capabilities and related products.

 

The technical capability of world models rests on four pillars:

- Causal reasoning: enables AI to answer hypothetical questions such as "What will happen if A occurs?" and to understand the deep causal relationships between actions and outcomes, strengthening autonomous decision-making in dynamic environments.

- Spatiotemporal consistency: solves the object distortion and deformation problems of traditional video generation. Through technologies such as long-term memory mechanisms, latent-space modeling, and object-centric representation, world models maintain stable spatial structure and plausible temporal evolution at a higher level, generating stable, coherent video sequences.

- Physical-rule description from multimodal data: aims to simulate complex physical phenomena such as fluid motion and object collisions. A world model predicts behavior that obeys basic 3D geometry and physical rules, modeling 3D scene structure rather than mere pixels; this avoids the "dreamlike" sense of unreality and lays the foundation for subsequent interaction.

- Execution and real-time feedback: combined with reinforcement learning, this realizes the dynamic loop of perception → modeling → planning → execution → perception update → model revision. Low-latency real-time feedback is the foundation of practical application and can be achieved through model lightweighting and latent-space state generation.
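The perception → modeling → planning → execution → perception update → model revision loop can be sketched in a few lines of Python. Everything below (the toy drift dynamics, the `ToyWorldModel` class, the candidate-action planner) is an illustrative assumption, not any vendor's actual system:

```python
class ToyWorldModel:
    """Minimal latent world model: predicts the next state from (state, action)."""
    def __init__(self):
        self.bias = 0.0  # learned correction term, revised from prediction error

    def predict(self, state, action):
        # Assumed toy dynamics: next state drifts by the action plus a learned bias.
        return state + action + self.bias

    def revise(self, predicted, observed, lr=0.5):
        # Model revision step: shrink the prediction error.
        self.bias += lr * (observed - predicted)

def plan(model, state, goal, actions=(-1.0, 0.0, 1.0)):
    # Pick the action whose predicted next state lands closest to the goal.
    return min(actions, key=lambda a: abs(model.predict(state, a) - goal))

def environment_step(state, action):
    # "Real" environment with dynamics the model must learn (constant drift 0.3).
    return state + action + 0.3

model, state, goal = ToyWorldModel(), 0.0, 5.0
for _ in range(20):
    action = plan(model, state, goal)         # planning
    predicted = model.predict(state, action)  # modeling (prediction)
    state = environment_step(state, action)   # execution + perception update
    model.revise(predicted, state)            # model revision

print(round(model.bias, 2))  # the bias converges toward the true drift 0.3
```

The point of the sketch is the loop structure: each cycle the model's prediction is compared against the newly perceived state, and the discrepancy feeds back into the model, exactly the "perception update → model revision" step described above.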

 

The industry typically evaluates world-model performance quantitatively with metrics such as FID, FVD, frame rate, video duration, and consistency. Current technical paths divide mainly into generative and non-generative approaches. International vendors such as NVIDIA (Cosmos), Google (Genie 3), and Meta (V-JEPA 2) have launched leading models. SenseTime (Jueying Kaiwu), with innovations such as the first high-resolution, sparse-control multi-view world model, matches these international giants in technical benchmark comparisons and has become a representative vendor of platform-based enablement.
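For reference, FID (Fréchet Inception Distance) compares the mean and covariance of feature distributions from real versus generated samples; lower is better. A minimal sketch, assuming plain NumPy arrays stand in for the Inception-network features used in practice:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real, feats_gen):
    """Fréchet Inception Distance between two sets of feature vectors.

    In real pipelines the features come from a pretrained Inception network;
    here they are plain arrays of shape (n_samples, dim).
    """
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):  # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(2000, 8))
same = rng.normal(0.0, 1.0, size=(2000, 8))  # same distribution -> FID near 0
shifted = real + 3.0                         # shifted distribution -> large FID

print(fid(real, same) < 1.0, fid(real, shifted) > 50.0)
```

FVD applies the same Fréchet distance to features from a video network, which is why it is the preferred metric for judging the temporal coherence of generated sequences.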

Data source: Frost & Sullivan analysis, LeadLeo research institute

 

 

PART.03

Currently, over 80% of autonomous driving algorithms use world models for assisted training. World models drive autonomous driving systems to continuously learn, autonomously verify, and rapidly iterate and optimize.

 

Currently, over 80% of autonomous driving algorithms use world models for assisted training. By generating scenes that combine multiple layers of complex elements, world models turn the "high-dynamics, high-uncertainty" scenarios that traditional algorithms struggle to cover into tractable problems, helping autonomous driving systems upgrade both product performance and market performance. On one hand, world models can rapidly generate massive volumes of high-fidelity scenes covering long-tail and extreme events, significantly improving system robustness and safety assurance. On the other hand, they replace real road testing with efficient simulation, eliminating the need for expensive annotation and map data; this lowers R&D costs while accelerating product iteration and market expansion. By building a closed feedback loop of real data → model training → simulation-scenario verification → model deployment, and by providing a unified representation of latent world states, they give the perception, prediction, planning, and control modules a consistent cognitive context. World models thus drive autonomous driving systems to learn continuously, verify autonomously, and iterate rapidly, markedly improving end-to-end driving performance. They are an accelerator for breaking the bottleneck of large-scale L4 deployment (such as Robotaxi) and a key foundation for autonomous driving agents to move toward human-like cognition and judgment.
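The idea of "scenes that combine multiple layers of complex elements" can be illustrated with a toy scenario enumerator: crossing a few independent layers already yields a combinatorial space, and long-tail slices of it can be generated on demand instead of waited for on real roads. The layer names and values below are invented for illustration; production systems parameterize scenarios far more richly:

```python
import itertools

# Illustrative scenario layers (hypothetical values, not a real taxonomy).
LAYERS = {
    "weather":  ["clear", "rain", "fog", "snow"],
    "lighting": ["day", "dusk", "night"],
    "actor":    ["jaywalker", "cut-in vehicle", "stalled truck", "cyclist"],
    "road":     ["straight", "curve", "intersection"],
}

def enumerate_scenarios(layers):
    """Cross all layers to get the full combinatorial scenario space."""
    keys = list(layers)
    for values in itertools.product(*(layers[k] for k in keys)):
        yield dict(zip(keys, values))

scenarios = list(enumerate_scenarios(LAYERS))
print(len(scenarios))  # 4 * 3 * 4 * 3 = 144 distinct scenario templates

# A long-tail slice that would be rare and dangerous to collect on real roads:
long_tail = [s for s in scenarios
             if s["weather"] in ("fog", "snow") and s["lighting"] == "night"]
print(len(long_tail))  # 2 * 1 * 4 * 3 = 24
```

A generative world model plays the role of rendering each such template into high-fidelity sensor data, which is what turns this cheap enumeration into usable training and validation scenes.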

Data source: Frost & Sullivan analysis, LeadLeo research institute

 

 

PART.04

The world model is the core engine reshaping the development paradigm of embodied intelligence, offering a high-quality, low-cost, scalable path to synthetic data generation that addresses today's data bottleneck. In the future, the world model will become the "cognitive core" of embodied intelligence.

 

Embodied intelligence marks AI's shift from pure information processing to interaction with the physical world. Its core pain point is a severe shortage of physical-interaction data, with the gap between available and required data exceeding 99%. The data embodied intelligence needs must integrate multi-dimensional signals such as text commands, multi-view vision, joint motion trajectories, and physical interactions, far more complex than pure text or a single visual modality, and collecting real physical-interaction data is time-consuming and costly, lagging far behind the pace of technology development. World models can generate visually realistic and physically accurate synthetic data, effectively closing the gap between traditional simulation data and the real world, while dramatically reducing the time and economic cost of data acquisition and scaling data volume with ease. Training on the massive, diverse synthetic data generated by world models can significantly improve an embodied model's adaptability and task success rate in unfamiliar environments.
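One common way to exploit such synthetic data is to mix a small amount of scarce real interaction data into every training batch, so synthetic data supplies scale and diversity while real data anchors the model to true physics. A minimal sketch; the pool sizes, the `real_fraction` parameter, and the `sample_batch` helper are all hypothetical:

```python
import random

random.seed(42)

# Hypothetical pools: real interaction episodes are scarce, synthetic ones cheap.
real_pool = [f"real_{i}" for i in range(100)]
synthetic_pool = [f"synth_{i}" for i in range(10000)]

def sample_batch(batch_size, real_fraction=0.2):
    """Mix scarce real data with world-model synthetic data in each batch.

    real_fraction is a tunable assumption: upweighting real samples keeps
    training grounded while synthetic data provides coverage of rare cases.
    """
    n_real = int(batch_size * real_fraction)
    batch = random.choices(real_pool, k=n_real)
    batch += random.choices(synthetic_pool, k=batch_size - n_real)
    random.shuffle(batch)
    return batch

batch = sample_batch(64)
print(len(batch), sum(s.startswith("real_") for s in batch))  # prints: 64 12
```

Note that the real pool is sampled with replacement at a fixed fraction per batch, which is what lets 100 real episodes keep pace with 10,000 synthetic ones throughout training.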

 

Currently, world-model applications are more mature in autonomous driving than in embodied intelligence. In the future, world models will become the "cognitive core" of embodied intelligence: they not only supply data but are reshaping the entire development paradigm. As a prediction-and-generation engine, a world-model platform seamlessly integrates the full pipeline from data synthesis and algorithm training to simulation verification, forming an efficient closed-loop iteration system. By offering an integrated toolchain, it removes the complex engineering burden of building infrastructure in-house, letting developers focus on algorithm and application innovation and markedly improving R&D efficiency. It provides safe, explainable closed-loop verification across the entire perception-decision-execution process and, by precisely simulating physical interactions, systematically improves the adaptability and reliability of intelligent agents. Deep integration with the development toolchain eliminates the efficiency losses of traditionally fragmented workflows and supports efficient development, training, and performance optimization of mainstream models.

Data source: Frost & Sullivan analysis, LeadLeo research institute

 

 

PART.05

Case Study: The comprehensive capabilities of SenseTime's "Jueying Kaiwu" world model rank at the forefront among independent third parties and OEMs

 

The comprehensive capabilities of SenseTime's "Jueying Kaiwu" world model lead among independent third parties and OEMs and are comparable to the world's top world-model vendors. In intelligent driving, SenseTime provides autonomous-driving manufacturers with low-cost massive simulation data and extreme-scenario coverage, accelerating training iteration and mass-production deployment. It has jointly built an end-to-end data factory with Zhiji Auto (IM Motors), generating high-risk long-tail scenarios to supplement training and validation data and significantly speeding the mass production of intelligent driving. It also supports the full pipeline from data to model deployment at the Shanghai Autonomous Driving Training Field, generating multi-view simulation data at scale, lowering data costs, and improving R&D efficiency.

 

In embodied intelligence, SenseTime has additionally built the core engine of a personalized intelligent-agent platform on the Kaiwu world model, achieving visual perception, precise navigation, and multimodal interaction; supported by on-device and cloud computing power, it enables intelligent agents to autonomously understand and act in real environments. The platform is the first to support a high-resolution, sparse-control multi-view world model, breaking through the bottleneck of embodied-intelligence data-synthesis technology. Its leading synthetic-data capability supports controllable generation of diverse scenes within a single pipeline, controllable coupling and arbitrary editing of scene elements, and 3D-technology-controlled generation of realistic trajectories.

Data source: Frost & Sullivan analysis, LeadLeo research institute

 

 
