Frost & Sullivan releases 'Development Insights into Synthetic Data Solutions in China 2025'

Published: 2025/09/10

Amid the accelerating penetration of artificial intelligence into various industries and continuous breakthroughs in generative technology, synthetic data is evolving from an auxiliary tool into a core element driving the large-scale implementation of AI. Its generation quality, modal richness, and compliance value are improving in parallel, pushing synthetic data solutions toward higher fidelity, richer modalities, and greater trustworthiness.

 

Against this backdrop, Frost & Sullivan has conducted an in-depth analysis of China's synthetic data solutions and hereby publishes 'Development Insights into Synthetic Data Solutions in China 2025' (hereinafter the 'White Paper'). The White Paper traces the development history, current status, core value, and industry chain map of synthetic data solutions, along with their global market size and regional penetration, and explores their future development trends.

 

The White Paper focuses on synthetic data solutions, analyzing their current development status, technical paths, market landscape, and future trends. Synthetic data solutions systematically address multiple data bottlenecks in AI: they have evolved from a simple substitute into a core strategic asset and show substantial value potential in autonomous driving, embodied intelligence, and industrial scenarios. They provide high-quality, highly available, low-cost data sources for model training and development and for the deployment of AI applications. The White Paper aims to provide a valuable reference for researchers, developers, and enterprises in related fields, promoting technological progress and industrial development.

 

 

 

01

Synthetic data solutions have become a core strategic asset in the AI era

Against the backdrop of rapid development in artificial intelligence technology and continuous deepening of digital transformation, AI research and implementation place higher demands on the scale, quality, and diversity of data. Traditional data acquisition and processing methods face multiple bottlenecks such as cost, privacy, and scarcity, which drives the continuous evolution of synthetic data technology. Specifically, the development stages of synthetic data are as follows:

 

  • 1.0 An auxiliary tool for filling gaps: In the early stages of synthetic data development, real data was scarce, costly to acquire, or constrained by privacy compliance. This phase relied mainly on random distributions, statistical sampling, and mechanism simulation, primarily generating structured data such as tables. However, generation efficiency was only about 30% of that of real data collection, and the output could not reflect dynamic interactions among multiple variables.

     

  • 2.0 A key component of AI implementation: With breakthroughs in generative model technologies such as GANs and VAEs, synthetic data formats expanded to voice, image, and video, and came into wide use in fields such as image recognition, autonomous driving, and biomedicine. At the same time, growing privacy and compliance requirements drove synthetic data to become an important component of AI implementation.

     

  • 3.0 A core strategic asset driving AI transformation: Breakthroughs in large models and generative AI are shifting the AI paradigm from 'model-centric' to 'data-centric', and synthetic data has demonstrated tremendous value potential in addressing the data problems of large-model training and the evolution of embodied intelligence. As high-quality text resources on the internet are gradually exhausted, synthetic data has become the 'renewable fuel' of large-model training, widely used by companies such as OpenAI, Meta, and NVIDIA in the pre-training and alignment stages. At the same time, high-fidelity physical simulation can expand human action samples a thousandfold, effectively alleviating the severe shortage of physical-interaction data in embodied intelligence training and helping robots achieve zero-shot generalization.
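The 1.0-era approach described above (statistical sampling of structured, tabular data) can be sketched roughly as follows. This is a minimal illustration, not the White Paper's method: the column names, values, and the independent-Gaussian-marginal assumption are hypothetical.

```python
import random
import statistics

def fit_marginals(rows):
    """Estimate per-column (mean, stdev) from real tabular rows.
    1.0-era simplification: each column is modeled independently,
    so cross-column (multi-variable) interactions are lost."""
    cols = list(zip(*rows))
    return [(statistics.mean(c), statistics.stdev(c)) for c in cols]

def sample_synthetic(marginals, n, seed=0):
    """Draw synthetic rows column-by-column from the fitted Gaussian marginals."""
    rng = random.Random(seed)
    return [[rng.gauss(mu, sd) for mu, sd in marginals] for _ in range(n)]

# Tiny hypothetical "real" table: (age, income)
real = [[34, 52.0], [29, 48.5], [41, 63.0], [37, 58.2], [45, 70.1]]
marginals = fit_marginals(real)
synthetic = sample_synthetic(marginals, 1000)

# Per-column statistics of the synthetic table track the real ones,
# but any real correlation between age and income is not preserved.
print(round(statistics.mean(r[0] for r in synthetic), 1))
```

The limitation called out in the text falls directly out of this design: because columns are sampled independently, the synthetic table cannot reflect multi-variable dynamic interactions, which is what motivated the move to generative models in the 2.0 stage.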

Source: Analysis by Frost & Sullivan

 

02

The market scale and penetration rate of synthetic data solutions in China are growing rapidly.

The global synthetic data market is experiencing explosive growth. Its size has expanded rapidly from 1.18 billion yuan in 2021 to 4.76 billion yuan in 2025, a compound annual growth rate of 41.8% over the period. Driven by accelerating AI technology iteration, heightened data security requirements, and increasingly prominent cost-effectiveness advantages, the market is expected to maintain strong momentum: the compound annual growth rate from 2025 to 2030 is projected at 33.8%, with the global market exceeding 20 billion yuan by 2030.

 

Thanks to mature technology ecosystems, strict data regulations, and early corporate adoption, synthetic data solutions have their highest penetration rates in North America and Europe. The Chinese market is growing fastest, driven by a massive internet user base, rich implementation scenarios, and strong policy support. Penetration in the rest of Asia-Pacific and in emerging markets remains relatively low, but the growth potential is substantial.

Source: Analysis by Frost & Sullivan

 

03

The use of synthetic data in AI models is expected to exceed that of real data by 2030.

With the widespread application of synthetic data in AI training and inference, data paradigms are evolving toward a 'human-in-the-loop' hybrid data model. By 2030, the use of synthetic data in AI models is expected to exceed that of real data. Emerging technologies will transform synthetic data generation, driving its transition from 'static replication' to 'dynamic evolution' and significantly improving authenticity, scalability, and efficiency. Advanced AI models enable hyper-realistic cross-domain data synthesis; quantum-computing-optimized algorithms accelerate large-scale generation; and digital twins integrate high-fidelity simulations of real systems and environments for predictive modeling and edge-scenario testing.

 

Industrial AI today relies heavily on high-cost real data. In the future it will shift to a hybrid mode of '1% human data + 99% efficient synthesis' built on a human-in-the-loop mechanism: domain experts participate in screening, rule definition, and quality assessment to build a broader, more dynamic, and more reliable data pool that supports highly reliable AI training.
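The hybrid mode described above can be sketched as a toy pipeline. Everything here is a hypothetical illustration of the 'screen synthetic candidates with expert-defined rules' idea, not the White Paper's implementation: the generator, the rules, and the numbers are invented, and the expert-screening function stands in for what would in practice be a review workflow.

```python
import random

def expert_screen(sample, rules):
    """Human-in-the-loop stand-in: accept a synthetic candidate only if it
    passes every expert-defined rule (in practice, a review/labeling queue)."""
    return all(rule(sample) for rule in rules)

def build_hybrid_pool(human_data, generator, rules, target_size, seed=0):
    """Blend a small human-curated set with screened synthetic samples until
    the pool reaches target_size -- the '1% human + 99% synthetic' idea."""
    rng = random.Random(seed)
    pool = list(human_data)                 # the scarce, trusted human slice
    while len(pool) < target_size:
        candidate = generator(rng)
        if expert_screen(candidate, rules):  # only screened samples enter
            pool.append(candidate)
    return pool

# Hypothetical example: vehicle speeds; experts require a plausible range.
human = [42.0, 55.3, 61.1]                   # a few real measurements
rules = [lambda x: 0.0 <= x <= 130.0]        # expert-defined plausibility rule
gen = lambda rng: rng.gauss(60.0, 25.0)      # synthetic speed generator
pool = build_hybrid_pool(human, gen, rules, target_size=300)
print(len(pool))
```

The design choice worth noting is that the human role moves from producing data to gatekeeping it: experts encode rules and audit quality, while the generator supplies volume, which is what makes the 1%/99% ratio workable.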

Source: Analysis by Frost & Sullivan

 

04

Synthetic data is a key foundation for physical-entity-driven application scenarios.

The applications of synthetic data fall into two major driving types: physical-entity-driven and information-data-driven.

Physical-entity-driven applications suit industries that depend heavily on real physical environments and multimodal interaction. These scenarios typically feature complex interactions, difficult real-data collection, and scarce long-tail scenario data. Here the core value of synthetic data lies in its ability to simulate physical laws and real environments, effectively cover extreme and long-tail cases, and support the training and validation of highly reliable systems. Typical industries include autonomous driving, embodied intelligence, and the industrial sector.

Information-data-driven applications focus on areas with extremely high requirements for data privacy, compliance, and sensitivity. These scenarios generally face strict privacy-protection requirements, limited data sharing, and strong compliance constraints. By generating logically plausible alternative samples that preserve statistical characteristics, synthetic data helps institutions share data and expand virtual environments while protecting user privacy. Typical industries include finance, healthcare, and gaming.

Source: Analysis by Frost & Sullivan

 

05

Precise simulation of physical interactions and rich, high-quality synthetic data are key for enterprises transitioning from autonomous driving to embodied intelligence.

When enterprises transition from autonomous driving to embodied intelligence, the core challenge is a fundamental shift from a closed, relatively rule-bound scenario focused on 'mobility' to an open-world problem centered on 'interaction'. Enterprises must bridge the cognitive gap from third-person environmental observation to first-person embodied interaction, handle the jump in data complexity from pure vision to multimodal physical interaction, and upgrade capabilities from specific tasks to general cognition. Synthetic data has therefore become the core infrastructure for solving the training challenges of embodied intelligence. Enterprises should focus on supplementing three core capabilities. First, expand the scope of synthetic data to cover multimodal interactions such as touch and force feedback, as well as dynamic agent learning. Second, build a high-fidelity simulation environment on a high-quality physics engine to achieve effective sim-to-real transfer. Finally, support annotation of high-level semantic data, including relationship reasoning and causal scenario explanations, to bridge the cognitive and behavioral gap.

Source: Analysis by Frost & Sullivan

 

06

Among synthetic data providers, solution-oriented players demonstrate stronger scalability and commercial potential.

The upstream of the synthetic data solution industry chain comprises two support areas: hardware and software. Hardware encompasses sensors and chips: sensors determine the precision and reliability of real-data collection, while chips provide the computational foundation for simulation and data generation. The software side, including data management, data annotation, and data security, constitutes the governance foundation for synthetic data.

 

Competition in the midstream of the industry chain centers on rapid technological iteration, high industry know-how barriers, and stringent ecosystem-compatibility requirements. Together these determine whether suppliers can achieve cross-industry migration and large-scale implementation: iteration speed governs the ability to cope with complex, fast-evolving industry scenarios; know-how affects the transferability and depth of solutions; and ecosystem compatibility bears on scale-up and commercialization, and is even more crucial for supply-chain security and stability. The overall landscape falls into three categories:

 

Solution-focused providers:

 

  • Deep belief in innovation and technology: Provides integrated toolchains and combined hardware-software solutions for industry users, expanding into the embodied intelligence and industrial fields. Through closed experience loops and continuous learning, it generates synthetic data with high physical fidelity and precision, supporting real-time coupled optimization of scenarios and algorithm feedback.

     

  • Guanglun Intelligence: Building on upstream software, it pioneers a 'Real2Sim2Real + reality validation' architecture. It emphasizes the combination of human-in-the-loop and simulation, highlights the complementarity of real and synthetic data, and also provides a platform for authenticity and utility evaluation.

 

Hardware-driven providers:

 

  • NVIDIA: Relying on its GPU hardware and the CUDA ecosystem, it extends downstream into simulation, data generation, and model training, building end-to-end solutions. However, the offering is tightly coupled to its own hardware and lacks flexibility.

 

Simulation platform providers:

 

  • Songying Technology: Supports GPU deployments from different manufacturers and features distributed multi-GPU collaborative computing. It meets the real-time processing requirements of simulation platforms and large-scale application scenarios, and builds capabilities for simulation and virtual training environments.

 

Driven by the rapid development of generative AI and digital transformation, data demand in the vertical industries downstream of the chain is becoming increasingly prominent, and the potential for large-scale implementation is accelerating. Refined through practice, technological iteration, and integration, synthetic data solutions face broad commercialization and application prospects under this trend.

Source: Analysis by Frost & Sullivan

 

 

 
