Frost & Sullivan, in collaboration with LeadLeo Research Institute, released the '2021 China Data Management Solutions Market Report'. The report takes data warehouse, data lake, and intelligent lakehouse product lines as its core research subjects, covering the full year of 2021. The research systematically reviews the market trends, cutting-edge technologies, enterprise needs, and competitive dynamics of data management solutions in fields such as finance, internet, retail, entertainment, telecommunications, energy, logistics, transportation, manufacturing, healthcare, and government affairs, and offers projections on market development prospects from the dimensions of value creation and technological development.
At the same time, we measure the comprehensive competitive strength of industry enterprises in 2021 from multiple dimensions such as storage, data preparation, machine learning, data analysis, process orchestration, compatibility, query and computing performance, disaster recovery construction, service support, open source and industrial chain ecosystem, and data service scenario solutions. Frost & Sullivan, in collaboration with LeadLeo Research Institute, will continue to monitor the Chinese data management solution market and capture competitive dynamics.
Lake-warehouse integration eliminates the difficulty users face in choosing between models, providing them with a data management platform that combines the structural and governance advantages of a data warehouse with the scalability of a data lake and its convenience for machine learning.
The literal understanding of Big Data is massive data, but this perspective is abstract. In the era of network information, the objective significance of big data does not lie in its grand scale of data, but in how data is professionally stored and processed, and from which knowledge value can be mined and extracted.
Technological breakthroughs usually originate from the market's substantial demand for products. The continuous development of the internet, cloud computing, AI, and the integration with big data technology have met business needs. In the big data industry, reducing storage costs, improving computing speed, multi-dimensional analysis and processing of data, and empowering enterprises to utilize data value are key to achieving profitability in the industry and also the root cause of the vigorous development of big data technology.
The connotation of big data technology has evolved continuously with the development of traditional information technology and data applications. The core of the big data technology system has always been basic technologies for storing, computing, processing, and other operations on massive amounts of data.
Over the more than 60 years of big data technology's development, data applications have evolved rapidly, with demand driven first by the internet and then by the mobile internet. Traditional mainstays such as transaction- and analysis-oriented databases and data warehouses remain the backbone of current information technology, but they increasingly struggle to keep up with growing data complexity and the massive, elastic scale of data.
The breakthrough in distributed architecture and the rise of cloud computing have laid the foundation for the concept of data lakes. The integration of lake and warehouse further eliminates user selection difficulties, providing users with a data management platform that combines the structural and governance advantages of data warehouses with the scalability of data lakes and the convenience for machine learning.

Data warehouses and data lakes, as two separate data management paradigms, both possess mature technical accumulations. In long-term practice, they coexist in a hybrid architecture of lake + warehouse: the data lake is used for extracting and processing raw data, while relying on the data warehouse for publishing data pipelines.
In user feedback, the coexisting Hadoop + MPP hybrid architecture has run into difficulties with data redundancy, poor timeliness caused by ETL between the two systems, consistency guarantees, and operations and maintenance.
Driven by user needs, data lake and data warehouse providers have each expanded beyond their original paradigms to a limited extent, gradually forming two 'lake-warehouse integration' paths: 'building the warehouse on the lake' and 'extending the warehouse to the lake'. Although the underlying logic of lake-warehouse integration is still a two-system design, it helps users encapsulate a larger big data paradigm closer to their existing IT foundation, or directly adopt a fully managed, service-based lake-warehouse integrated system.

The performance of the data warehouse itself and ETL depends on communication, I/O capabilities, and hardware performance. The execution architecture determines the supporting capacity of the data warehouse.
Databases focus on OLTP, while data warehouses focus on OLAP. Traditional relational databases such as SQL Server and Oracle can serve as capable data warehouses after rigorous data model design or parameter tuning, whereas pure data warehouses such as Teradata and Sybase IQ are not well suited to OLTP workloads.
As the trend unfolds, OLAP and OLTP are converging into HTAP. As databases gain stronger analytical (AP) capabilities, the boundary between databases and data warehouses will gradually blur.
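The OLTP/OLAP split above can be made concrete with a toy sketch: below, a single embedded SQLite store (standing in for an HTAP engine; the table and values are illustrative) serves both a transactional point update and an analytical aggregate over the same data.

```python
import sqlite3

# Toy illustration (not a real HTAP engine): one store serving both
# an OLTP-style point write and an OLAP-style aggregate query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders (region, amount) VALUES (?, ?)",
                 [("north", 120.0), ("south", 80.0), ("north", 200.0)])

# OLTP: a single-row transactional update
conn.execute("UPDATE orders SET amount = 130.0 WHERE id = 1")
conn.commit()

# OLAP: a full-scan aggregation over the same, freshly updated data
rows = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('north', 330.0), ('south', 80.0)]
```

In a real HTAP engine the two workload classes run against separate row and column representations of the same data; the point of the sketch is only that one system answers both query shapes.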

The Hadoop architecture (MapReduce model) is suited to massive data storage and query, batch ETL, and unstructured data analysis, while the MPP architecture is suited to large-scale data processing over existing relational data structures, multi-dimensional data analysis, and data mart workloads.
In a hybrid architecture, MPP processes high-quality structured data while providing SQL and transaction support, and Hadoop handles semi-structured and unstructured data. This hybrid approach meets the need for efficient processing of structured, semi-structured, and unstructured data, addressing traditional data warehouses' slow loading of large data volumes, low query efficiency, and inability to analyze multiple heterogeneous data sources together. This solution, which blurs the boundary of the data warehouse, has become a mainstream architectural approach. However, as lakehouse integration proceeds, more emerging architectures are being developed and validated, and a new generation of architectures may yet replace the MPP + Hadoop hybrid as the preferred solution.
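The division of labor described above can be sketched as a simple routing rule. The function and dataset-format names below are hypothetical illustrations, not any vendor's API.

```python
# Hedged sketch of the hybrid lake + warehouse split: curated structured
# data goes to the MPP warehouse; raw semi-/unstructured data lands in the
# Hadoop-style lake. All names here are invented for illustration.
def route_dataset(fmt):
    structured = {"csv_curated", "relational_export"}
    if fmt in structured:
        return "mpp_warehouse"   # SQL + transaction support on high-quality data
    return "hadoop_lake"         # batch ETL / unstructured analysis

print(route_dataset("relational_export"))  # mpp_warehouse
print(route_dataset("clickstream_json"))   # hadoop_lake
```

The operational pain points listed earlier (redundancy, cross-system ETL, consistency) all arise at exactly this routing boundary, which is what lakehouse integration tries to remove.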

Data lakes have developed several architectural approaches for real-time data processing; the most representative are the Lambda, Kappa, and IOTA architectures.
Data lakes began with the Lambda architecture to unify offline and real-time computing. The Kappa architecture unified data semantics and reduced data redundancy. The IOTA architecture eliminates ETL through edge computing and a unified data model, further improving data lake efficiency.
Other data lake architectures include the Omega architecture developed by EvenTech, which consists of a stream processing system and a real-time data warehouse. It combines the strengths of the Lambda and Kappa architectures in stream processing, enhancing real-time and offline on-demand intelligent processing as well as efficient handling of real-time snapshots of mutable data.
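As a minimal illustration of the Lambda pattern mentioned above, the sketch below merges a precomputed batch view with a speed-layer view at query time; all names and counts are made up. (In the Kappa architecture, by contrast, the batch path disappears and reprocessing simply replays the stream.)

```python
# Hedged sketch of the Lambda pattern: a batch view precomputed by the
# batch layer is merged with a real-time (speed-layer) view when serving
# a query. Keys and counts are illustrative.
batch_view = {"page_a": 1000, "page_b": 400}   # counts up to the last batch run
speed_view = {"page_a": 25, "page_c": 7}       # counts arrived since then

def serve_query(key):
    """Serving layer: combine batch and real-time results."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)

print(serve_query("page_a"))  # 1025
print(serve_query("page_c"))  # 7
```

The cost of this pattern is maintaining the same logic in two code paths, which is exactly the redundancy Kappa and IOTA set out to eliminate.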

As awareness of data intelligence services spreads, it is crucial for vendors to seamlessly integrate data analysis services with machine learning services, providing more intelligent and easier-to-use products for users without an AI algorithm background, such as data engineers and analysts.
Databases, data warehouses, data lakes, and integrated lakehouse solutions are data infrastructure; only by applying data analysis tools to drive decision-making can data be turned into value. Artificial intelligence and machine learning capabilities are the key features that give integrated lakehouse services their innovative capability.
Data Intelligence is based on big data. It involves processing, analyzing, and mining massive amounts of data through AI to extract information and knowledge from the data. By establishing models, it seeks solutions to existing problems and enables predictions, thereby helping decision-making.
In the past, BI-oriented statistical analysis was the main application scenario for data warehouses, while AI-oriented predictive computing was the mainstream application of data lakes. As the lakehouse approach matures, the AI + BI dual mode will become an important workload form for big data computing and analysis.
With the continuous development of big data technology, the integration of offline processing and real-time processing, as well as data storage and analysis, has provided tremendous potential for data services and applications by breaking through performance bottlenecks in big data systems.
Correspondingly, as awareness of data intelligence services spreads, vendors must integrate data analysis and machine learning services seamlessly, offering users without an AI algorithm background more intelligent and easier-to-use products with capabilities such as:
(1) Generality: Machine learning models can be directly inferred through SQL;
(2) Usability: Provides simple tools to enable businesses to use existing data for machine learning model training;
(3) Transparency: Visualize data preparation with low-code for data cleaning and transformation;
(4) Intelligent Ops: AIOPS capabilities are applied to the daily ops of data platforms.
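Point (1) above, inference through SQL, can be sketched with SQLite's user-defined functions: a scoring function is registered under a SQL name so analysts can call it from plain queries. The "model" here is a hypothetical hand-written rule standing in for a real trained model, and `PREDICT_CHURN` is an invented name, not any vendor's function.

```python
import sqlite3

# Sketch of "inference through SQL": a scoring function is exposed under a
# SQL name. The scoring rule is a made-up stand-in for a trained model.
def churn_score(logins_last_month, tickets_opened):
    risk = 0.05 * tickets_opened + max(0.0, 1.0 - logins_last_month / 20.0)
    return round(min(1.0, risk), 2)

conn = sqlite3.connect(":memory:")
conn.create_function("PREDICT_CHURN", 2, churn_score)  # register as SQL function
conn.execute("CREATE TABLE customers (name TEXT, logins INTEGER, tickets INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [("alice", 18, 0), ("bob", 2, 4)])

# Analysts now call the model from ordinary SQL, no ML tooling required.
for name, score in conn.execute(
        "SELECT name, PREDICT_CHURN(logins, tickets) FROM customers"):
    print(name, score)  # alice 0.1 / bob 1.0
```

Production systems (e.g. warehouse-native ML features) push the model into the query engine itself, but the user-facing contract is the same: prediction becomes just another SQL expression.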
When the machine learning platform is deeply integrated with the big data platform, the combined platform's data processing speed and level of automation improve by a generation. According to relevant papers, achieving the integration of machine learning and big data requires the following:
(1) Isolation mechanism: There is no mutual interference between artificial intelligence and big data;
(2) Seamless code integration: enables the big data platform to support native machine learning code;
(3) Fusion Framework: In the data processing layer, empowerment layer, and application layer, a data fusion engine is introduced to deeply integrate the data processing layer and empowerment layer;
To achieve an improvement in machine learning production efficiency, the following requirements need to be met:
(1) Full lifecycle platformization: Covers end-to-end capabilities from data preparation, model construction, model development to model production;
(2) Pre-built machine learning algorithms and frameworks: allowing users to call them directly without having to build them themselves;
(3) Rapid resource startup: underlying resources are used as needed without preconfiguration, utilizing a unified computing cluster.

The fully serverless lakehouse integrated architecture means that data storage, data query engine, data warehouse, data processing framework, and data catalog products all support serverless deployment.
Serverless deployment provides services through FaaS + BaaS, allowing users to develop, run, and manage applications without building or maintaining complex infrastructure. After the lakehouse is integrated with Serverless, it will have two advantages:
- Streamlined usage process: a serverless-deployed lakehouse architecture gives users an easier experience, and the fully managed, operations-free approach lets them focus on business rather than technical logic, in line with the cloud-native philosophy.
- Flexible cost optimization: serverless deployment offers pay-as-you-go billing, so enterprises pay only for resources actually consumed rather than for idle capacity, achieving more efficient resource utilization. It is especially cost-effective for enterprises whose usage varies strongly over time.
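A back-of-envelope model shows why bursty, time-dependent workloads favor pay-as-you-go. All rates below are assumed illustrative figures, not any vendor's pricing.

```python
# Illustrative cost comparison: always-on provisioned cluster vs.
# pay-as-you-go serverless, for a workload that runs ~2 hours/day.
HOURS_PER_MONTH = 730

provisioned_rate = 4.00   # $/hour for an always-on cluster (assumed)
serverless_rate = 6.00    # $/hour of compute actually consumed (assumed)
busy_hours = 60           # hours/month the workload actually runs

provisioned_cost = provisioned_rate * HOURS_PER_MONTH   # billed 24/7
serverless_cost = serverless_rate * busy_hours          # billed only when running

print(provisioned_cost, serverless_cost)  # 2920.0 360.0
```

Even at a higher hourly rate, serverless wins here because utilization is low; a workload busy around the clock would flip the comparison, which is why the text hedges with "time-dependent usage patterns".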
Serverless deployment has become a capability that leading vendors compete on across their lakehouse product lines, the better to support user needs:
(1) Amazon Web Services enables fully serverless deployment of the lakehouse by combining the Serverless capabilities of Redshift, EMR, MSK, Glue, Athena, and Amazon Lake Formation;
(2) Huawei Cloud combines CloudStack, DLI Serverless, FusionInsight MRS, and DWS to deliver a serverless-deployed big data system;
(3) Alibaba Cloud's DLA, together with MaxCompute, builds a cloud-native + Serverless architecture integrating database and big data through its core components Lakehouse, Serverless Spark, and Serverless SQL;
(4) Other Serverless lakehouse products include Databricks Serverless SQL, Azure Synapse Analytics Serverless, and Mobile Cloud Native Lakehouse.
Data management solution providers need to focus on user experience and continuously deepen product technology in dimensions such as data warehouses, data lakes, lakehouse solutions, and IaaS-related services.
Against the backdrop of market users demanding greater flexibility from data warehouses and greater growth potential from data lakes, the lakehouse concept is a consensus among industry vendors and users about future big data architectures.
Despite clear advantages at the conceptual level, lake-warehouse integration still faces numerous problems in actual production owing to immature technology and services. Potential users remain cautious, concerned about user experience and stability, or unconvinced of the value of replacing mature, stable existing systems.
Manufacturers need to focus on user experience, continuously delving into product technology from multiple dimensions.

The Chinese data management solution market is in a stage of steady growth, with competing entities being divided into tiers based on their performance in terms of innovation and growth capabilities.
This report measures the competitive strength of outstanding manufacturers in the industry through two main dimensions: market growth index and innovation index.
The innovation index measures the competitiveness of competing entities in innovative technologies and capabilities, including data storage, data preparation, machine learning analysis support, integrated lakehouse architecture, and multi-dimensional, multi-framework data analysis; the growth index measures the competitiveness of their data management solutions in terms of growth, including compatibility, query and computing performance, disaster recovery and security, service support, industrial chain ecosystem, and data service scenario solutions. The further to the right a vendor is positioned, the stronger the market growth capability and maturity of its data management solution.

Frost & Sullivan, in collaboration with LeadLeo Research Institute, has conducted a multi-factor hierarchical assessment of the competitiveness of China's data management solution market based on two major evaluation dimensions: growth index and innovation index. This assessment is supported by eleven key indicators including data storage, data preparation, data analysis support, data analysis, process orchestration management, compatibility, performance, disaster recovery construction, service support, open-source community and industrial chain ecosystem, as well as data service scenario solutions.
Based on the comprehensive scores of the 'Innovation Index' and 'Growth Index', Amazon Web Services, Huawei Cloud, Alibaba Cloud, Kingsoft Cloud, Transwarp Technology, and Inspur Cloud are positioned in the leadership tier of the Chinese data management solution market.
Amazon Web Services: Amazon Web Services (AWS) has upgraded its Intelligent Lakehouse architecture, breaking data silos through Amazon Athena and Amazon Lake Formation to build a unified data governance foundation in the cloud. Amazon SageMaker's full-process machine learning components help transform machine learning from experimentation into practice, empowering business personnel to explore agile business innovation. AWS provides products and services based on global commercial practices through its professional and in-depth technical support service experience, offering mature solutions for various data service scenarios for customers across industries.
Huawei Cloud: Huawei Cloud FusionInsight MRS Intelligent Data Lake integrates with the AI development platform ModelArts for digital intelligence. It achieves lakehouse collaboration through the HetuEngine one-stop interactive SQL analysis engine, providing a data architecture that supports rich business scenarios with offline, real-time, and logical data lakes over one copy of the data. Huawei Cloud leads open-source efforts in the big data field, adhering to openness. It collaborates with over 1,000 industry application ecosystem partners to build implementation scenario solutions covering finance, telecom operators, internet, public administration, and other fields.
Alibaba Cloud: Alibaba Cloud MaxCompute adapts to various data lake and warehouse cases, constructing best practice for lake-warehouse integration. It provides unified development and management of data through a DB-level metadata perspective, integrates seamlessly with the machine learning platform PAI, and offers super-large-scale machine learning processing capabilities. At the same time, MaxCompute is deeply integrated with Hologres, providing customers with an offline-plus-real-time integrated massive cloud data warehouse architecture. Combined with open development and deep integration with partner ecosystem products, it provides a multi-dimensional product portfolio for the big data scenarios of multi-industry users.
Kingsoft Cloud: Kingsoft Cloud's unified metadata service LMS, part of its unified cloud-native data engine KCDE, supports building logical data lakes for real-time, offline, and analytical use cases. The big data development and governance platform KDC is integrated with the machine learning platform KingAI, providing one-stop data mining services on a unified data foundation. Kingsoft Cloud builds full-domain cloud-native capabilities through a diversified product matrix, widely covering big data cloud platform solutions in the finance, internet of everything, healthcare, and public service industries.
Inspur Cloud: Inspur Cloud Big Data Storage and Analysis (IEMR) provides multi-lake and multi-warehouse associated computing capabilities. It builds IDLF on the data lake to provide lake-warehouse collaborative data invocation, is deeply adapted to the machine learning platform IMLP, and offers more than 200 preset models and over 100 ready-to-use industry models. Inspur Cloud IEMR has a high level of disaster recovery and security construction. The IBP data product line can deliver personalized product forms according to business scenarios, with rich scenario solutions and implementation experience for telecommunications, healthcare, finance, government affairs, and other large state-owned enterprise sectors.
Transwarp Technology: Transwarp's TDH big data platform builds a lakehouse-integrated solution around the unified SQL compiler Transwarp Quark and the unified distributed computing engine Transwarp Nucleon, moving beyond the traditional Hadoop + MPP hybrid architecture to deliver batch-stream collaboration and multi-model integration. Transwarp provides componentized technical services and highly decoupled, mature products for all big data process tasks, with implementation cases covering finance, government affairs, transportation, telecom operators, postal services, healthcare, and energy.

