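Frost & Sullivan Officially Releases the "2026 AI Infrastructure Management Platform White Paper" | In the Multi-Chip Era, Moving from Fragmented Computing Power to System-Level Orchestration (Full-Text Access Details Included)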

AI infrastructure is moving from "single-chip performance competition" to a new stage of "system-level coordinated orchestration". With breakthroughs in foundation models, the accelerating rollout of enterprise AI applications, and the rise of agent workflows, demand for AI computing power is shifting from training-driven to inference-driven. Inference workloads run continuously online, at high concurrency, under real-time constraints, and across nodes. A GPU cluster can therefore no longer be treated as a stack of hardware resources; it requires unified scheduling and operations driven by resource utilization, latency stability, model execution consistency, and SLA compliance.

At the same time, Chinese AI infrastructure is becoming markedly more heterogeneous. AI infrastructure management platforms, represented by Phancy, must be deployed in environments where GPUs and AI accelerators from NVIDIA, Ascend, Cambricon, Hygon, and Iluvatar CoreX run side by side. This diversifies the supply of computing power, but it also introduces complex differences in chip architecture, runtime environment, compiler, operator adaptation, and deployment process. The coexistence of multiple chips has become a structural reality of China's AI ecosystem, and enterprises face fragmented resource pools, repeated model adaptation, inefficient scheduling, fluctuating service performance, and rising operation and maintenance costs.

In June 2026, Frost & Sullivan officially released the "2026 AI Infrastructure Management Platform White Paper" (hereinafter the "White Paper"). The White Paper systematically examines the shift of Chinese AI infrastructure from fragmentation to orchestration, heterogeneous GPU resource management, the vGPU control plane, multi-model management workstations, the competitive landscape, and industry practice cases. It aims to present the development logic, core capabilities, and implementation value of AI infrastructure management platforms, and to serve as a reference for industry participants and stakeholders.

The following is an excerpt from the White Paper; for detailed content, please scan the QR code for the full version:

Through systematic analysis, the White Paper traces the core logic of the AI infrastructure upgrade against a backdrop of "multiple chips coexisting, inference at scale, and model ecosystems": the AI computing power bottleneck is shifting from "whether one has GPUs" to "whether one can operate heterogeneous computing power and model services efficiently, stably, and measurably". The vGPU control plane and the multi-model management workstation complement each other, the former managing heterogeneous computing power resources and the latter unifying model execution management, so that enterprises can move from scattered computing power islands to a unified, orchestrable, and reliable AI infrastructure system.

01 Industry Background: Inference explosion drives AI infrastructure from chip performance to cluster-level coordination

AI workloads are rapidly outgrowing what single-chip efficiency gains can cover. As foundation models become more capable, enterprise AI applications spread, and agent scenarios expand, inference has become the main source of AI computing power consumption. The White Paper indicates that reasoning-oriented models account for a growing share of real business workloads, and that user interactions with AI systems are evolving toward more complex, longer-chain, and higher-frequency inference tasks.

Source: Frost & Sullivan analysis

Thus, the expansion logic of AI infrastructure is changing. In the past, the industry focused on single-card performance, single-cluster training capability, and hardware procurement scale; in the future, with continuous online inference services, rapid growth in concurrent requests, and cross-node distribution of model deployment, system-level scheduling, elastic resource reuse, service isolation, and end-to-end SLA compliance will be key to the large-scale implementation of AI infrastructure. AI infrastructure is moving from "hardware management" to "computing power economics" and "business outcome delivery".

02 Industry Status: With multiple chips coexisting, fragmentation, low utilization, and unstable SLAs are the core problems

As Chinese AI infrastructure expands rapidly, a pattern of parallel deployment of international GPUs and domestic AI accelerators is emerging. On one hand, the NVIDIA ecosystem has a strong standardized foundation in software stacks such as CUDA, cuDNN, TensorRT, NCCL, and vLLM; on the other hand, domestic accelerator manufacturers often have their own runtimes, compilers, operator adaptation frameworks, and deployment paths, which significantly increases the complexity of cross-platform deployment.

In the heterogeneous computing environment, the industry needs not only more hardware but also a unified control plane that can convert raw GPU capabilities into AI infrastructure capable of meeting SLA requirements. The White Paper states that the core value of vGPU lies in standardizing, scheduling, optimizing, and stabilizing processes, helping enterprises abstract GPU resources from different manufacturers, architectures, and locations into unified, assignable, and orchestrable computing units, thus laying the foundation for subsequent resource pooling, fine segmentation, intelligent scheduling, and deterministic execution.

In the heterogeneous computing environment, AI infrastructure is evolving from a model where hardware resources are directly managed to a layered architecture centered on a unified control plane. The SDC Orchestration Layer is responsible for global scheduling and SLA compliance for workloads, while the Resource Abstraction Layer decouples and standardizes heterogeneous computing resources, thereby masking differences between different chips and clusters, allowing the system to dynamically allocate resources and perform elastic scheduling based on business workload needs, ultimately achieving unified orchestration and stable delivery across heterogeneous environments.

Source: Frost & Sullivan analysis
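
To make the idea of a resource abstraction layer more concrete, the following is a minimal Python sketch, not the White Paper's or any vendor's implementation. It assumes a made-up inventory in which each vendor reports memory under a different key and unit, and maps every device onto one vendor-neutral record. The class `ComputeUnit`, the field names, the vendor labels, and all figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputeUnit:
    """Vendor-neutral view of one accelerator (hypothetical schema)."""
    unit_id: str
    vendor: str
    memory_gib: int
    cluster: str


def abstract_devices(raw_devices):
    """Map vendor-specific device records onto ComputeUnit objects.

    In this invented inventory some records report memory as "mem_mib"
    and others as "hbm_gib"; the abstraction layer hides that difference.
    """
    units = []
    for dev in raw_devices:
        if "mem_mib" in dev:                 # record reports memory in MiB
            memory_gib = dev["mem_mib"] // 1024
        else:                                # record reports memory in GiB
            memory_gib = dev["hbm_gib"]
        units.append(
            ComputeUnit(dev["id"], dev["vendor"], memory_gib, dev["cluster"])
        )
    return units


def pool_by_cluster(units):
    """Total schedulable memory per cluster, regardless of vendor."""
    pool = {}
    for unit in units:
        pool[unit.cluster] = pool.get(unit.cluster, 0) + unit.memory_gib
    return pool


# Illustrative inventory; none of these values come from the White Paper.
inventory = [
    {"id": "a-0", "vendor": "vendor-a", "mem_mib": 81920, "cluster": "east"},
    {"id": "b-0", "vendor": "vendor-b", "hbm_gib": 64, "cluster": "east"},
    {"id": "b-1", "vendor": "vendor-b", "hbm_gib": 64, "cluster": "west"},
]
units = abstract_devices(inventory)
pool = pool_by_cluster(units)
```

Once every device is expressed in the same record type, an orchestration layer can reason about a single pool per cluster instead of one pool per chip vendor.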

03 vGPU solution: Building a unified control plane for heterogeneous computing power to address current pain points

GPUs have become core resources of AI infrastructure, but having a GPU does not mean having available computing power. Different workloads such as training, fine-tuning, inference, and development testing differ in duration, concurrency mode, latency requirements, memory usage, and topology dependencies. Static, whole-card, and pool-based resource management methods easily lead to resource idleness, fragmentation, and insufficient delivery efficiency, making it difficult to support production-level AI services.

Source: Frost & Sullivan analysis

Rise vGPU transforms GPUs and AI accelerators from different manufacturers, architectures, clusters, and locations into standardized, assignable computing units through four layers: abstraction, orchestration, optimization, and deterministic execution. At the abstraction layer, vGPU divides the computing power and memory of physical GPUs into finer-grained resources, enabling inference, development testing, and lightweight tasks to share the same physical card as needed. At the orchestration layer, the platform schedules tasks based on business priorities, topology relationships, real-time load, and resource compatibility, improving task placement efficiency and service predictability.

Source: Frost & Sullivan analysis
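
The abstraction and orchestration layers described above can be illustrated with a small Python sketch. It is an assumption-laden toy, not Rise vGPU's scheduler: each physical card exposes memory and a compute share that can be split among tasks, and tasks are placed highest priority first using a best-fit rule on free memory. Topology, real-time load, and chip compatibility, which the White Paper also lists as scheduling inputs, are deliberately left out. All names and numbers are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PhysicalGPU:
    name: str
    memory_gib: int
    compute_pct: int = 100                       # schedulable compute share
    free_memory_gib: int = field(init=False)
    free_compute_pct: int = field(init=False)

    def __post_init__(self):
        self.free_memory_gib = self.memory_gib
        self.free_compute_pct = self.compute_pct


@dataclass
class Task:
    name: str
    priority: int        # larger value = more important
    memory_gib: int
    compute_pct: int


def place(tasks, gpus):
    """Greedy placement: highest priority first, best fit on free memory."""
    placement = {}
    for task in sorted(tasks, key=lambda t: -t.priority):
        candidates = [
            g for g in gpus
            if g.free_memory_gib >= task.memory_gib
            and g.free_compute_pct >= task.compute_pct
        ]
        if not candidates:
            placement[task.name] = None          # a real system would queue it
            continue
        # Best fit: pick the card that is left with the least spare memory.
        best = min(candidates, key=lambda g: g.free_memory_gib - task.memory_gib)
        best.free_memory_gib -= task.memory_gib
        best.free_compute_pct -= task.compute_pct
        placement[task.name] = best.name
    return placement


gpus = [PhysicalGPU("gpu-0", 80), PhysicalGPU("gpu-1", 40)]
tasks = [
    Task("dev-notebook", priority=1, memory_gib=10, compute_pct=20),
    Task("online-inference", priority=9, memory_gib=30, compute_pct=50),
    Task("batch-eval", priority=5, memory_gib=30, compute_pct=50),
]
placement = place(tasks, gpus)
```

In this run the inference service and the development notebook end up sharing `gpu-1`, while the batch job takes a slice of `gpu-0`, which is the kind of card sharing that whole-card allocation cannot express.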

At the optimization layer, Rise vGPU converts statically reserved but idle capacity into operable, measurable effective computing power through fine-grained partitioning, oversubscription, time- and space-sharing, dynamic reclamation, and elastic reuse. At the deterministic execution layer, the platform ensures stable computing power for critical tasks through resource isolation, contention control, tenant isolation, and SLA-oriented execution, reducing the impact of "noisy neighbors". Compared with the coarse-grained GPU device scheduling of native Kubernetes, Rise vGPU fills in the missing resource control layer for heterogeneous accelerators, allowing GPUs to be pooled, partitioned, scheduled, isolated, measured, and managed.
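
One way to picture how reserved but idle capacity can be lent out and taken back without letting a noisy neighbor squeeze a critical task is the following Python sketch. It is a simplified assumption, not the product's algorithm: guaranteed tenants are served first, up to their reservation, and whatever is left is granted to best-effort tenants. The function `allocate`, the tenant names, and the percentages are hypothetical.

```python
def allocate(capacity, guaranteed, best_effort):
    """Split one card's compute share (in percent) among tenants.

    guaranteed:  dict of name -> (reserved share, current demand)
    best_effort: dict of name -> current demand
    """
    grants = {}
    remaining = capacity
    # Guaranteed tenants first: never more than their reservation or demand.
    for name, (reserved, demand) in guaranteed.items():
        grant = min(reserved, demand, remaining)
        grants[name] = grant
        remaining -= grant
    # Reserved capacity that is currently idle is lent to best-effort tenants.
    for name, demand in best_effort.items():
        grant = min(demand, remaining)
        grants[name] = grant
        remaining -= grant
    return grants


# Quiet period: the critical service uses only 20 of its reserved 60 percent.
quiet = allocate(100, {"payments-llm": (60, 20)}, {"batch-embed": 70})

# Busy period: the critical service needs its full reservation back.
busy = allocate(100, {"payments-llm": (60, 60)}, {"batch-embed": 70})
```

Re-running the allocation as demand changes is what reclaims the borrowed share: the best-effort job gets its full 70 percent in the quiet period and is cut back to 40 percent once the guaranteed tenant ramps up.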

04 Multi-model management workstation solution: Supporting multimodal and large-scale model management for future model ecosystems

Computing power orchestration alone cannot solve all issues in enterprise AI implementation. As enterprises adopt foundational models, fine-tuned models, industry models, self-developed models, and third-party models simultaneously, AI systems are moving from single-model deployment to large-scale model ecosystems. Different models have differences in inference frameworks, runtime environments, memory usage, latency requirements, throughput demands, and business scenarios, and model execution consistency is becoming an important capability of AI infrastructure.

The multi-model management workstation provides unified execution and orchestration capabilities for multiple models, runtimes, and chips, covering model integration, service registration, runtime adaptation, Prompt templates, fine-tuning, RAG integration, function calls, data management, monitoring and governance, and lifecycle management. Its value lies in integrating scattered models, runtimes, and heterogeneous computing power resources into a unified management system, reducing repeated adaptation costs, improving model deployment efficiency, execution stability, and cross-environment deployment consistency.

Source: Frost & Sullivan analysis
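
As an illustration of how service registration and runtime adaptation might fit together, the following Python sketch keeps a registry of model versions and the runtimes each one supports, then resolves which chips a model can be deployed on. This is a hypothetical data model, not ModelHub's API; `ModelRegistry`, `RUNTIME_CHIPS`, and all model, runtime, and chip names are placeholders.

```python
# Hypothetical compatibility table: which chips each runtime can target.
RUNTIME_CHIPS = {
    "runtime-x": {"chip-a"},
    "runtime-y": {"chip-a", "chip-b"},
}


class ModelRegistry:
    """Tracks model versions and the runtimes each version can run on."""

    def __init__(self):
        self._models = {}

    def register(self, name, version, runtimes):
        self._models[(name, version)] = set(runtimes)

    def deployable_chips(self, name, version):
        """Union of the chips reachable through any supported runtime."""
        chips = set()
        for runtime in self._models[(name, version)]:
            chips |= RUNTIME_CHIPS.get(runtime, set())
        return sorted(chips)


registry = ModelRegistry()
registry.register("risk-scoring", "1.2", ["runtime-x"])
registry.register("chat-assistant", "0.9", ["runtime-x", "runtime-y"])

risk_targets = registry.deployable_chips("risk-scoring", "1.2")
chat_targets = registry.deployable_chips("chat-assistant", "0.9")
```

Keeping this mapping in one place is what lets a platform answer "where can this model run?" once, rather than re-adapting each model for each chip by hand.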

The White Paper evaluates model execution and management capabilities along five dimensions: model and chip compatibility, execution stability and performance, lifecycle management and deployment efficiency, model-GPU coordination ability, and ecosystem and service capabilities. In the overall ranking, Phancy is placed in the leading tier, standing out in execution stability and performance, model-GPU coordination, and heterogeneous compatibility. Combined, the multi-model management workstation and the vGPU control plane form an end-to-end AI infrastructure orchestration capability that runs from "heterogeneous computing power resource control" to "unified model execution delivery".

Source: Frost & Sullivan analysis

Industry practice further validates this approach. In one case, at a large state-owned commercial bank, Phancy's multi-model management workstation ModelHub and vGPU resource orchestration helped the client build a unified AI infrastructure platform, achieving large-scale model management, unified heterogeneous resource scheduling, and enterprise-level governance. The platform manages over 25,000 models, more than 3,000 production AI services, and over 50 LLM services, and covers 300+ GPU server nodes and 20+ heterogeneous platforms. Reported results include a 70% increase in GPU resource efficiency, a 60% increase in AI deployment efficiency, a 30% increase in inference stability, a 40% reduction in O&M costs, and a more than 4-fold increase in GPU utilization.

Source: Frost & Sullivan analysis

05 Summary

Looking ahead, the focus of AI infrastructure competition will shift from hardware ownership to system-level operational capabilities. Multiple chips coexisting will persist long-term, inference workloads will continue to grow, and enterprise model assets will expand rapidly. Platforms with unified resource pools, fine segmentation, intelligent scheduling, SLA compliance, model execution consistency, and full lifecycle governance will be key for enterprises to unlock the value of AI computing power, reduce infrastructure costs, and support large-scale AI innovation. The AI infrastructure orchestration capabilities represented by the vGPU control plane and multi-model management workstation are driving the industry from "fragmented computing power management" to "unified, elastic, and measurable AI service delivery".
