
2026-06-14T18:30:00.000Z
Jun 15, 2026 Blog

Hyperscalers spent the first AI cycle arguing about training. Amazon committed USD 83 billion in capital expenditure in 2024, primarily toward AI-focused data centres, and competitors matched that logic: Microsoft tracking toward USD 120 billion in 2026 capex, Alphabet projecting USD 175 to 185 billion for the same year. The training narrative justified all of it. Build the model, win the market.
That logic is now wrong in the way that matters most: allocation. Inference now accounts for an estimated 60 to 70 percent of total AI compute demand across major hyperscalers, up from roughly 40 percent in 2024. Every deployed application, every autonomous agent, every AI-generated response runs on inference hardware. Training is a one-time cost; inference is a recurring one that compounds with every user interaction. Kaiso Research's primary dataset puts the AI Inference Hardware Market valuation at USD 43.78 billion in 2025 and projects it to reach USD 410.35 billion by 2035. The more interesting number is the implied run-rate math: at that trajectory, inference hardware procurement becomes the single largest capital line item in enterprise technology by the early 2030s.
The question isn't whether to buy inference infrastructure anymore. It's which hardware stack, deployed where, governed by which regulatory framework.
The economics shifted when model deployment moved from novelty to operational dependency. Training a frontier model once costs tens of millions of dollars. Serving that same model to a million users daily means running inference workloads continuously, and at the token volumes generated by enterprise AI deployments, that ongoing cost dwarfs the training bill within twelve to eighteen months of production launch.
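A back-of-the-envelope sketch makes the break-even claim concrete. Every input below is a hypothetical assumption chosen for illustration (a USD 50 million training run, 20,000 tokens per user per day, USD 5 per million tokens served); none of these figures comes from Kaiso Research or from any vendor price list.

```python
import math


def monthly_inference_cost(daily_users: int,
                           tokens_per_user_per_day: int,
                           usd_per_million_tokens: float,
                           days_per_month: int = 30) -> float:
    """Recurring monthly serving cost in USD."""
    daily_tokens = daily_users * tokens_per_user_per_day
    daily_cost = daily_tokens / 1_000_000 * usd_per_million_tokens
    return daily_cost * days_per_month


def months_to_reach_training_cost(training_cost_usd: float,
                                  monthly_cost_usd: float) -> int:
    """First whole month in which cumulative inference spend reaches or passes
    the one-time training cost."""
    return math.ceil(training_cost_usd / monthly_cost_usd)


# Hypothetical inputs, labeled as assumptions in the lead-in above.
monthly = monthly_inference_cost(daily_users=1_000_000,
                                 tokens_per_user_per_day=20_000,
                                 usd_per_million_tokens=5.0)
print(f"Monthly inference cost: USD {monthly:,.0f}")
print("Months until inference spend passes training spend:",
      months_to_reach_training_cost(50_000_000, monthly))
```

With these assumed inputs the crossover lands in month 17, inside the twelve-to-eighteen-month window described above. Changing the token volume or the unit price moves it quickly in either direction, which is the point: the recurring term is the one to model.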
Kaiso Research's primary market data shows that AI agents performing multiple inference cycles per task are dramatically increasing enterprise compute demand beyond what single-query AI systems required. An AI agent that autonomously completes a multi-step workflow does not consume one inference call; it may execute dozens of model calls per task, each requiring hardware capacity. Scale that across enterprises running thousands of concurrent agents, and the demand curve becomes non-linear in a way that single-query chatbot deployments never produced.
The second structural force is disaggregation. Cloud AI inference commands the largest deployment share, led by AWS, Microsoft Azure, Google Cloud, and Oracle Cloud, but enterprise demand is no longer purely cloud-routed. On-premises, hybrid, and edge inference deployments are all growing, each driven by distinct economics: latency requirements, data sovereignty, compliance constraints, and cost optimization at scale. A market previously unified under the heading "GPU cloud" has fractured into four distinct procurement categories, each with its own hardware requirements and vendor economics.
North America holds the largest regional share in 2025. That will not hold at the same relative weight through 2035. Sovereign AI programs in the Gulf, government-mandated compute infrastructure in the EU, and manufacturing-driven edge deployments across Southeast Asia will redistribute market share in ways current procurement patterns do not yet reflect.
NVIDIA Blackwell GPUs currently dominate enterprise and cloud deployment, and the financial case is documented. By NVIDIA's own figures, a USD 5 million investment in a GB200 NVL72 system can generate USD 75 million in DeepSeek R1 token revenue, a 15x return that makes alternative hardware arguments difficult to sustain at the account level. Blackwell Ultra and NVIDIA Dynamo, introduced in March 2025, are specifically engineered for AI reasoning model inference; Dynamo disaggregates the prefill and decode phases across different GPUs to maximize utilization at factory scale. The NVIDIA Colossus supercomputer, built for xAI in 122 days with over 200,000 NVIDIA GPUs, is the most visible proof point for what full-stack NVIDIA deployment looks like in production.
The first crack is AMD. The MI300X series, deployed at Microsoft Azure for GPT-3.5 and GPT-4 inference and broadly deployed at Meta for Llama 3 and Llama 4, has demonstrated that NVIDIA does not hold a monopoly on production-grade inference. Seven of the ten largest model builders now run production workloads on AMD Instinct accelerators, including Meta, OpenAI, Microsoft, and xAI. The MI355X delivers 40 percent better tokens-per-dollar than NVIDIA's B200 on Llama 3.1 405B inference in FP4. AMD's MI400 series, with HBM4 memory technology, is positioned to capture a materially larger slice of the inference workload market by Q4 2026. The ROCm software gap is real, but it is measured in months of effort now, not years.
The second crack is vertical integration. Google's TPU v7 Ironwood, introduced at Google Cloud Next 2025, is the first Google accelerator purpose-built for inference rather than training. Each chip delivers 4,614 TFLOP/s of FP8 compute, 192 GB of HBM3e, and 7.3 TB/s of memory bandwidth, on a 5-nanometer process. Scaled to a 9,216-chip pod via optical circuit switching, the system achieves 42.5 Exaflops. Google now uses Ironwood to serve its own Gemini models across YouTube, Search, and Gmail, keeping those inference economics off NVIDIA's revenue line entirely. Amazon's custom silicon follows the same logic: in Q1 2026, Amazon's chip business crossed a USD 20 billion run rate with USD 225 billion in Trainium revenue commitments.
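The pod-level figure follows directly from the per-chip numbers quoted above, and the arithmetic is worth checking. The sketch below is peak-rate multiplication only; it says nothing about sustained throughput on a real workload, and the function names are illustrative.

```python
CHIP_FP8_TFLOPS = 4_614   # per-chip FP8 throughput quoted in the text
CHIPS_PER_POD = 9_216     # pod size quoted in the text
HBM_PER_CHIP_GB = 192     # per-chip HBM3e capacity quoted in the text


def pod_exaflops(chip_tflops: float, chips: int) -> float:
    """Aggregate peak compute in EFLOP/s (1 EFLOP/s = 1,000,000 TFLOP/s)."""
    return chip_tflops * chips / 1_000_000


def pod_hbm_petabytes(hbm_gb_per_chip: float, chips: int) -> float:
    """Aggregate HBM capacity in decimal petabytes (1 PB = 1,000,000 GB)."""
    return hbm_gb_per_chip * chips / 1_000_000


print(f"Peak pod compute: {pod_exaflops(CHIP_FP8_TFLOPS, CHIPS_PER_POD):.1f} EFLOP/s")
print(f"Pod HBM capacity: {pod_hbm_petabytes(HBM_PER_CHIP_GB, CHIPS_PER_POD):.2f} PB")
```

The product of 4,614 TFLOP/s and 9,216 chips rounds to the 42.5 Exaflops stated above, and the same pod carries roughly 1.77 PB of HBM in aggregate.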
The third crack is at the edge. Microsoft's Copilot+ program requires a minimum of 40 TOPS of on-device NPU performance, establishing a hardware floor that has forced every major PC silicon vendor to restructure its roadmap. Intel's Panther Lake delivers 50 TOPS via its NPU 5 architecture; Qualcomm's Snapdragon X Elite reaches 45 TOPS from its Hexagon NPU; AMD's Ryzen AI 400 series delivers comparable performance. AI PC shipments with dedicated NPU capability now account for nearly 40 percent of the global market, and the installed base is projected to exceed 100 million units by 2027. Inference that was previously routed through cloud GPUs is shifting to client devices, reducing cloud compute demand for latency-sensitive use cases and creating a category of inference hardware that NVIDIA's data centre architecture was not designed to serve.
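For a fleet inventory, the 40 TOPS floor reduces to a simple filter. The sketch below uses the two NPU ratings quoted above plus one hypothetical older laptop added only to show a failing case. A real certification check involves more than a single TOPS number, so treat this as an illustration of the threshold logic, not a compliance tool.

```python
COPILOT_PLUS_MIN_TOPS = 40  # floor quoted in the text

# The first two ratings are the figures quoted in the text. The third entry is
# a hypothetical device, not a real product specification.
npu_tops = {
    "Intel Panther Lake (NPU 5)": 50,
    "Qualcomm Hexagon NPU": 45,
    "Hypothetical 2023 laptop": 11,
}


def meets_floor(tops: float, floor: float = COPILOT_PLUS_MIN_TOPS) -> bool:
    """True when the NPU rating is at or above the required floor."""
    return tops >= floor


def eligible_devices(ratings: dict[str, float]) -> list[str]:
    """Names of devices whose NPU rating clears the floor, in input order."""
    return [name for name, tops in ratings.items() if meets_floor(tops)]


print(eligible_devices(npu_tops))
```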
The market's compound annual growth rate (CAGR) is not a single-market phenomenon. It is the aggregate output of four concurrent structural shifts, each growing independently.
The first is agentic AI proliferation. Enterprises running an average of 12 AI agents today project that number to climb 67 percent within two years, according to Salesforce research. Each agent completing a multi-step workflow generates orders of magnitude more inference cycles than a single-query assistant. The inference hardware requirement for a 100-agent enterprise deployment has no precedent in the 2024 procurement playbook.
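The multiplication behind that claim is simple enough to sketch. The agent count (12) and growth rate (67 percent) are the figures cited above; the tasks-per-day and calls-per-task values are hypothetical assumptions, with 24 calls standing in for "dozens of model calls per task".

```python
def projected_agent_count(current_agents: int, growth_fraction: float) -> float:
    """Agent count after applying a fractional growth rate (0.67 means 67 percent)."""
    return current_agents * (1 + growth_fraction)


def daily_model_calls(agents: float,
                      tasks_per_agent_per_day: int,
                      calls_per_task: int) -> float:
    """Total model calls per day across an agent fleet."""
    return agents * tasks_per_agent_per_day * calls_per_task


TASKS_PER_AGENT_PER_DAY = 200   # hypothetical workload
CALLS_PER_AGENT_TASK = 24       # hypothetical stand-in for "dozens"
CALLS_PER_SINGLE_QUERY = 1      # a single-query assistant makes one call per task

future_agents = projected_agent_count(12, 0.67)
agentic = daily_model_calls(100, TASKS_PER_AGENT_PER_DAY, CALLS_PER_AGENT_TASK)
single = daily_model_calls(100, TASKS_PER_AGENT_PER_DAY, CALLS_PER_SINGLE_QUERY)

print(f"Projected average agent count: {future_agents:.1f}")
print(f"100-agent fleet, agentic workflow: {agentic:,.0f} calls/day")
print(f"100-agent fleet, single-query:     {single:,.0f} calls/day")
```

Under these assumptions the hardware requirement scales with the product of three terms (agents, tasks, calls per task), so growth in any one of them multiplies through the other two.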
The second is sovereign AI infrastructure. HUMAIN, an AI subsidiary of Saudi Arabia's Public Investment Fund, secured a framework to deploy up to 600,000 NVIDIA AI accelerators over three years, confirmed at the May 2025 Saudi-US Investment Forum. The Stargate UAE cluster in Abu Dhabi, announced in May 2025 alongside OpenAI, Oracle, SoftBank, and Cisco, represents a separate sovereign AI infrastructure commitment at scale. These are not cloud deployments mediated by commercial pricing. They are national infrastructure programs buying hardware at quantities that reshape supply chain timelines and pricing for years forward. Kaiso Research's coverage of this segment confirms sovereign AI infrastructure as a structured procurement category, not an opportunistic purchase.
The third is inference workload diversification. Application categories driving inference demand in 2026 are qualitatively different from 2024. Generative AI and recommendation engines have been joined by autonomous vehicles requiring real-time Level 4 sensor fusion, robotics and physical AI platforms, healthcare AI needing sub-millisecond diagnostic response, and financial AI running continuous market surveillance. No single hardware architecture optimally serves all of them, driving procurement across GPUs, NPUs, ASICs, and custom silicon simultaneously.
The fourth is hyperscaler capex acceleration. The five largest US cloud and AI companies are guiding toward USD 635 to 690 billion in combined 2026 capital expenditure, more than double 2024 levels. Approximately 75 percent of that is directed at AI infrastructure, including GPUs, high-bandwidth memory, networking, and data centres. At that allocation rate, the inference hardware market is being funded by capital commitments that eliminate demand uncertainty for the next three to five years.
The inference hardware market's most underappreciated constraint is memory. Not compute throughput. Memory.
Serving a 70-billion-parameter model requires fitting the entire model or a substantial shard of it in accelerator memory during inference. The KV cache for long-context conversations multiplies that requirement. NVIDIA's GB200 NVL72 addresses this with 130 TB/s NVLink bandwidth connecting 72 Blackwell GPUs as a unified memory fabric. Google's Ironwood addresses it with 192 GB of HBM3e per chip, a 6x increase over its Trillium predecessor. AMD's MI400 series targets it with HBM4 technology. The pattern is consistent: each successive generation of inference hardware is primarily competing on memory capacity and bandwidth, with raw compute throughput increasingly secondary.
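A rough sizing sketch shows why memory binds before compute does. The model shape below (80 layers, 8 key-value heads, head dimension 128, 16-bit weights and cache) is a hypothetical 70-billion-parameter configuration assumed for illustration, not the specification of any particular model. The estimate ignores activations, runtime overhead, and fragmentation, so real headroom is smaller than it suggests.

```python
def weight_gb(parameters: float, bytes_per_parameter: float) -> float:
    """Memory needed for model weights, in decimal GB."""
    return parameters * bytes_per_parameter / 1e9


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, batch_size: int,
                bytes_per_value: float = 2) -> float:
    """KV cache size in decimal GB. The leading 2 counts one key tensor and
    one value tensor per layer."""
    values = 2 * layers * kv_heads * head_dim * context_tokens * batch_size
    return values * bytes_per_value / 1e9


def fits_on_device(weights_gb: float, cache_gb: float, device_gb: float) -> bool:
    """True when weights plus KV cache fit within one accelerator's memory."""
    return weights_gb + cache_gb <= device_gb


# Hypothetical 70B-class model shape (assumption, see lead-in).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
CONTEXT_TOKENS = 131_072   # a 128k-token context window
DEVICE_GB = 192            # per-chip HBM capacity quoted in the text

weights = weight_gb(70e9, 2)
for batch in (1, 2):
    cache = kv_cache_gb(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT_TOKENS, batch)
    print(f"batch={batch}: weights {weights:.0f} GB + KV cache {cache:.1f} GB, "
          f"fits in {DEVICE_GB} GB: {fits_on_device(weights, cache, DEVICE_GB)}")
```

Under these assumptions a single long-context conversation fits on a 192 GB device and a second concurrent one does not. That is the kind of limit a throughput specification never reveals.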
This has direct implications for procurement strategy. Enterprises that bought inference infrastructure in 2024 on throughput specifications are now discovering that memory constraints are the binding limitation on model size and context length. The replacement cycle is compressing. Infrastructure that appeared over-provisioned on compute in 2024 is under-provisioned on memory in 2026 for models currently in production.
The second emerging trend is inference disaggregation. NVIDIA Dynamo separates the prefill phase from the decode phase, routing them to different hardware pools optimized for each. The same logic structures AWS's Trainium and Inferentia split. Procurement teams that treat inference as a single undifferentiated workload are leaving measurable cost reduction unrealized.
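The idea of phase-specific routing can be shown with a toy scheduler. This is a minimal sketch of the general pattern, assuming two separate worker pools; it is not NVIDIA Dynamo's API or implementation, and every class and method name here is hypothetical. Prefill is commonly described as compute-bound and decode as memory-bandwidth-bound, which is why the two phases benefit from different hardware.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    prompt_tokens: int
    max_new_tokens: int
    generated: int = 0


class DisaggregatedScheduler:
    """Toy scheduler that keeps prefill and decode work in separate pools."""

    def __init__(self) -> None:
        self.prefill_queue: deque[Request] = deque()
        self.decode_queue: deque[Request] = deque()
        self.finished: list[Request] = []

    def submit(self, request: Request) -> None:
        """New requests always start in the prefill pool."""
        self.prefill_queue.append(request)

    def step_prefill(self) -> None:
        """Process one prompt, then hand the request to the decode pool.
        A real system would also transfer the KV cache at this point."""
        if self.prefill_queue:
            self.decode_queue.append(self.prefill_queue.popleft())

    def step_decode(self) -> None:
        """Generate one token for every request currently in the decode pool."""
        still_running: deque[Request] = deque()
        for request in self.decode_queue:
            request.generated += 1
            if request.generated >= request.max_new_tokens:
                self.finished.append(request)
            else:
                still_running.append(request)
        self.decode_queue = still_running


scheduler = DisaggregatedScheduler()
scheduler.submit(Request("a", prompt_tokens=4_000, max_new_tokens=2))
scheduler.submit(Request("b", prompt_tokens=200, max_new_tokens=3))
scheduler.step_prefill()
scheduler.step_prefill()
for _ in range(3):
    scheduler.step_decode()
print([r.request_id for r in scheduler.finished])
```

Because the two queues are independent, each pool can be sized and provisioned for its own bottleneck, which is where the cost reduction comes from.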
The third trend is the edge inference inflection point. The 2025-to-2027 period marks the shift from edge AI inference as an experimental deployment to a standard enterprise requirement. Qualcomm's Snapdragon X Elite platform, Intel's Core Ultra 300 series, and AMD's Ryzen AI 400 series are shipping in volume at specifications that support local execution of language models with up to 13 billion parameters. The Microsoft Copilot+ certification program is setting procurement standards for enterprise device refreshes. For IT infrastructure leaders, inference hardware strategy now spans the full stack: cloud GPU clusters, on-premises accelerator servers, and client device NPU fleets, with utilization decisions driven by workload characteristics.
Not all inference hardware produces equivalent economics. Four architectural choices determine whether a deployment generates competitive returns.
Memory capacity per accelerator determines the maximum model size served without sharding. Sharding a model across devices increases latency, expands failure surface area, and complicates orchestration. Accelerators with 192 GB or more of HBM3e per chip, like Google's Ironwood, serve frontier models from a single device in configurations where NVIDIA's earlier H100 required multi-chip deployments.
Interconnect fabric determines at what scale a cluster behaves as a unified system. NVIDIA's NVLink at 130 TB/s in the NVL72 and Google's Inter-Chip Interconnect at 1.2 TB/s bidirectional solve the same problem: as inference workloads scale, inter-chip communication latency becomes the binding constraint on throughput. Optical circuit switching allows reconfigurable topology without the latency penalty of traditional electrical switching.
The decode phase of inference is fundamentally different from prefill in its hardware requirements, and systems that treat both identically waste resources on both. NVIDIA Dynamo's disaggregated serving is the most mature production implementation of phase-specific routing. Deployments running reasoning-heavy models on non-disaggregated infrastructure are typically operating at 30 to 40 percent below optimal efficiency.
Software depth in the CUDA developer toolchain remains NVIDIA's most durable competitive advantage. CUDA's dominance means that nearly every AI framework, every quantization tool, every optimization library assumes NVIDIA GPU execution by default. AMD's ROCm 7, launched in July 2025, has materially closed this gap for the top model families, but the long tail of enterprise workloads still encounters compatibility friction. Custom silicon players like Google's TPU and Amazon's Trainium run proprietary compilers that require dedicated porting effort for models not natively supported.
The inference hardware competitive landscape does not resolve to a single axis. Three matter: performance per inference token, total cost of ownership, and platform depth. NVIDIA, AMD, Google, Amazon, Qualcomm, and Intel each hold a strong position on at least one of them, and no vendor currently leads on all three.
On performance-per-inference-token: NVIDIA Blackwell sets the current benchmark. The GB200 NVL72's InferenceMAX v1 results are the highest published figures for production-grade inference on frontier models. That position faces direct challenge from NVIDIA's own Vera Rubin platform, announced with a 10x performance-per-watt improvement, and from AMD's MI400 series with HBM4 memory advantages on inference-heavy workloads.
On total cost of ownership across the deployment lifecycle: AMD's MI355X is already delivering 40 percent better tokens-per-dollar than competing configurations on specific workloads. For inference deployments where the primary optimization target is cost efficiency rather than peak throughput, AMD's current offering changes the procurement calculus. Microsoft and Meta are running production workloads on MI300X specifically for this reason.
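Tokens-per-dollar is a ratio of throughput to hourly cost, and the sketch below shows how a 40 percent gap can arise. Both accelerator profiles are hypothetical and were chosen only to reproduce a gap of that size; they are not measured figures for the MI355X, the B200, or any other product.

```python
def tokens_per_dollar(tokens_per_second: float, hourly_cost_usd: float) -> float:
    """Tokens served per USD of accelerator time."""
    return tokens_per_second * 3_600 / hourly_cost_usd


def relative_advantage(candidate: float, baseline: float) -> float:
    """Fractional improvement of candidate over baseline (0.40 means 40 percent)."""
    return candidate / baseline - 1


# Hypothetical profiles: (tokens per second, USD per accelerator-hour).
baseline = tokens_per_dollar(tokens_per_second=10_000, hourly_cost_usd=5.00)
candidate = tokens_per_dollar(tokens_per_second=10_500, hourly_cost_usd=3.75)

print(f"Baseline:  {baseline:,.0f} tokens per USD")
print(f"Candidate: {candidate:,.0f} tokens per USD")
print(f"Advantage: {relative_advantage(candidate, baseline):.0%}")
```

In this example most of the advantage comes from the lower hourly cost, not from higher throughput, which is why the metric favors a cost-optimized part even when peak performance is close.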
On platform depth and cloud lock-in: Hyperscaler custom silicon wins this axis by design. Google's TPU inference economics are unavailable to competitors because Ironwood is only accessible through Google Cloud. Amazon's Trainium is only accessible through AWS. These are not open hardware platforms; they are vertical integration strategies where inference hardware becomes a retention mechanism for cloud platform commitments. Enterprises running AI-intensive workloads on Google Cloud or AWS that have not modeled their inference cost trajectory on custom silicon versus NVIDIA GPU instances are likely misoptimizing a seven-figure annual line item.
The hyperscaler capex numbers dominate headlines. The more strategically interesting signal is where capital is flowing in the segments hyperscalers do not own. Hyperscaler guidance of USD 635 to 690 billion in combined 2026 capital expenditure, roughly 75 percent of it directed at AI infrastructure, creates a demand floor for inference hardware that sovereign buyers and specialized vendors are now competing to fill.
Groq announced a USD 1.5 billion Saudi investment for what it describes as the world's largest inference data center, to be deployed via Aramco Digital. Groq's Language Processing Unit (LPU) architecture is purpose-built for inference throughput at deterministic latency, a characteristic GPU architectures cannot match for latency-sensitive applications. A USD 1.5 billion commitment to a single inference-specialist vendor is not a hedge. It is a sovereign AI strategy that assumes specialized inference hardware will outperform general-purpose GPU infrastructure at national scale.
Cerebras, Graphcore, SambaNova, and Tenstorrent are each targeting segments of the inference market where GPU architectures are structurally inefficient: ultra-low-latency financial AI, industrial real-time control, and on-premises deployments where memory density per watt matters more than raw TFLOP/s. These companies are not competing with NVIDIA on NVIDIA's terms. They are identifying workload categories where the standard GPU architecture produces inferior economics.
The pattern that emerges from Kaiso Research's primary market data is that inference hardware investment is bifurcating: commodity GPU capacity for general-purpose workloads, and specialized accelerators for workloads where latency, energy efficiency, or memory density requirements exceed what commodity hardware can economically provide.
The EU AI Act's August 2, 2026 enforcement deadline for Annex III high-risk AI systems creates a compliance requirement that propagates directly to inference hardware procurement. The mandates that become enforceable on that date include continuous risk management, data governance with inference-time protections, tamper-evident logging retained for six months, and hardware-level cybersecurity resilience under Article 15.
NVIDIA's Confidential Computing capability addresses the Article 15 hardware resilience requirement directly. It is the only currently shipping GPU platform with hardware-level attestation for inference execution. For enterprises deploying high-risk AI systems, as defined by the EU AI Act, in EU markets, the absence of Confidential Computing support in the hardware stack is not a preference; it is a compliance gap with enforceable consequences from August 2026 onward.
The General-Purpose AI model provisions, which became enforceable in August 2025, require providers of large-scale AI models to implement technical documentation and transparency obligations. These requirements apply regardless of where inference hardware is located, creating a regulatory surface area that extends to non-EU organizations whose AI outputs are used in EU markets.
Enterprises that treat the EU AI Act as a software compliance problem are misreading it. The hardware layer carries its own obligations.
For cloud architects and infrastructure leads: The 2024 inference procurement playbook, which assumed renting GPU capacity from AWS, Azure, or Google Cloud was the only viable path, is no longer accurate for AI-intensive deployments above USD 2 to 3 million annually in inference costs. At that spend level, reserved capacity, on-premises ASICs, and custom silicon access typically produce 30 to 50 percent cost reductions against on-demand GPU pricing. The analysis belongs on the CFO's desk, not the engineering team's.
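The savings range above translates into a dollar band and a payback period that a finance team can sanity-check. The spend levels and reduction percentages are the figures quoted in the text; the USD 500,000 migration cost is a hypothetical assumption added to illustrate the payback calculation.

```python
def annual_savings_range(spend_low_usd: float, spend_high_usd: float,
                         reduction_low: float = 0.30,
                         reduction_high: float = 0.50) -> tuple[float, float]:
    """(minimum, maximum) annual savings in USD across the quoted ranges."""
    return spend_low_usd * reduction_low, spend_high_usd * reduction_high


def payback_months(migration_cost_usd: float, annual_savings_usd: float) -> float:
    """Months of savings needed to recover a one-time migration cost."""
    return migration_cost_usd / (annual_savings_usd / 12)


low, high = annual_savings_range(2_000_000, 3_000_000)
MIGRATION_COST_USD = 500_000  # hypothetical one-time cost

print(f"Annual savings band: USD {low:,.0f} to USD {high:,.0f}")
print(f"Payback at the low end:  {payback_months(MIGRATION_COST_USD, low):.1f} months")
print(f"Payback at the high end: {payback_months(MIGRATION_COST_USD, high):.1f} months")
```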
For enterprise technology buyers considering hardware refresh cycles: The NPU decision isn't a PC procurement call. It is an AI strategy decision expressed through device specification. Copilot+ certification, and the 40 TOPS floor that underpins it, determines which models run locally, which applications retain user data on-device, and which latency commitments are achievable without cloud dependencies. Enterprises standardizing on non-Copilot+ devices in 2025 and 2026 are building infrastructure gaps into their AI roadmaps that will require a second refresh cycle to correct.
The inference hardware market's growth trajectory is real, but three structural risks are understated in most analyses.
The first is power infrastructure. AI data center power demand is projected to reach 156 GW by 2030, requiring cumulative investment of approximately USD 5.2 trillion through the end of the decade. The limiting variable is not silicon; it is grid capacity. Microsoft's USD 80 billion unfulfilled Azure backlog is largely a function of power availability, not demand weakness. Enterprises that have modeled inference hardware capacity without modeling power availability are building on an unstable assumption.
The second is export control asymmetry. US Commerce Department controls on advanced GPU exports create differential access to inference hardware across geographies. Saudi Arabia's HUMAIN program required explicit export licenses for GB300 superchips; Microsoft secured its UAE GPU shipments under a separate license. For enterprises operating in markets with restricted hardware access, the inference hardware gap is a geopolitical access problem, not a vendor selection one.
The third is CUDA transition risk. Enterprises that have built inference pipelines assuming NVIDIA GPU execution throughout will encounter friction as AMD, Google TPU, and custom silicon enter their procurement mix. ROCm improvements in 2025 address a portion of this, but the porting effort for custom CUDA kernels, quantization pipelines, and inference optimization libraries is a real operational cost that never appears in hardware RFPs.
Kaiso Research's primary dataset projects USD 410.35 billion by 2035 at a 25.08% CAGR. Inference demand is projected to represent 70 to 80 percent of total AI compute demand by 2035, as model training concentrates among a smaller number of frontier labs while inference deployments proliferate across every industry vertical.
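The growth rate can be reproduced from the two endpoint valuations, which is a useful check on any forecast of this kind. The sketch below uses only the figures quoted in this article (USD 43.78 billion in 2025, USD 410.35 billion in 2035); the function names are illustrative.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.25 means 25 percent)."""
    return (end_value / start_value) ** (1 / years) - 1


def project(start_value: float, rate: float, years: int) -> float:
    """Value after compounding `rate` once per year for `years` years."""
    return start_value * (1 + rate) ** years


rate = cagr(43.78, 410.35, 10)
print(f"Implied CAGR 2025-2035: {rate:.2%}")

# Intermediate years along the same curve, in USD billions.
for year in (2028, 2030, 2032):
    print(year, round(project(43.78, rate, year - 2025), 1))
```

The endpoints and the stated 25.08 percent rate agree with each other, so the forecast is internally consistent. Whether the inputs hold is a separate question.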
The hardware architecture that will dominate 2035 inference does not yet exist at scale. NVIDIA's Vera Rubin platform promises 10x performance-per-watt over Blackwell. AMD's MI400 series targets memory-limited inference with HBM4. Optical interconnects within data centres, which Google is already deploying at Ironwood pod scale, will reduce multi-chip inference latency to levels that make 100,000-chip clusters behave as unified inference engines.
The procurement decisions made between 2026 and 2028 will determine which organizations reach 2030 with inference infrastructure that is competitive in cost and capability, and which are locked into architectures optimized for 2024 model generations running 2030 model sizes. The window for correct architecture selection is open. It will not stay open indefinitely.