Jun 09, 2026 Blog

Vision-Language-Action (VLA) Models: How Google DeepMind, NVIDIA, and Physical Intelligence Are Building a $40 Billion Industry

What the VLA Models Market Numbers Actually Mean: $3.89B Today, $40.50B by 2035

Three years ago, robots could not be retasked without months of reprogramming. Today, fine-tuning a base Vision-Language-Action model on 200 to 500 task-specific demonstrations consistently outperforms training a policy from scratch on more than 1,000 demonstrations, a result that rewrites the economic logic of industrial automation. Kaiso Research's primary dataset puts the global Vision-Language-Action (VLA) models market at USD 3.89 billion in 2025, projected to reach USD 40.50 billion by 2035 at a compound annual growth rate of 26.40% across the full forecast horizon. Software components lead revenue. Robotics applications dominate adoption. North America anchors the highest-value procurement, while Asia-Pacific posts the fastest volume growth on the back of sustained domestic AI investment.

This report synthesizes the architectural foundations, competitive dynamics, cross-sector deployment evidence, investment activity, regulatory developments, and strategic implications of the VLA market for executive decision-makers navigating automation investment in 2026 and beyond.

Introduction: The Boundary That No Longer Exists

For most of modern robotics history, perception, reasoning, and physical control were three separate engineering problems handled by three separate software stacks. A robot on a BMW assembly line could weld the same spot on the same car frame all day. Move the frame two centimeters, and it stopped working. The brittleness was not a bug. It was the fundamental architecture.

Vision-Language-Action models eliminate that architecture. They are single neural networks that take in visual observations from cameras and natural language instructions from operators and generate physical motor commands as outputs, all within one unified forward pass. A VLA-equipped robot arm told to "place the red gasket in the upper-left cavity" can execute that instruction against a novel part, in a lighting condition it has never seen, without explicit reprogramming. That capability, absent from commercial robotics as recently as 2023, is deployed in at least eleven enterprise production environments as of Q1 2026.

The transition from research artifact to production system happened faster than most procurement teams modeled. Understanding why, and where the trajectory leads, is the strategic task this report addresses.

A $40 Billion Market Built on Three Preconditions That Converged in 24 Months

Kaiso Research's primary dataset across the global robotics and infrastructure sector places the VLA Models Market at USD 3.89 billion in 2025, expanding to USD 40.50 billion by 2035 at a CAGR of 26.40% through the full forecast period. The base year reflects actual commercial revenue from VLA software platforms, model training infrastructure, inference APIs, edge deployment hardware, and associated professional services.
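The headline growth rate can be checked directly from the two endpoint figures. The short sketch below applies the standard compound annual growth rate formula to the values quoted above; the function names are illustrative and not part of any Kaiso Research tooling.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.264 means 26.4%)."""
    return (end_value / start_value) ** (1 / years) - 1


def project(start_value: float, rate: float, years: int) -> float:
    """Value after compounding `rate` once a year for `years` years."""
    return start_value * (1 + rate) ** years


# Figures quoted in the text: USD 3.89B in 2025, USD 40.50B in 2035.
implied_rate = cagr(3.89, 40.50, 2035 - 2025)
projected_2035 = project(3.89, 0.2640, 10)

print(f"Implied CAGR 2025-2035: {implied_rate:.2%}")
print(f"USD 3.89B compounded at 26.40% for 10 years: USD {projected_2035:.2f}B")
```

The two endpoints and the stated rate agree to within rounding, which means the forecast describes roughly a tenfold expansion over the decade.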

The growth rate reflects more than optimism. Three structural preconditions converged in 2023 and 2024 that made this market possible. First, internet-scale Vision-Language Models reached sufficient capability to jointly process language and visual input with the semantic richness required for real-world grounding. Second, the Open X-Embodiment dataset, a consortium effort across 22 research institutions collecting 1.4 million robot trajectories from 22 different hardware platforms, provided the training foundation for cross-embodiment generalization. Third, humanoid robot hardware from manufacturers including Figure AI, 1X Technologies, and Unitree Robotics reached mechanical maturity sufficient to justify the software investment layered on top.

The software component leads VLA market revenue, commanding the largest segment share through model training frameworks, inference APIs, and fine-tuning platforms. The robotics application segment dominates adoption across the forecast, anchored by industrial robot instruction-following and manipulation task deployment. Cloud deployment mode leads over on-premises configurations, driven by scalable inference infrastructure from hyperscalers including Amazon Web Services, Microsoft Azure, and Google Cloud. Among end-user verticals, manufacturing is the fastest-growing segment, driven by robotic assembly, quality inspection, and autonomous logistics automation investment.

RT-2 to GR00T N1.7: How Three Architectural Philosophies Are Splitting the Market

Google DeepMind established the paradigm with RT-2 in mid-2023, co-training vision-language models PaLI-X and PaLM-E on physical robot demonstration data to produce a system capable of both web-scale reasoning and real-world robotic control. Its successor, Gemini Robotics, built on Gemini 2.0, added physical actions as a direct output modality. The September 2025 Gemini Robotics 1.5 update introduced agentic capabilities that make reasoning transparent to operators during multi-step task execution.

Stanford University's OpenVLA, released in June 2024, reoriented the landscape by demonstrating that a 7-billion-parameter open-source model trained on 970,000 robot demonstrations from the Open X-Embodiment dataset outperformed Google DeepMind's 55-billion-parameter RT-2-X by 16.5 percentage points across 29 evaluation tasks. Seven times fewer parameters, better real-world performance. That result made VLA adoption economically viable outside hyperscaler procurement channels.

NVIDIA's GR00T N1.7 Early Access, a 3-billion-parameter dual-system VLA with Apache 2.0 licensing via HuggingFace and GitHub, is NVIDIA's direct play for the open developer ecosystem, with early adopters including AeiRobot, Foxlink, NEURA Robotics, and Lightwheel. NVIDIA identified the first scaling law for robot dexterity in developing this architecture: moving from 1,000 to 20,000 hours of human egocentric training data more than doubles task completion rates.

Physical Intelligence's π0.5 uses diffusion-based action generation rather than discrete token prediction, producing superior dexterity on manipulation benchmarks. Figure AI's Helix, running on the Figure 02 humanoid at BMW's Spartanburg plant, processes multimodal inputs on embedded NVIDIA GPUs to generate real-time motor commands. Three philosophies, three deployment hypotheses. Each has production evidence in its favor.

Why the 26.40% CAGR Is Not Optimism: Four Structural Forces Behind the Forecast

One Policy, Any Robot: The Cross-Embodiment Breakthrough That Rewired the Procurement Model

The single biggest technical unlock in recent VLA research is cross-embodiment learning: training one model on demonstrations from many different robot hardware platforms and discovering that the resulting policy performs better on each individual hardware type than a body-specific model trained on the same per-robot data alone. The Open X-Embodiment consortium's proof point, RT-X exhibiting 50% positive transfer across robot morphologies, changed the investment thesis for industrial automation buyers. Instead of procuring separate software for each robot platform in a factory, operations teams can fine-tune a single foundation model per use case, then deploy it across heterogeneous fleets.

Semiconductor fabrication operators reported in Q1 2026 that the ability to retask a robot arm in hours, versus weeks for traditional reprogramming, is unlocking entirely new use cases in wafer handling, PCB inspection, and component placement. The retasking speed advantage compounds: each new task deployment takes less time than the one before as the fine-tuning process stabilizes around a familiar model architecture.

10,000 Humanoids Shipped in Months: Why Hardware Scale Is Now a VLA Software Demand Signal

Boston Dynamics' electric Atlas began commercial deployment with its 2026 production allocation committed to Hyundai and Google DeepMind. AgiBot produced its 10,000th humanoid in late March 2026, scaling from 1,000 units in 2025 to 10,000 within months. Tesla's Optimus program is targeting one million units per year from its Fremont facility. Figure AI's BotQ facility targets 12,000 units annually.

Hardware scale creates software demand. Every humanoid shipped without a capable VLA policy is hardware that cannot perform its economic function. The dependency runs in one direction: VLA software is now the binding constraint on humanoid robot utility, and that constraint is attracting the investment Kaiso Research's primary data reflects.

The Compounding Advantage Most Market Participants Are Underestimating

Kaiso Research's analysis of demand curves across the robotics software sector identifies data flywheel dynamics as a compounding growth driver that most market participants are underestimating in their near-term forecasts. The pattern is structurally identical to what drove large language model capability gains between 2020 and 2023: scale the training data, observe capability jumps, attract more deployments that generate more training data, repeat.

The VLA version of this loop is physically grounded. Each robot deployment generates proprioceptive, visual, and action data that feeds back into model improvement cycles. Physical Intelligence, Figure AI, and NVIDIA are each building data collection infrastructure (teleoperation systems, simulation pipelines, and synthetic data generation) explicitly to accelerate this loop. The DROID dataset, comprising 76,000 trajectories across 564 scenes, and pre-training approaches that draw on internet video such as egocentric YouTube footage demonstrate that the data ceiling for VLA training is far higher than the current corpus.

From Capex to Opex: AWS and Azure Have Changed Who Can Afford to Deploy VLA Models

Microsoft Azure's Robotics AI Platform, launched in late 2024 with optimized inference for transformer-based action models, and AWS's equivalent hosted model services have materially reduced the technical barrier to enterprise VLA deployment. Procurement teams that previously required significant on-premises GPU infrastructure can now access VLA inference via API, paying per inference call rather than per hardware unit. This changes the procurement model from capital expenditure to operational expenditure, a shift that typically accelerates adoption cycles in enterprise technology markets.
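The capex-to-opex trade-off reduces to a break-even calculation. The sketch below is illustrative only: every price in it is a hypothetical placeholder rather than a quoted AWS or Azure rate, and the function names are invented for this example.

```python
def breakeven_calls(capex_usd: float,
                    onprem_cost_per_call: float,
                    api_price_per_call: float) -> float:
    """Inference calls at which owning hardware costs the same as paying per call."""
    if api_price_per_call <= onprem_cost_per_call:
        raise ValueError("API is never more expensive per call, so no break-even exists")
    return capex_usd / (api_price_per_call - onprem_cost_per_call)


def robot_hours(calls: float, calls_per_second: float) -> float:
    """Continuous robot operating hours needed to issue `calls` inference calls."""
    return calls / (calls_per_second * 3600)


# Hypothetical inputs, chosen only to show the shape of the calculation.
CAPEX_USD = 250_000.0          # assumed on-premises GPU cluster cost
ONPREM_COST_PER_CALL = 0.0002  # assumed power and maintenance cost per call
API_PRICE_PER_CALL = 0.0010    # assumed hosted inference price per call

calls = breakeven_calls(CAPEX_USD, ONPREM_COST_PER_CALL, API_PRICE_PER_CALL)
hours = robot_hours(calls, calls_per_second=2.0)

print(f"Break-even at {calls:,.0f} calls")
print(f"Roughly {hours:,.0f} robot-hours at 2 inference calls per second")
```

Below the break-even volume the pay-per-call model is cheaper, which is why hosted inference tends to suit pilots and low-volume deployments, while sustained high-volume fleets may still justify owned hardware.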

Four Developments Reshaping VLA Deployment in 2026: Speed, World Models, Autonomous Vehicles, and Open Source

Inference Optimization Unlocking Real-Time Deployment

An LLM that takes two seconds to begin streaming output is acceptable. A robot that takes two seconds to react to a falling object is dangerous. Real-time VLA deployment requires control at 30 to 100 Hz on hardware constrained by the robot's compute budget. Three engineering techniques now dominate production deployments and are changing the cost structure of VLA inference.

First, action chunking: instead of predicting one action per inference step, the model emits a chunk of 8 to 50 future actions at once, which the robot executes open-loop while the next chunk computes. This cuts the required inference frequency dramatically while improving motion smoothness.

Second, quantization: quantized VLA models that run at 10 to 25 Hz on consumer-grade GPUs have made real-time manipulation loops compatible with hardware that costs a fraction of datacenter-grade alternatives.

Third, dual-system architectures, adopted by GR00T N1.7 and Figure's Helix, decouple slow reasoning from fast control, allowing high-frequency motor commands without waiting for the full language-grounded reasoning cycle to complete.
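To make the action chunking arithmetic concrete, here is a minimal sketch of a chunked control loop. It is a simplification under stated assumptions: the policy is a dummy stand-in rather than a real VLA, and inference runs synchronously, whereas a production loop would compute the next chunk in the background while the current one executes. All names are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

Action = List[float]
# Hypothetical interface: control step index -> chunk of future actions.
ChunkPolicy = Callable[[int], List[Action]]


def required_inference_hz(control_hz: float, chunk_size: int) -> float:
    """Inference rate needed to keep the action buffer full at a given control rate."""
    return control_hz / chunk_size


def run_chunked(policy: ChunkPolicy, control_steps: int) -> Tuple[List[Action], int]:
    """Run the control loop, querying the policy only when the buffer is empty."""
    buffer: Deque[Action] = deque()
    executed: List[Action] = []
    inference_calls = 0
    for step in range(control_steps):
        if not buffer:
            buffer.extend(policy(step))  # one inference call yields a whole chunk
            inference_calls += 1
        executed.append(buffer.popleft())  # open-loop execution of buffered actions
    return executed, inference_calls


def make_dummy_policy(chunk_size: int) -> ChunkPolicy:
    """Stand-in for a VLA: returns 1-D actions equal to the control step index."""
    def policy(step: int) -> List[Action]:
        return [[float(step + k)] for k in range(chunk_size)]
    return policy


executed, calls = run_chunked(make_dummy_policy(chunk_size=25), control_steps=100)
print(f"{len(executed)} actions executed with {calls} inference calls")
print(f"50 Hz control with 25-action chunks needs "
      f"{required_inference_hz(50, 25):.1f} Hz inference")
```

With 25-action chunks, a 50 Hz control loop needs only 2 Hz of inference, which is how models in the 10 to 25 Hz range can still drive faster control loops. The cost is that actions inside a chunk are executed without fresh observations, so chunk length trades reactivity against compute.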

World Model Convergence

The boundary between VLA models and world models is dissolving. A VLA maps current observations to actions. A world model predicts future states. NVIDIA's Cosmos World Foundation Model has been absorbed into GR00T N1.7's VLM backbone, enabling the model to reason about consequences of actions before executing them, a capability that addresses one of the core failure modes of early VLA architectures: irreversible errors caused by action selection without anticipation. Physical Intelligence and DeepMind are pursuing similar convergence paths. The result is a generation of VLA systems that do not merely react to the current scene but plan across time-extended action sequences.
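The distinction between reacting and anticipating can be sketched in a few lines. The toy below is not how GR00T N1.7 or any named system integrates a world model; it only illustrates the general idea of screening candidate actions against a predicted next state before committing to one. The one-dimensional gripper state, the additive dynamics, and every function name are assumptions made for illustration.

```python
from typing import Callable, Iterable, Optional

State = float   # toy state: gripper height above the table, in metres
Action = float  # toy action: commanded change in height, in metres


def first_safe_action(
    state: State,
    candidates: Iterable[Action],
    predict_next: Callable[[State, Action], State],
    is_irreversible: Callable[[State], bool],
) -> Optional[Action]:
    """Return the first candidate whose predicted outcome is not irreversible."""
    for action in candidates:
        predicted = predict_next(state, action)  # world model: anticipate the outcome
        if not is_irreversible(predicted):
            return action
    return None  # no safe candidate: the caller should stop or replan


def toy_world_model(state: State, action: Action) -> State:
    """Assumed dynamics: the gripper moves exactly as commanded."""
    return state + action


def hits_table(state: State) -> bool:
    """Assumed failure condition: a height below zero means a collision."""
    return state < 0.0


# Candidates ordered by policy preference; the most aggressive move comes first.
chosen = first_safe_action(0.03, [-0.05, -0.02, 0.0], toy_world_model, hits_table)
print(f"Chosen action: {chosen}")
```

A purely reactive policy would have issued the first candidate and driven the gripper into the table. The screening step rejects it because the predicted state violates the safety condition, then falls back to the next preference.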

Autonomous Vehicles Adopting End-to-End VLA Architectures

Li Auto released what it described as the world's first VLA driver model in September 2024. At NVIDIA GTC 2026, DeepRoute.ai unveiled a 40-billion-parameter VLA Foundation Model architecture for autonomous driving, collapsing the traditional modular autonomous driving stack, covering separate perception, prediction, planning, and control subsystems, into a single unified model that jointly handles all functions. XPeng Motors obtained a Level 3 autonomous driving road test license in Guangzhou and targets Level 4 mass production vehicles in 2026. Li Auto's MindVLA architecture integrates spatial intelligence, language intelligence, and behavioral intelligence, with planned mass production implementation in 2026.

The shift from modular autonomous driving stacks to unified VLA architectures is not primarily a technical preference. It reflects a bottleneck diagnosis: the interface between separate perception and planning subsystems has been the failure point in edge-case scenarios. Collapsing those interfaces into a single model trained end-to-end removes the architectural seam where errors compound.

Open-Source Foundation Model Ecosystem Democratizing Adoption

VLA model releases from Meta AI and Stability AI are widening access to foundation models outside hyperscaler procurement channels. OpenVLA's permissively licensed release enabled rapid academic iteration and exceeded 1,000 citations within 12 months. NVIDIA's GR00T N1.7 Early Access with Apache 2.0 licensing is the commercial equivalent, targeting the developer ecosystem directly. Open-weight models with permissive licensing change who can compete in VLA-powered products. A logistics startup in Southeast Asia can now build on the same foundational model architecture as Toyota Research Institute, with fine-tuning timelines measured in days rather than months.

Why Benchmark Scores Are the Wrong Metric: The Three Architectural Choices That Actually Determine Deployment Success

Three architectural choices separate deployments that work from those that fail. First, action representation: RT-2 and OpenVLA represent robot actions as discrete language tokens, enabling co-fine-tuning on vision-language and robot data but limiting precision. Physical Intelligence's π0 uses continuous diffusion-based distributions, recovering dexterity at the cost of training simplicity. The 2026 ICLR research consensus leans toward adaptive hybrid representations.
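The precision limit of discrete action tokens follows from simple binning arithmetic. RT-2 and OpenVLA discretize each action dimension into 256 bins; the sketch below assumes uniform bins over a normalized range of -1 to 1, which is a simplification (the real systems choose bin boundaries from the training data), and the function names are hypothetical.

```python
N_BINS = 256  # bins per action dimension, as used by RT-2 and OpenVLA


def tokenize(value: float, low: float = -1.0, high: float = 1.0,
             n_bins: int = N_BINS) -> int:
    """Map a continuous action value to a bin index (assumes uniform bins)."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * n_bins)
    return min(index, n_bins - 1)  # the upper edge falls into the last bin


def detokenize(token: int, low: float = -1.0, high: float = 1.0,
               n_bins: int = N_BINS) -> float:
    """Map a bin index back to the centre of its bin."""
    width = (high - low) / n_bins
    return low + (token + 0.5) * width


commanded = 0.1234
token = tokenize(commanded)
recovered = detokenize(token)
max_error = (1.0 - (-1.0)) / N_BINS / 2  # half a bin width

print(f"commanded={commanded}, token={token}, recovered={recovered}")
print(f"round-trip error={abs(recovered - commanded):.6f}, worst case={max_error:.6f}")
```

Under these assumptions the worst-case round-trip error is half a bin width, about 0.4% of the normalized range per dimension. That error is negligible for coarse pick-and-place and becomes material for fine manipulation, which is the gap continuous action heads aim to close.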

Second, world model integration: NVIDIA's embedding of Cosmos World Foundation Model predictions into GR00T N1.7 reduces irreversible manipulation errors, as confirmed by early deployment data from AgiBot and Foxlink. Third, the generalization boundary: ICLR 2026 research confirms that downstream VLA task performance has no reliable correlation with VLM backbone scores on standard benchmarks. A backbone that tops VQA rankings does not necessarily produce a better robot policy. Selecting a foundation model requires task-specific evaluation on actual deployment hardware. Benchmark proxy scoring is not a substitute.

Google DeepMind Has the Best Models. NVIDIA Has the Best Ecosystem. Who Wins?

Google DeepMind holds the strongest model capability position through the Gemini Robotics family and its access to the Open X-Embodiment consortium data across 22 hardware platforms. The limitation is commercial accessibility: Gemini Robotics remains available only to trusted testers as of mid-2026. NVIDIA commands the deployment ecosystem axis through Isaac simulation, Cosmos World Foundation Model, and GR00T N1.7's Apache 2.0 licensing, creating a vertically integrated stack from synthetic data generation to edge inference on NVIDIA-powered hardware. Every humanoid manufacturer running NVIDIA compute is a potential GR00T deployment. Physical Intelligence holds the strongest dexterity benchmark position with π0.5, but its strategic challenge is converting that leadership into defensible distribution before a well-resourced incumbent closes the gap.

China's VLA program is running on a parallel track that Western competitors cannot replicate. DeepRoute.ai's 40-billion-parameter VLA architecture, unveiled at GTC 2026, targets one million autonomous vehicle deployments by the end of 2026. Li Auto's MindVLA, Baidu Apollo, and XPeng's Level 4 program represent coordinated domestic investment backed by government frameworks and municipal testing approval in Guangzhou and Beijing. The competitive question is not just which architecture wins. It is which geography controls the deployment data that trains the next generation.

$1.9 Billion Across 47 Deals in 2025: What the Shift to Commercial Milestone Rounds Signals

Total venture capital investment in humanoid robotics, the primary hardware category driving VLA software demand, exceeded USD 3 billion in 2024 alone. Figure AI's September 2025 USD 1 billion raise at a USD 39 billion valuation represents the single largest round. Across Kaiso Research's tracking of disclosed deals in the VLA and physical AI software segment, approximately USD 1.9 billion moved across 47 disclosed transactions in 2025.

The investment thesis in 2026 has shifted from "can VLA models work?" to "which VLA deployment strategy generates defensible software revenue at enterprise scale?" That thesis shift is visible in deal structure: later-stage rounds with commercial deployment milestones replacing early-stage bets on research capability. AWS and Microsoft Azure's launch of dedicated VLA inference infrastructure signals that the hyperscalers have resolved their internal debate about market timing. When infrastructure providers build dedicated hosting capacity, the revenue inflection is typically 12 to 24 months away.

Regulatory and Policy Developments

The regulatory environment for VLA systems is forming around two distinct frameworks that will shape deployment velocity differently across geographies.

The EU AI Act, whose first obligations began to apply in February 2025, classifies AI systems in autonomous vehicles, healthcare applications, and public infrastructure as high-risk, requiring pre-market testing documentation, performance benchmarking, and human oversight provisions before commercial deployment. For VLA systems in those categories, compliance infrastructure is now a prerequisite cost that factors into total deployment economics. Organizations deploying in EU markets without documented testing and oversight protocols face material regulatory exposure.

The United States operates without a unified national AI framework but has active agency-specific oversight: FDA guidance applies to VLA systems in medical device applications, NHTSA has autonomous vehicle AI guidance that increasingly references end-to-end model architectures, and sector-specific frameworks are developing faster than congressional legislation. This creates an asymmetric regulatory environment: US companies face lower compliance friction for domestic deployment but increasing complexity for EU market entry.

China mandates pre-approval of AI algorithms under its existing algorithmic regulatory framework, which applies to VLA systems deployed in consumer-facing applications. The domestic autonomous vehicle VLA deployments by DeepRoute.ai, Li Auto, and XPeng are proceeding under approved testing frameworks coordinated with municipal governments in Guangzhou and Beijing, reflecting an approach that treats regulatory approval as a competitive resource managed at the company-government relationship level.

The pattern across geographies: safety-critical VLA deployments in healthcare, autonomous vehicles, and public infrastructure will face increasing compliance requirements everywhere. VLA deployments in industrial manufacturing, logistics automation, and retail operations face substantially lower regulatory barriers and shorter compliance paths to revenue.

Strategic Implications for Businesses

For manufacturers and industrials: The retasking speed advantage VLA models provide, measured in hours versus weeks for traditional reprogramming, is not a marginal productivity improvement. It is a different operating model for factory flexibility. Plants with VLA-equipped robot fleets respond to product variant changes and production disruptions without the programming lead time that currently constrains flexible manufacturing. The procurement pathway decision matters: hyperscaler inference APIs favor rapid iteration at lower volume, while open-weight fine-tuning on NVIDIA GR00T N1.7 or Stanford OpenVLA suits high-volume, latency-sensitive deployment. Either way, the organizations that begin structured teleoperation data collection now will enter procurement negotiations months ahead of competitors who wait.

For investors and healthcare technology organizations: Kaiso Research's primary dataset identifies the fine-tuning infrastructure layer as the highest-margin software opportunity in the near-term forecast. Commercial value is concentrating in deployment tooling and domain-specific fine-tuning rather than in the base foundation models themselves, which are converging toward open-weight architectures. In healthcare, VLA applications in surgical assistance and patient care are moving toward clinical trial timelines in 2026. The regulatory pathway under FDA and EU AI Act high-risk classification means organizations beginning compliance documentation now are 18 to 24 months ahead of competitors. The organizations deploying VLA-powered healthcare robotics in 2028 are the ones building compliance infrastructure in Q3 2026.

Challenges and Risks

The 26.40% CAGR projection is not a guarantee. Three structural risks deserve direct assessment.

Real-time inference constraints remain unresolved at the edge. Quantized VLA models running at 10 to 25 Hz on consumer-grade GPUs satisfy many manipulation loop requirements, but the most dexterous applications, including surgical robotics, precision electronics assembly, and fine object manipulation, require higher-frequency control that current architectures do not consistently deliver without dedicated inference hardware. The hardware cost of meeting this requirement at the edge limits addressable market in cost-sensitive deployment contexts.

Training data for safety-critical deployments is constrained by liability. The data flywheel that accelerates consumer robotics and industrial automation development stalls in healthcare and autonomous vehicles because deployment failures generate liability exposure that discourages data collection in edge cases. The scenarios where VLA models most need training data, such as unexpected patient movements during surgical procedures and novel road conditions in autonomous driving, are precisely the scenarios where deployment at scale is most restricted. Synthetic data generation and simulation are partial mitigations, but the gap between simulated and real-world distribution coverage is a known failure mode.

Cross-embodiment transfer has hard limits that benchmark data understates. The Open X-Embodiment result showing 50% positive transfer across robot morphologies is a foundational achievement. It also means 50% of the transfer was not positive. For industrial deployments where failure modes carry capital costs including equipment damage, production line stoppages, and safety incidents, the reliability bar is higher than research transfer benchmarks measure. Organizations deploying VLA models on expensive hardware in critical production environments should expect longer validation timelines than academic benchmarks suggest.

Future Outlook

Kaiso Research projects the VLA Models Market at USD 40.50 billion by 2035 across a non-linear trajectory. The 2026 to 2028 window generates revenue primarily through software APIs, fine-tuning services, and inference infrastructure as enterprise pilots convert to production. The largest absolute revenue concentration sits in 2028 to 2032, when cross-embodiment generalization matures enough that a single foundation model fine-tune covers an entire heterogeneous robot fleet, creating a direct multiplier on humanoid hardware scale from Figure, Tesla, and Boston Dynamics.

The 2032 to 2035 phase sees market consolidation around three to five dominant platform VLA providers, with commercial value shifting to domain-specialized fine-tuning in verticals where general models fall short. The question executives should be asking now is not whether to evaluate VLA integration. It is which of their three-year capital projects assume competitors still need weeks to retask a robot. Those projects are already at risk.

Conclusion

The transition from explicitly programmed industrial robots to VLA-driven generalist robot policies is not a future scenario. As of Q1 2026, at least eleven commercial deployments across manufacturing, healthcare-adjacent logistics, and autonomous vehicles use VLA models as their primary policy backbone. Policies that once took more than a thousand demonstrations to train from scratch are now outperformed by fine-tunes built on a few hundred. Inference that previously required datacenter hardware now runs on embedded chips inside humanoid bodies.

Kaiso Research's primary data puts the market at USD 3.89 billion in 2025. The path to USD 40.50 billion by 2035 runs through hardware scale, data flywheel acceleration, and the consolidation of the modular robot software stack into a small number of dominant foundation models, a structural transition with a clear historical analogue in how large language models reshaped enterprise software between 2020 and 2024.

What makes this market unusual is the physical stakes. A language model that hallucinates produces a wrong answer. A VLA model that hallucinates moves a robot arm into a person, a machine, or a production line. The organizations that win the next decade of this market will be the ones that understood early that the hard problem was not model capability. It was trust. Trust in edge cases. Trust in high-frequency control. Trust under regulatory scrutiny. The 26.40% CAGR reflects a market learning, fast, that this trust can be built.

About Kaiso Research and Consulting

Kaiso Research and Consulting is a global market intelligence firm publishing 5,000+ research reports across 11+ industry verticals. kaisoresearch.com | +1 872 219 0417

Lead Industry Analyst, Kaiso Research and Consulting | Covering AI, Robotics, and Emerging Technology Markets

Published: 2026-06-06 | Report Code: IMEC1127

Sample Report Available at: https://www.kaisoresearch.com/report-store/global-vision-language-action-models-market/sample-request
