The Utilization Gap

Why the AI industry does not have a compute crisis, but a utilization crisis.

Apr 28, 2026

I. The Wrong Crisis

“More GPUs. More data centers. More power.”

“The constraint is compute. Deploy more of it.”

“Whoever secures the most silicon wins!”

This is the story the AI industry has been telling itself for two years. And the capital markets believed it. Hyperscalers committed over $600 billion in AI infrastructure spending for 2026. GPU lead times stretched past 36 weeks. Countries began treating chip access as a matter of national security.

Is the story wrong? No, but it is incomplete. And the part it leaves out significantly reframes where infrastructure value accrues.

A report from Cast AI, published this month, analyzed tens of thousands of Kubernetes clusters across AWS, Azure, and Google Cloud. The finding: average GPU utilization across AI and machine learning workloads is 5%.

FIVE PERCENT!

The AI industry is desperate for more and more GPUs and YET leaving 95% of the capacity it already has idle. That is not a supply problem. It is a systems problem.

The constraint on AI performance is not compute. It is the ability to USE compute. The memory hierarchy cannot feed GPUs fast enough. The network fabric cannot move data between them efficiently. The software orchestration layer cannot schedule workloads intelligently enough to keep expensive silicon productive.

I call it THE UTILIZATION GAP.

Billions of dollars in deployed infrastructure sitting idle because the system around the GPU cannot keep pace with the GPU itself.

In The Concentration Thesis, I argued that inference will concentrate, not commoditize. This essay is the prequel. Before we debate who wins the inference market, we need to confront the fact that most inference infrastructure is not being used at all.

Everyone is asking: how do we get more compute? The better question: how do we use what we already have?

This is Part I of a two-part series. This essay covers where infrastructure value accrues. Part II, The Inference Tax, covers how these same constraints shape which AI products get built.

II. The Memory Wall

The first bottleneck is memory.

Large language models use a mechanism called a KV cache, short for key-value cache, to store the context accumulated during inference. Every token the model processes generates key and value vectors that must be retained so the model can attend to previous context when generating the next token. This cache lives in the GPU’s high-bandwidth memory, or HBM.

You can guess the problem! HBM is finite. Context windows are not.
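To put rough numbers on that, here is a back-of-envelope sketch of KV cache size. The model shape (80 layers, 8 KV heads, head dimension 128, 16-bit values) and the 80 GB HBM budget are illustrative assumptions of mine, not figures from any benchmark cited in this essay, and the arithmetic ignores the memory taken up by model weights and activations.

```python
# Back-of-envelope KV cache sizing.
# Every model and hardware number here is an illustrative assumption.

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int) -> int:
    # Each layer keeps one key vector and one value vector per token (the factor of 2).
    return 2 * layers * kv_heads * head_dim * bytes_per_value

def kv_cache_gb(tokens: int, per_token_bytes: int) -> float:
    # Total cache size in decimal gigabytes for a single sequence.
    return tokens * per_token_bytes / 1e9

def max_tokens_in_hbm(hbm_gb: float, per_token_bytes: int) -> int:
    # Upper bound on context length if HBM held nothing but the KV cache.
    return int(hbm_gb * 1e9 // per_token_bytes)

# Hypothetical model shape: 80 layers, 8 KV heads, head_dim 128, fp16 values.
PER_TOKEN = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_value=2)

print(PER_TOKEN)                              # 327680 bytes per token
print(kv_cache_gb(1_000_000, PER_TOKEN))      # 327.68 GB for one million tokens
print(max_tokens_in_hbm(80, PER_TOKEN))       # 244140 tokens fit in 80 GB
```

Under these assumptions, a single one-million-token context needs roughly four times the HBM of an 80 GB accelerator before a single weight is loaded.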

When the KV cache fills HBM, the GPU faces two options. It can evict cached data and recompute it later when needed. Or it can stall, waiting for data to arrive from slower storage tiers. Both options are devastating for production economics.

The numbers make this concrete. Without KV cache persistence, time-to-first-token at one million tokens is over six minutes. At ten million tokens, it is 8.3 hours. An entire workday. The GPU is recomputing previously completed work or waiting for data. WHAT A WASTE!

A new category of infrastructure is forming to solve this. The fix borrows from decades of CPU cache architecture: multi-level caching and intelligent prefetching mitigated CPU stalls in the 1990s, and inference systems are catching up now.

Lightbits Labs, VAST Data, WEKA, and DDN are all building storage layers optimized for KV cache persistence and retrieval, each approaching the problem from a different architectural starting point. Lightbits tiers the cache from HBM through DRAM to NVMe with intelligent prefetching. VAST and WEKA are integrating KV cache management into their data platforms, with WEKA’s Augmented Memory Grid plugging directly into Nvidia’s BlueField-4 STX architecture. The benchmarks across these solutions show the same pattern: orders-of-magnitude improvements in time-to-first-token at long context lengths, turning workloads that were impractical on commodity hardware into production-viable deployments.
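The tiering idea itself is simple enough to sketch. What follows is a toy two-tier cache, not any vendor's design: the class name `TieredKVCache`, the capacity, and the keys are all hypothetical, and it leaves out prefetching entirely. The point it illustrates is that an entry pushed out of the fast tier is demoted to a slower tier instead of being discarded, so a later lookup costs a slow read rather than a full recompute.

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy two-tier cache: a small fast tier (stand-in for HBM) backed by a
    larger slow tier (stand-in for DRAM or NVMe). Hypothetical illustration."""

    def __init__(self, fast_capacity: int):
        self.fast_capacity = fast_capacity
        self.fast = OrderedDict()  # least recently used entry sits at the front
        self.slow = {}             # unbounded here, for simplicity
        self.stats = {"fast_hit": 0, "slow_hit": 0, "miss": 0}

    def put(self, key, value):
        self.slow.pop(key, None)   # an entry lives in exactly one tier
        self.fast[key] = value
        self.fast.move_to_end(key)
        while len(self.fast) > self.fast_capacity:
            old_key, old_value = self.fast.popitem(last=False)
            self.slow[old_key] = old_value  # demote instead of discard

    def get(self, key):
        if key in self.fast:
            self.stats["fast_hit"] += 1
            self.fast.move_to_end(key)
            return self.fast[key]
        if key in self.slow:
            self.stats["slow_hit"] += 1
            value = self.slow[key]
            self.put(key, value)   # promote back into the fast tier
            return value
        self.stats["miss"] += 1    # caller would have to recompute
        return None

cache = TieredKVCache(fast_capacity=2)
for name in ("a", "b", "c"):       # "a" is demoted when "c" arrives
    cache.put(name, f"kv-blocks-{name}")
cache.get("a")                     # slow-tier hit, promoted back
cache.get("c")                     # fast-tier hit
cache.get("z")                     # true miss
print(cache.stats)
```

A production system would add the part that matters most, which is predicting what to pull back into the fast tier before the GPU asks for it.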

Nvidia validated the entire category. At CES 2026, it announced the Inference Context Memory Storage Platform (ICMSP). At GTC, it expanded this with the CMX architecture and STX reference design, backed by BlueField-4 DPUs. The partner list reads like a who’s who of enterprise storage: DDN, Dell, HPE, Hitachi Vantara, IBM, Nutanix, Pure Storage, Supermicro, VAST Data, and WEKA. When Nvidia builds a reference architecture around a problem, the problem is real and the market is about to move.

III. The Network Wall

The second bottleneck is networking.

GPU compute is scaling exponentially. Networking is not. As clusters grow toward hundreds of thousands of accelerators, the interconnect fabric becomes the binding constraint on how much of that compute actually gets used.

Google researchers published a paper in early 2026 identifying this gap directly. The primary threats to inference economic viability, they argued, are memory and network latency, not compute. The paper proposed low-latency topologies and processing-in-network as potential mitigations. The implication was stark: the industry has been optimizing the wrong layer.

The physics are unforgiving. Adding more GPUs to a cluster can actually decrease effective utilization if the network fabric cannot move data between them fast enough. Every GPU waiting for data from another GPU is a GPU doing nothing. At the scale of modern training and inference clusters, these idle cycles compound into billions of dollars of stranded value.
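A toy model shows how this can happen. Assume a fixed amount of work per step is split evenly across GPUs, and that communication time per step grows linearly with the number of GPUs. Both the linear communication term and the constants are assumptions chosen for illustration, not measurements of any real fabric.

```python
# Toy strong-scaling model. The linear communication cost and the constants
# are illustrative assumptions, not measurements.

def step_time(work: float, n_gpus: int, comm_cost: float):
    compute = work / n_gpus            # per-GPU compute shrinks as GPUs are added
    comm = comm_cost * (n_gpus - 1)    # assumed: communication grows with cluster size
    return compute, comm

def utilization(work: float, n_gpus: int, comm_cost: float) -> float:
    # Fraction of each step that a GPU spends computing rather than waiting.
    compute, comm = step_time(work, n_gpus, comm_cost)
    return compute / (compute + comm)

WORK, COMM_COST = 1000.0, 1.0          # hypothetical units

for n in (1, 8, 32, 128):
    compute, comm = step_time(WORK, n, COMM_COST)
    print(n, round(compute + comm, 2), round(utilization(WORK, n, COMM_COST), 3))
```

With these made-up constants, total step time bottoms out around 32 GPUs and is worse at 128 than at 32, while per-GPU utilization falls the whole way. A faster fabric lowers the communication term and pushes that turning point out.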

General-purpose networking was built for web traffic. AI clusters need something closer to a nervous system.

Eridu emerged from stealth this year with over $200 million in funding to build that nervous system from scratch. No retrofitting. A purpose-built network switch architecture that delivers single-hop scale-up domains with thousands of GPUs and scale-out domains with millions. The claimed economics: up to 50% capex savings on networking infrastructure and 70% lower networking power.

Arrcus is approaching the problem differently, building policy-aware network fabric that dynamically steers inference traffic based on real-time constraints: latency, throughput, power capacity, data residency, and cost. The company projects over $100 million in bookings for 2026 and recently launched the Arrcus Inference Network Fabric specifically for distributed AI inference.

I saw this dynamic from the hardware side. I spent a couple of years building inference accelerators at an AI chip company that was acquired by AMD. One lesson became clear early: you could design the fastest accelerator in the world, and it would still sit idle if the system around it could not feed it data fast enough. The chip is only as good as the slowest link in the chain. For training, that link was often interconnect bandwidth between GPUs in a cluster. For inference at scale, it is increasingly the network fabric connecting distributed inference nodes across a data center or across regions.

IV. The Orchestration Gap

The third bottleneck is software. And it is the one closest to my lane.

Even when memory and networking are solved at the hardware layer, software orchestration determines whether utilization actually improves in practice. The Cast AI data tells this story clearly. The minority of organizations reporting higher GPU utilization, reaching 49% on H200s and 30% on H100s, are not running better hardware. They are running better automation. GPU-aware scheduling. Dynamic batching. Automated lifecycle management. The gap between 5% average utilization and 49% is not a silicon gap. It is a software gap.

The dominant discourse in AI infrastructure is fixated on the trifecta of compute: GPUs, power, and cooling. Fidelity writes about it. Blackstone builds decks around it. Every retail investing blog ranks the same five hardware stocks. But the software layer that turns expensive hardware into productive infrastructure, the layer that determines whether a $4 million rack of B200 GPUs earns revenue or burns cash, is undercapitalized relative to its impact.

Semantic caching reduces redundant API calls by 30 to 50% for typical enterprise deployments. A company running customer support agents can cut GPU spend by 40% with semantic caching alone, because most support questions are variations of the same twenty queries.
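For readers who have not seen one, a semantic cache can be sketched in a few lines. This is a minimal illustration under loud assumptions: `embed` is a bag-of-words stand-in for a real embedding model, the 0.8 similarity threshold is arbitrary, and `fake_model` is a hypothetical placeholder for an actual inference call.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: lowercase bag-of-words counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.8):   # arbitrary threshold
        self.threshold = threshold
        self.entries = []                          # list of (embedding, answer)
        self.hits = 0
        self.misses = 0

    def ask(self, query: str, model_call):
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            self.hits += 1                         # similar enough: skip the GPU
            return best[1]
        self.misses += 1
        answer = model_call(query)                 # the expensive path
        self.entries.append((q, answer))
        return answer

calls = []
def fake_model(query: str) -> str:                 # hypothetical inference call
    calls.append(query)
    return f"answer to: {query}"

cache = SemanticCache()
cache.ask("How do I reset my password?", fake_model)
cache.ask("How can I reset my password?", fake_model)   # near-duplicate, served from cache
cache.ask("What is your refund policy?", fake_model)
print(cache.hits, cache.misses, len(calls))
```

The threshold is the whole game in practice: set it too low and customers get answers to questions they did not ask, set it too high and the cache never hits.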

Model routing directs requests to the cheapest capable model for each task. Workload-aware scheduling packs jobs so GPUs do not sit idle between training runs. Inference-specific observability tells operators which workloads are underperforming and why.
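Model routing in its simplest form looks something like the sketch below. The catalog, the capability tiers, and the prices are all hypothetical, and real routers have to estimate task difficulty rather than receive it as an input, which is the hard part.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Model:
    name: str
    capability: int             # hypothetical tier: higher handles harder tasks
    cost_per_1k_tokens: float   # hypothetical price

# Illustrative catalog; names and prices are made up.
CATALOG = [
    Model("small", capability=1, cost_per_1k_tokens=0.10),
    Model("medium", capability=2, cost_per_1k_tokens=0.50),
    Model("large", capability=3, cost_per_1k_tokens=3.00),
]

def route(required_capability: int, catalog=CATALOG) -> Model:
    # Keep only models that can handle the task, then pick the cheapest.
    capable = [m for m in catalog if m.capability >= required_capability]
    if not capable:
        raise ValueError("no model in the catalog can handle this task")
    return min(capable, key=lambda m: m.cost_per_1k_tokens)

print(route(1).name)   # small
print(route(3).name)   # large
```

If most traffic is tier-one work, most traffic never touches the expensive model, and the large model's GPUs are left free for the requests that need them.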

None of this is glamorous. All of it is the difference between infrastructure that pays for itself and infrastructure that destroys capital.

Here is the connection to a claim I made in The $16 Trillion Flip: AI services margins depend on inference costs, and inference costs depend on utilization. An AI services company running its workloads at 5% GPU utilization is paying roughly 10 times more per productive token than one running at 50%.
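The arithmetic behind that comparison is worth spelling out. In the sketch below, the GPU-hour price and the peak throughput are hypothetical placeholders. They cancel out of the ratio, which depends only on the two utilization rates.

```python
# Cost per productive token as a function of utilization.
# The price and throughput figures are hypothetical placeholders.

def cost_per_productive_token(gpu_hour_cost: float, peak_tokens_per_hour: float, utilization: float) -> float:
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    # You pay for the whole hour but only the utilized fraction produces tokens.
    return gpu_hour_cost / (peak_tokens_per_hour * utilization)

GPU_HOUR_COST = 4.0                 # assumed dollars per GPU-hour
PEAK_TOKENS_PER_HOUR = 1_000_000    # assumed tokens per hour at full utilization

low = cost_per_productive_token(GPU_HOUR_COST, PEAK_TOKENS_PER_HOUR, 0.05)
mid = cost_per_productive_token(GPU_HOUR_COST, PEAK_TOKENS_PER_HOUR, 0.50)
full = cost_per_productive_token(GPU_HOUR_COST, PEAK_TOKENS_PER_HOUR, 1.00)

print(round(low / mid, 6))    # 10.0
print(round(low / full, 6))   # 20.0
```

Under this simple model, cost per productive token is inversely proportional to utilization, so the gap between two operators is just the ratio of their utilization rates: tenfold for 5% versus 50%, twentyfold for 5% versus a fully utilized fleet.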

That gap is not a market subsidy. It is an OPERATIONAL FAILURE.

And it means the margin differences between AI services companies will be determined less by which model they use and more by how efficiently they use the infrastructure underneath it.

This is not a theoretical risk. AWS raised H200 Capacity Block prices by 15% in January 2026, breaking a two-decade trend of declining compute costs. When the price of waste goes up, the value of efficiency goes up with it.

V. Where Value Accrues

The memory wall, the network wall, and the orchestration gap are not independent problems. They have a dependency order.

Memory has to be solved before networking improvements fully pay off. There is no point moving data between nodes faster if each node is stalling on KV cache evictions. Networking has to be solved before orchestration software can maximize utilization. Intelligent scheduling across GPUs that cannot communicate efficiently just produces faster congestion.

For investors, this sequencing matters. The memory layer is the most derisked today. Nvidia has validated the category with ICMSP and CMX. Multiple companies are shipping product. The risk is commoditization from below as Nvidia folds caching into its own stack. The network layer is the highest-capex, longest-duration bet, but the physics gap between general-purpose and AI-native networking is wide enough that the winners will build durable moats. The orchestration layer carries the highest margin potential because it is pure software, but its full impact depends on the lower layers maturing first.

The signal for investors evaluating AI infrastructure is not price per token. It is UTILIZATION PER DOLLAR of deployed infrastructure.

VI. The Close

The AI industry spent 2024 and 2025 in a compute arms race. More GPUs. More data centers. More power. The assumption was that compute is the scarce resource and deploying more of it solves the problem.

That assumption is breaking. The scarce resource is not compute. It is the ability to use compute effectively.

The memory wall stalls GPUs waiting for data. The network wall stalls GPUs waiting for each other. The orchestration gap leaves GPUs idle because software cannot schedule work intelligently. Together, these three bottlenecks create the Utilization Gap: the difference between what AI infrastructure costs and what it actually produces.

The most expensive GPU is not the one you cannot buy. It is the one you already own that is doing nothing.

This essay was inspired by patterns observed while building inference accelerators at Untether AI (acquired by AMD), by the emerging KV cache infrastructure category, and by a growing conviction that AI infrastructure value is migrating from compute deployment to compute utilization. If you are building or investing in the layers that close the utilization gap, I would love to compare notes.

In Part II, The Inference Tax, I look at the other side of this gap: how inference economics are silently shaping which AI products get built, which features get killed, and why a new product skill, INFERENCE LITERACY, is becoming the difference between AI companies that scale and those that stall.

Previously: The Taste Gap, The Conviction Tax, Building in Public Judging in Private, When Distribution Isn’t Enough, The Pruning Principle, The $16 Trillion Flip, The Concentration Thesis, The Barrel Upgrade

Chip on My Shoulder
