The Inference Tax

How inference economics shape which AI products get built, which features get killed, and why inference literacy is becoming a core product skill.

May 10, 2026

I. The Other Side of the Gap

In The Utilization Gap, I argued that the AI industry has a utilization crisis, not a compute crisis. That essay was about infrastructure: the memory wall, the network wall, the orchestration gap, and the companies closing each one.

This essay is about what is happening on the other side. The utilization gap does not just waste hardware. It shapes products and the user experience.

Every AI product on the market today is a compromised version of what its builders wanted to ship. The compromise is not model capability. The models can do more than most products allow. The compromise is economics. Inference cost structure acts as an invisible tax on product ambition, quietly killing features, capping context, truncating memory, and degrading quality in ways the end user never sees and the product team rarely names.

I am going to call it THE INFERENCE TAX.

The tax is not a line item on a P&L. It is the delta between what the product could do and what inference economics allow it to do.

II. The Pareto Frontier Is a Product Decision

Inference systems live on a three-way tradeoff: accuracy, latency, and cost. Pick two, engineer around the third. This is not a new idea. It is the Pareto frontier applied to AI serving. A larger model improves accuracy but increases cost and latency. A smaller model reduces cost but risks quality degradation. Optimizing for speed can sacrifice reasoning depth. The frontier shifts outward as engineering matures, but the tradeoff never disappears.

The infrastructure world understands this well. But this is not the first time an infrastructure constraint has silently shaped a product.

When Steve Jobs unveiled the original iPhone in 2007, he announced that third-party apps would be web-based. The public narrative was forward-looking: web apps were the future, open and universal. The real reason was constraint. Apple did not have the SDK, the security model, or the review infrastructure ready for native third-party apps. The product shipped a compromised version of itself because the platform underneath was not ready, and the compromise was presented as a deliberate design philosophy. The App Store launched a year later, when the constraints were solved.

Early Netflix tells a similar story. Streaming launched with a thin catalog of older titles, not because Netflix wanted a limited library, but because licensing costs for new releases were prohibitive at Netflix's streaming unit economics. The DVD business subsidized streaming until the cost structure improved. Users saw “limited selection.” The real constraint was economics per stream.

The inference tax follows the same pattern. Every product decision in an AI-native company is a position on the Pareto frontier. Great PMs know it; most do not.

Choosing a larger model is choosing higher cost or higher latency. Choosing a longer context window is choosing higher KV cache pressure. Choosing an agentic workflow is multiplying all three, because a single user interaction with an agent can trigger dozens of sequential inference calls, each consuming context that grows over the session.
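To make the multiplication concrete, here is a minimal sketch in Python. It assumes every call in an agent session re-sends the full accumulated context and that nothing is served from a prompt cache. The function name and the token counts are illustrative assumptions, not measurements from any product.

```python
def agent_session_input_tokens(base_context: int,
                               tokens_added_per_step: int,
                               steps: int) -> int:
    """Total input tokens processed across a session of sequential calls.

    Assumption: each call re-reads the whole context so far, and each step
    appends a fixed number of tokens (tool output, model reply) to it.
    """
    total = 0
    context = base_context
    for _ in range(steps):
        total += context                  # this call pays for everything so far
        context += tokens_added_per_step  # the next call starts from a longer context
    return total


# Hypothetical numbers: a 2,000-token prompt, 1,500 tokens added per step.
single_call = agent_session_input_tokens(2_000, 1_500, steps=1)
agent_run = agent_session_input_tokens(2_000, 1_500, steps=24)

print(f"one call:      {single_call:,} input tokens")
print(f"24-step agent: {agent_run:,} input tokens")
print(f"multiplier:    {agent_run / single_call:.0f}x")
```

Under these assumptions the cost grows with the square of the step count, not linearly, because every step pays again for the context accumulated by all the steps before it.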

Most product teams treat these as independent decisions. Model selection happens in one meeting. Context window defaults get set in another. Latency targets land in a PRD with no connection to either. They are not independent. They are positions on a cost surface, and the shape of that surface is determined by the utilization gap underneath.

Every AI application making decisions about context window size, agent memory depth, or multi-turn conversation length is implicitly making a decision about memory infrastructure. Most product builders do not know this. They see “slow” or “expensive” and scope their features accordingly. They cap context windows. They truncate agent memory. They design around limitations they cannot name. Just as Jobs called web apps a philosophy and Netflix called a thin catalog a launch strategy, today’s AI product teams call inference constraints “MVP scoping.”

III. Three Products, Three Taxes

The inference tax is not theoretical. It is visible in the product architecture of every major AI-native company. Here are three.

Cursor: Cursor, the AI code editor built by Anysphere, has reportedly crossed $2 billion in annualized revenue. In June 2025, the company switched from flat “fast request” allotments to usage-based credit pools tied to actual token consumption. The reason was explicit: as users leaned on expensive frontier models, flat per-request pricing was unsustainable.

The product response was architectural. Cursor built Auto mode, a model router that selects the cheapest adequate model for each task. Auto mode is unlimited on paid plans. Manual selection of premium models like Claude Sonnet or GPT-4o burns credits. Tab autocomplete runs on a separate fast, cheap model that never touches the credit pool.

The result: the product itself is an inference cost optimization layer. Users who understand the economics use Auto mode and stretch their credits indefinitely. Users who do not, selecting frontier models for every task, burn through their entire monthly budget in days. As one developer put it, using Claude Opus for a simple CSS fix “is like hiring a Formula 1 driver to deliver pizza.”

OpenAI: When OpenAI launched GPT-4o mini in July 2024, it replaced GPT-3.5 as the default free-tier model in ChatGPT. The framing was “advancing cost-efficient intelligence.” The reality was inference economics: GPT-4o mini costs 99% less per token than the original text-davinci-003.

The entire ChatGPT tier structure is an inference cost ladder. Free users get a capable but cheap model. Plus subscribers get frontier access. The API extends this further: GPT-5 nano at $0.05 per million input tokens on one end, GPT-5.4 Pro at $30 per million on the other. A 600x price spread across the same product family. The tiers are not feature gates. They are inference economics made visible.

Perplexity: Perplexity runs multiple frontier models through a “Model Council” that cross-checks answers across providers. But the inference tax shows up in the constraints. Context windows are smaller when you access models through Perplexity than when you use them directly. Users on Reddit have reported limits on premium model usage being changed without transparent communication. The $200/month Max tier gates the most powerful agentic features, including access to o3-pro and Perplexity Computer. Free users get roughly 5-10 Pro searches per day.

This is inference cost managed through feature gating and context window throttling. Does the user ever see the constraint? No. They just see “upgrade.”

The pattern across all three: every one of these companies has built a product architecture that is fundamentally an inference cost management strategy, from model routing to tiered access to context caps to credit systems. These are not product features. They are inference tax strategies.

IV. The Invisible Product Constraint

The three examples above are the visible cases: the companies that manage the inference tax well enough that you can see the architecture. The more common case is the company that does not manage it at all.

Context length is the sharpest lever. A product team defaulting to the maximum context window supported by a model is making an infrastructure cost decision without knowing it. KV cache memory scales linearly with context length and concurrency. Fifteen users at 32K context create the same memory pressure as sixty users at 8K. At 128K+ context, even a single long-running session can materially reduce how many concurrent users the system can serve. The model weights never changed. KV cache growth drove the constraint.
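The arithmetic behind that equivalence can be sketched in a few lines of Python. The model configuration below (32 layers, 8 key-value heads, head dimension 128, 16-bit cache values) is a hypothetical example chosen for round numbers, and "32K" is taken to mean 32 × 1,024 tokens. Real deployments differ in quantization, attention layout, and paging overhead.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """KV cache size for one sequence.

    The leading 2 counts one key tensor and one value tensor per layer.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical model configuration (an assumption, not any specific model).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128


def fleet_kv_gib(users: int, context_tokens: int) -> float:
    """Total KV cache in GiB for `users` concurrent sessions at full context."""
    per_user = kv_cache_bytes(context_tokens, LAYERS, KV_HEADS, HEAD_DIM)
    return users * per_user / 2**30


K = 1024
print(fleet_kv_gib(15, 32 * K))   # fifteen users at 32K
print(fleet_kv_gib(60, 8 * K))    # sixty users at 8K
print(fleet_kv_gib(1, 128 * K))   # one long-running session at 128K
```

With this configuration both fleets need 60 GiB of KV cache, and a single 128K session needs 16 GiB on its own. Memory is the product of users and tokens, so quadrupling the default context quietly divides concurrent capacity by four.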

Latency budgets work the same way. A product manager specifying a 200-ms response time for an agentic workflow has no visibility into whether that budget is being consumed by model inference, by KV cache retrieval, or by data movement between nodes. The infrastructure is a black box. The PM just sees “slow” and scopes the feature down.

Response design is another. Verbose responses consume more GPU cycles per request, reducing how many requests each GPU can serve per hour. A product that defaults to long, detailed answers when the user wanted two sentences is not just a UX problem. It is a cost problem. Token discipline is cost discipline, even when no one on the product team is tracking it.
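A back-of-the-envelope version of that claim, sketched in Python. The throughput figure is a made-up placeholder for one GPU's aggregate decode rate, and the calculation ignores prompt processing and batching effects; it only shows how response length divides into a fixed token budget.

```python
def responses_per_gpu_hour(decode_tokens_per_second: float,
                           output_tokens_per_response: int) -> float:
    """Responses one GPU can generate per hour at a fixed decode rate."""
    return decode_tokens_per_second * 3600 / output_tokens_per_response


THROUGHPUT = 2_500  # hypothetical aggregate decode tokens/sec for one GPU

verbose = responses_per_gpu_hour(THROUGHPUT, 600)  # long, detailed answer
concise = responses_per_gpu_hour(THROUGHPUT, 60)   # roughly two sentences

print(f"verbose: {verbose:,.0f} responses per GPU-hour")
print(f"concise: {concise:,.0f} responses per GPU-hour")
```

At a fixed decode rate, a response ten times longer means one tenth as many responses from the same hardware, which is the sense in which a default answer length is a cost decision.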

Microsoft’s Azure engineering team arrived at the same conclusion from the infrastructure side. They published a three-part series in March 2026 calling inference “a capital allocation problem” and arguing that product design decisions directly affect inference economics. When the infrastructure layer starts telling product teams to pay attention to token consumption, the inference tax is no longer invisible. It is becoming a line item.

The claim: the AI products we use today are not shaped primarily by model capability. They are shaped by inference economics. The features that get built are the features the cost structure allows. The features that get killed are the ones the cost structure forbids. Most product teams cannot name this dynamic. They call it “performance” or “cost optimization” or “MVP scoping.” It is the inference tax.

V. Inference Literacy

Here is where the argument moves from diagnosis to prescription.

Two PMs facing the same inference budget will make radically different product decisions. One caps context at 32K because “it is too expensive.” The other restructures the architecture, implements semantic caching, routes simple tasks to a model that costs 1/600th of the frontier, and ships 1M context at the same cost. The difference is not budget. It is INFERENCE LITERACY.

Inference literacy is the ability to read the cost surface underneath your product and make architectural decisions that expand what is possible within the same economic constraints. It is knowing that semantic caching can cut 40% of redundant API calls for a customer support product. That model routing can send a CSS autocomplete to a model that costs fractions of a cent while reserving the frontier model for multi-file refactors. That context window management is a product design decision, not an infrastructure default.
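Here is a toy sketch in Python of what reading the cost surface can look like. It is not how Cursor or anyone else implements routing. The task labels, model names, and rule-based router are hypothetical, the two prices are the ends of the API spread quoted earlier, and caching is modeled only as a hit rate applied evenly across requests, with cache hits assumed to cost nothing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    usd_per_million_input_tokens: float


# Placeholder names; prices are the two ends of the spread quoted in the essay.
CHEAP = Model("small-model", 0.05)
FRONTIER = Model("frontier-model", 30.00)

SIMPLE_TASKS = {"autocomplete", "css_fix", "faq"}  # hypothetical task labels


def route(task_kind: str) -> Model:
    """Toy rule-based router: cheap model for simple tasks, frontier otherwise."""
    return CHEAP if task_kind in SIMPLE_TASKS else FRONTIER


def workload_cost(requests, cache_hit_rate: float = 0.0,
                  use_router: bool = True) -> float:
    """Input-token cost in USD for an iterable of (task_kind, input_tokens)."""
    total = 0.0
    for task_kind, input_tokens in requests:
        model = route(task_kind) if use_router else FRONTIER
        total += model.usd_per_million_input_tokens * input_tokens / 1_000_000
    # Assumption: cache hits are free and spread evenly across the workload.
    return total * (1.0 - cache_hit_rate)


# Hypothetical workload: 70 simple requests, 30 complex, 2,000 tokens each.
requests = [("css_fix", 2_000)] * 70 + [("multi_file_refactor", 2_000)] * 30

baseline = workload_cost(requests, use_router=False)
optimized = workload_cost(requests, cache_hit_rate=0.4)
print(f"everything on frontier: ${baseline:.2f}")
print(f"routed and cached:      ${optimized:.2f}")
```

On this made-up workload the routed and cached version costs under a fifth of the baseline. Almost all of the remaining spend is the 30 complex requests, which is where the frontier model is supposed to go.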

Cursor’s Auto mode is inference literacy baked into product architecture. OpenAI’s model ladder is inference literacy baked into pricing. Perplexity’s Model Council is inference literacy baked into search quality. Each company arrived at it differently. All three discovered the same thing: the product cannot be separated from its inference economics.

This is a new product skill. Most product organizations do not have it. The PM who defaults every request to the most expensive model is making the same mistake as a filmmaker who shoots every scene in IMAX. It works, but you are paying IMAX prices for a conversation that could have been a phone call. The skill is matching the tool to the moment.

In The Taste Gap, I argued that taste is the scarcest resource when execution commoditizes. Inference literacy is taste applied to the cost surface. It is knowing which corners to cut and which to protect. Which queries deserve the frontier model and which can be served by something 600x cheaper without the user noticing. That judgment, applied at scale across millions of requests per day, is the difference between an AI company with real margins and one subsidizing every interaction.

VI. The Close

The Utilization Gap asked: how do we use the compute we already have? This essay asks: what does it cost when we do not?

In 2018, Bill Gates went on The Ellen DeGeneres Show and tried to guess the price of grocery staples. He guessed $22 for a bag of Totino’s Pizza Rolls. The actual price was $8.98. The audience laughed. Here was the richest man in the world, a genius at building systems, completely disconnected from what things cost at the level where ordinary decisions get made.

Most AI PMs are Bill Gates at the grocery store. They have access to the most powerful models ever built. They cannot tell you what a single inference call costs their business, or how that cost shapes the product their users experience. They guess. They scope down. They call it “MVP.”

The inference tax is not going away. Inference costs will decline, but usage will scale faster. The tax will shift, not disappear. The product leaders who understand it will build companies with structurally better economics. Not because they spend less, but because they spend better.

Bill Gates could afford to be wrong about grocery prices. Most AI companies cannot.

This essay was inspired by the emerging practice of inference-aware product design at companies like Cursor and Perplexity, by patterns observed across 200+ company evaluations through Guddi Growth, and by a growing conviction that the intersection of infrastructure economics and product judgment is the most underleveraged skill in AI. If you are building products where inference cost shapes your roadmap, I would love to compare notes.

Previously: The Taste Gap, The Conviction Tax, Building in Public Judging in Private, When Distribution Isn’t Enough, The Pruning Principle, The $16 Trillion Flip, The Concentration Thesis, The Barrel Upgrade, The Utilization Gap

Chip on My Shoulder
