When Inference Becomes a Scarce Resource, Who Captures the Value?

Author: Frank Fu, IOSG

The gap that David Cahn identified in 2023 was never filled on the training side. It was filled on the inference side, and the market has only begun to price that in over the last few weeks. With Nvidia restructuring its financial statements around "service tokens" and Cerebras' IPO 20 times oversubscribed, the bottleneck debate is over. The real question now is: when inference becomes a scarce resource, where in the computing stack will value accrue?

I. Following the GPU: From the $200 Billion Problem to the $600 Billion Problem

In 2023, David Cahn of Sequoia raised the question hanging over the entire AI buildout: the "$200 billion problem." For every $1 spent on a GPU, about another $1 is spent on powering it in a data center, and the end users of that compute need to earn a margin of their own. Cahn's calculation started from a GPU revenue run rate of about $50 billion for Nvidia, which implied roughly $100 billion of data center spending and, at a 50% end-user margin, about $200 billion in revenue that each year's GPU CapEx must ultimately generate to recoup the capital. Even with very generous assumptions about AI revenue, he still found a gap of over $125 billion between "investment" and "actual end-customer payments." The concern was straightforward: GPUs were being overbuilt ahead of actual demand.

A year later, the gap had not narrowed; it had widened. In his 2024 sequel, Cahn recast it as the "$600 billion problem" as hyperscalers' CapEx ballooned. The bearish logic converged into a familiar shape: overbuilding leads to oversupply, and oversupply burns capital.

Both articles essentially ask the same question: who will fill this gap? The answer never appeared on the "training" side of the ledger. It appeared on the inference side, and the market has only started pricing it in over the last few weeks.

II. Cerebras IPO and the Inference Squeeze

Cerebras went public on Thursday. The IPO was 20 times oversubscribed, with the price nearly double Wednesday's final pricing.
The demand doesn't stem from bets on the "next Nvidia killer," but from something simpler: the market is realizing that the real bottleneck in AI is inference, not training. Cerebras's strength is a chip architecture that enables extremely fast inference. Not training, but inference. This is precisely what excites Wall Street.

The inference market is recurring; it expands with usage. Every time Claude answers a question, every time an agent performs a task, it consumes computing power. Training happens once; inference never stops. JP Morgan estimates the inference market to be 10 to 50 times larger than the training market. When machines begin performing tasks assigned by other machines (agent-driven expansion), the demand for inference no longer scales with the number of users, but with computing power itself.

III. Nvidia Redraws Its Map: Inference Takes Center Stage

If Cerebras was the market's awakening, then Nvidia's latest quarterly earnings report is confirmation from the top of the supply chain. On the latest earnings call, Jensen Huang made it clear: AI demand is growing parabolically. The reason is simple: agentic AI has arrived. Mainstream AI has moved from one-shot inference to reasoning, and is now entering the agent stage, where it can call tools and orchestrate tasks on its own. Huang said, "Tokens are now profitable." In the AI era, computing power equals revenue and profit.

This has reshaped the entire industry. Training is the one-time cost of building a model, while inference is the recurring cost of running it. The bottleneck now lies in inference, not training.

Nvidia has built this assessment into its financial statements. It now reports its business as two platforms, not one: Data Center and Edge Computing.
Data Center (approximately $75 billion in the quarter, up 92% year-over-year) was further broken down into Hyperscale (approximately $38 billion, up 12% quarter-over-quarter) and ACIE, namely AI Cloud, Industrial & Enterprise (approximately $37 billion, up 31% quarter-over-quarter). The new line is Edge Computing: $6.4 billion, up 29% year-over-year, covering the endpoints where agentic and physical AI actually run, such as PCs, workstations, AI-RAN base stations, robots, and automobiles.

Edge still accounts for less than 8% of total revenue, but Nvidia has elevated it to a "second platform" alongside Data Center. This signals that inference is splitting into two fronts: cloud inference in data centers, and endpoint inference at the edge, which lets AI see, move, and act in the physical world.

The roadmap follows the same logic: Vera Rubin, shipping from Q3 onwards, boasts inference throughput up to 35 times that of Blackwell; Huang has also assigned Vera CPUs, designed for agentic workloads, a new $200 billion TAM. Every leading model company is expected to fully migrate to it on day one.

When the world's most valuable company restructures its financial disclosures around "service tokens," the bottleneck debate is over. The remainder of this article discusses who captures the value when inference (not training) becomes the scarce resource.

First, let's define the scope. Of these two fronts, this article discusses cloud inference, i.e., rented data center GPUs that serve tokens through APIs. Endpoint inference runs on the device's own local chip (Nvidia's Jetson, RTX, Drive, AI-RAN), completely bypassing the underlying GPU leasing and aggregation stack. Here, treat it as a tailwind that amplifies the entire inference economy and supports the bottleneck argument, rather than as the market where Hyperbolic and Venice operate, since those two sit entirely on the cloud front.

IV. The Squeeze Has Arrived

Anthropic is the canary in the coal mine.
Usage far exceeds provisioned capacity, and complaints about Claude being "lobotomized" flood the internet, citing rate-limited responses, slower inference, and compressed context windows. The solution is raw computing power: in May 2026, Anthropic took over the entire Colossus 1 data center from SpaceX, with over 220,000 Nvidia GPUs and over 300 megawatts, dedicating it specifically to inference, not training.

This capacity unlocked a series of limit changes, each a signal. On May 6th, Anthropic doubled the five-hour limit for Claude Code, removed peak-hour rate limiting, and significantly increased the API rate limit for Opus. On May 13th, Claude Code's weekly limit was increased by another 50% (until July 13th). Then, starting June 15th, it did the opposite of this "generosity": it removed agentic and programmatic usage (Agent SDK, headless claude -p mode, CI pipelines) from the flat subscription and placed it in a separately metered credit pool ($20 to $200 per month, billed at API rates).

This final step condenses the entire argument into one action: agents consume inference far faster than a flat subscription was designed to absorb, so that usage must be priced according to its inherent "recurring cost."

Training is a one-time capital expenditure. Inference is a recurring operating cost that compounds with each new user and each new agent.

V. The Stack: Six Layers, One Bottleneck

Every AI application sits on a supply chain that starts at the TSMC wafer fab and ends at the API endpoint. Most companies own only one layer. Nvidia has silicon, CoreWeave has bare metal, Together AI has inference optimization, and OpenRouter has model API routing. There is only one exception.

VI. Hyperbolic: The Only Company Spanning All Three Layers

Hyperbolic launched its on-demand GPU marketplace in June 2025.
In its first few months, it surpassed 200,000 developers, with adoption spanning frontier AI labs, search, and large consumer platforms.

What makes it interesting is its architecture. Hyperbolic doesn't own a single GPU. Every card comes from neoclouds and data centers, including CoreWeave, Lambda Labs, Nebius, and smaller operators with idle capacity. This might sound like a weakness, but it's actually a moat. By sitting between GPU suppliers and consumers, Hyperbolic can see real-time data that others can't. It knows who is buying which GPUs, at what price, and at what time. It sees oversupply before it becomes public and demand surges before they hit the market.

Today, its moat is this multi-cloud aggregation itself. Hyperbolic stitches together fragmented capacity from dozens of independent clouds and data centers into a standardized, unified pool, allowing developers to rent the cheapest available GPUs anywhere without negotiating with each operator or managing a pile of accounts. The more clouds it connects to, the deeper the liquidity and the richer the pricing data. Further down the line, the team is exploring how to use this data to model GPU price curves and eventually deploy its own capital to smooth supply and demand, acting as a market maker for physical computing power; but that goal is still in its early stages, and what truly compounds today is the aggregation layer.

This is the flywheel: more clouds connected → more aggregated supply → deeper market and real-time pricing data → smarter routing.

Hyperbolic is the only company that spans the GPU leasing layer, deployment layer, and model API layer simultaneously.

VII. Venice: A Mirror

Venice is the clearest embodiment of the inference economy at the application layer, and a useful contrast to Hyperbolic's position. It's a privacy-first inference application: an OpenAI-compatible API, plus a consumer subscription, routing requests to approximately 75 models.
Crucially, Venice itself doesn't own meaningful computing power. It rents from undisclosed GPU partners and confidential computing providers, and pays frontier labs for pass-through access, so its real cost of revenue is inference computing power, not SaaS hosting.

What Venice really sells is privacy. This "privacy" doesn't mean turning public computing power into private property, but rather wrapping commoditized inference in a guarantee: no data retention, no use for training, and anonymous requests. The underlying computing power is readily available; what Venice adds is this privacy layer. Venice's gross profit = subscription price – inference costs paid down the stack, and whatever it can charge above the bare API price is supported almost entirely by this privacy premium. This is a real business, but a low-margin one, its economics constrained by the computing power it purchases.

This is precisely why Hyperbolic sits a level above it. If Venice is a gas station, Hyperbolic is an oil refinery.

VIII. Why This Matters Right Now

Nvidia restructured its financial reporting around "service tokens." Cerebras' IPO proved the market understands that inference is the bottleneck. Anthropic's scramble for capacity demonstrates that the problem is real. Agentic and physical AI will amplify demand by orders of magnitude, across both cloud and edge computing.

This also closes the loop on the "$600 billion problem" from another angle. Cahn's bearish logic (overbuilding followed by oversupply) may well be validated in the end. But oversupply is precisely the best market for asset-light aggregators: when GPU prices decline and supply is fragmented across dozens of clouds, the player that owns no hardware and routes every workload to the cheapest available card profits from the price difference. Hyperbolic is positioned to gain from oversupply, not to be hurt by it.
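
The aggregator logic described above, routing each workload to the cheapest card that can serve it and keeping the price difference, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the provider names, prices, GPU model, and the `Offer`, `route_cheapest`, and `hourly_spread_cents` helpers are hypothetical and do not describe Hyperbolic's actual system or data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Offer:
    """One supplier's quote for a GPU type (all values are made up)."""
    provider: str
    gpu_model: str
    cents_per_gpu_hour: int  # integer cents avoid float rounding noise
    gpus_available: int


def route_cheapest(offers: List[Offer], gpu_model: str, gpus_needed: int) -> Optional[Offer]:
    """Return the lowest-priced offer that has enough GPUs of the requested model."""
    candidates = [
        o for o in offers
        if o.gpu_model == gpu_model and o.gpus_available >= gpus_needed
    ]
    # default=None covers the case where no supplier can serve the request
    return min(candidates, key=lambda o: o.cents_per_gpu_hour, default=None)


def hourly_spread_cents(offer: Offer, resale_cents_per_gpu_hour: int, gpus_needed: int) -> int:
    """Aggregator margin per hour: resale price minus supplier price, times GPUs rented."""
    return (resale_cents_per_gpu_hour - offer.cents_per_gpu_hour) * gpus_needed


# Hypothetical order book for one GPU model across three clouds.
offers = [
    Offer("cloud-a", "H100", 210, 16),
    Offer("cloud-b", "H100", 185, 8),
    Offer("cloud-c", "H100", 160, 4),  # cheapest, but too small for an 8-GPU job
]

best = route_cheapest(offers, "H100", gpus_needed=8)
spread = hourly_spread_cents(best, resale_cents_per_gpu_hour=200, gpus_needed=8)
print(best.provider, spread)  # cloud-b 120
```

In this toy order book the cheapest quote cannot fill the job, so the router falls back to the next cheapest supplier with enough capacity. The more fragmented the supply and the wider the price dispersion, the larger the spread an asset-light router can capture, which is the article's argument for why oversupply favors aggregators.
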
The company that ultimately wins will not be the one with the most GPUs, but the one that can tell you where the GPUs are, at what price they are available, and route every workload to where it can run at the lowest cost. Hyperbolic is building such a company. It owns no GPUs, it is pure software, and it spans three layers, yet it is building the aggregation layer for inference computing power. [IOSG]

RichSilo Exclusive Analysis:

When Inference Becomes the Bottleneck: The Crypto Opportunities in AI’s $600 Billion Problem

The crypto market’s obsession with AI has largely centered on memecoins, liquid staking derivatives, and AI-driven trading bots. Meanwhile, a fundamental shift has occurred in the underlying AI infrastructure that most market participants have overlooked. Frank Fu’s analysis from IOSG points to a critical inflection point: the bottleneck in AI has moved decisively from training to inference, and the market has only recently begun to price in what that means for the “$600 billion problem.”

The Shift from Training to Inference

David Cahn’s “$200 billion problem” has ballooned into a “$600 billion problem” as hyperscale vendors’ capital expenditures continue to outstrip actual end-user revenue. While the market initially feared GPU overbuilding would lead to oversupply and capital destruction, the real answer to filling this gap emerged not on the training side but on the inference side.
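
The arithmetic behind these two headline figures can be reproduced with a short Python sketch. The $200 billion requirement and the $125 billion gap come from the article above; the GPU run rates of roughly $50 billion and $150 billion and the 50% end-user margin are the inputs generally attributed to Cahn's two pieces and should be treated here as assumptions. The function name `required_revenue` is hypothetical.

```python
def required_revenue(gpu_spend_bn: float,
                     datacenter_multiplier: float = 2.0,
                     end_user_margin: float = 0.5) -> float:
    """Revenue (in $bn) needed to pay back one year of GPU spending.

    datacenter_multiplier: each $1 of GPU implies about $1 more to run it.
    end_user_margin: gross margin the buyers of that compute need to earn.
    """
    total_cost = gpu_spend_bn * datacenter_multiplier
    return total_cost / (1.0 - end_user_margin)


# 2023 framing: assumed ~$50bn GPU run rate.
required_2023 = required_revenue(50.0)
# The article cites a gap of over $125bn, which implies roughly this much
# AI revenue was being credited under generous assumptions.
implied_revenue_2023 = required_2023 - 125.0

# 2024 framing: assumed ~$150bn GPU run rate.
required_2024 = required_revenue(150.0)

print(required_2023, implied_revenue_2023, required_2024)  # 200.0 75.0 600.0
```

The point of the sketch is that the requirement scales linearly with GPU spending: tripling the run rate triples the revenue that has to show up somewhere, which is why the question of where that revenue comes from (training or inference) matters so much.
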

This isn’t merely an academic debate—it’s being validated by the market’s most significant players. Nvidia’s restructuring of its financial statements around “service tokens” and Cerebras’ 20x oversubscribed IPO aren’t coincidences. They signal that Wall Street and the semiconductor industry now recognize inference as the true bottleneck. As Jensen Huang stated, “Tokens are now profitable,” a remarkable admission that computing power has become directly monetizable through API services.

The inference market is fundamentally different from training—it’s recurring rather than one-time, expanding with each user interaction and agent deployment. JP Morgan estimates it to be 10-50x larger than the training market, and as AI transitions from basic inference to agent-based orchestration, the demand curve becomes nearly vertical.

Tokenization Opportunities in the Inference Stack

For crypto investors, this shift creates several compelling opportunities:

1. Compute Resource Tokenization

The most direct implication is the potential tokenization of compute resources. Nvidia’s “service tokens” are units of model output rather than crypto assets, but treating compute as a metered, priced commodity suggests we may eventually see financial instruments representing claims on GPU capacity. In crypto, this could evolve into:
– GPU capacity futures or perpetual swaps
– Tokens representing shares in decentralized GPU farms
– Staking mechanisms that allocate idle compute resources to the highest bidder

2. Decentralized GPU Marketplaces

Projects that create decentralized marketplaces for GPU resources could capture significant value as the market becomes increasingly fragmented. The success of Hyperbolic, which aggregates capacity from multiple cloud providers without owning any hardware, demonstrates the power of the aggregator model. Crypto-native implementations could:
– Enable individuals to monetize idle GPU capacity
– Provide transparent pricing mechanisms for compute resources
– Create liquidity markets for GPU capacity

3. AI Agent Oracles and Data Markets

As AI agents begin performing tasks for other agents, the need for reliable, real-time data becomes critical. Crypto projects that provide the following could become indispensable infrastructure:
– Decentralized oracle networks for AI data
– Tokenized datasets for training and fine-tuning
– Privacy-preserving data markets for sensitive AI training data

4. Optimizing Routing Through Token Incentives

The article correctly identifies that the winner won’t be the company with the most GPUs, but the one that can route workloads to the cheapest available resources. This creates an opportunity for:
– Token-based routing protocols that optimize workloads across providers
– Mechanisms for price discovery in fragmented compute markets
– Incentive structures for idle capacity utilization

Risks and Challenges

Despite the compelling thesis, several risks deserve consideration:

Centralization Risk: Even as the crypto community builds decentralized solutions, the underlying hardware remains concentrated in the hands of a few large providers. This centralization could limit the effectiveness of decentralized approaches.

Market Timing: The market may be ahead of itself in pricing in the inference opportunity. The bear case presented by Cahn—that overbuilding leads to oversupply—could still materialize, creating short-term headwinds for infrastructure providers.

Regulatory Uncertainty: As AI becomes more critical to infrastructure, regulatory scrutiny will likely increase. Projects operating at the intersection of AI and crypto could face unique regulatory challenges.

Technical Obsolescence: The pace of AI development means today’s optimal solutions could become obsolete quickly. Projects need continuous innovation to maintain their value proposition.

Hyperbolic’s Position and the Aggregator Moat

The analysis highlights Hyperbolic as an interesting case study—a company that spans the GPU leasing, deployment, and model API layers without owning any hardware. This “asset-light” approach creates a powerful moat as the market becomes more fragmented.

For crypto investors, this suggests that the most valuable projects may not be those building physical infrastructure but those creating the economic layers that optimize resource allocation. The real value lies in the coordination and optimization of existing resources, not necessarily in owning them.

Conclusion

The shift from training to inference represents a fundamental reordering of value in the AI stack. For crypto investors, this creates opportunities to build the financial and economic layers that will enable the efficient allocation of compute resources. The most promising projects will likely be those that can create liquidity in fragmented markets, optimize routing across providers, and enable the tokenization of compute capacity.

As the inference squeeze accelerates and AI agents begin consuming resources at an exponential rate, the crypto market’s role in creating the financial infrastructure for this new paradigm will become increasingly important. The $600 billion problem may ultimately be solved not by building more hardware, but by creating more efficient markets for existing resources.
