Nvidia just admitted the general-purpose GPU era is ending



Nvidia’s $20 billion strategic licensing deal with Groq represents one of the first clear moves in a four-front battle over the future AI stack. 2026 is when that battle becomes obvious to enterprise builders.

For the technical decision-makers we talk to daily, the people building the AI applications and the data pipelines that drive them, this deal is a signal that the era of the one-size-fits-all GPU as the default AI inference answer is ending.

We are entering the age of the disaggregated inference architecture, where the silicon itself is being split into two different kinds to accommodate a world that demands both massive context and instant reasoning.

Why inference is breaking the GPU architecture in two

To understand why Nvidia CEO Jensen Huang dropped one-third of his reported $60 billion cash pile on a licensing deal, you have to look at the existential threats converging on his company’s reported 92% market share.

The industry reached a tipping point in late 2025: For the first time, inference (the phase where trained models actually run) surpassed training in total data center revenue, according to Deloitte. In this new “Inference Flip,” the metrics have changed. While accuracy remains the baseline, the battle is now being fought over latency and the ability to maintain “state” in autonomous agents.

There are four fronts in that battle, and every front points to the same conclusion: Inference workloads are fragmenting faster than GPUs can generalize.

1. Breaking the GPU in two: Prefill vs. decode

Gavin Baker, an investor in Groq (and therefore biased, but also unusually fluent on the architecture), summarized the core driver of the Groq deal cleanly: “Inference is disaggregating into prefill and decode.”

Prefill and decode are two distinct phases (a minimal sketch of how they differ follows this list):

  • The prefill phase: Think of this as the user’s “prompt” stage. The model must ingest vast amounts of data, whether it’s a 100,000-line codebase or an hour of video, and compute a contextual understanding. This is “compute-bound,” requiring massive matrix multiplication that Nvidia’s GPUs are historically excellent at.

  • The generation (decode) phase: This is the actual token-by-token “generation.” Once the prompt is ingested, the model generates one word (or token) at a time, feeding each back into the system to predict the next. This is “memory-bandwidth bound.” If the data can’t move from memory to the processor fast enough, the model stutters, no matter how powerful the GPU is. (This is where Nvidia was vulnerable, and where Groq’s specialized language processing unit (LPU) and its associated SRAM memory shine. More on that in a bit.)
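
To make the split concrete, here is a minimal sketch of how the two phases differ in shape. The model object and its methods are hypothetical stand-ins for illustration, not any vendor’s actual API.

```python
# Minimal sketch (illustrative only): how prefill and decode differ in shape.
# `model`, its methods, and the cache structure are hypothetical stand-ins.

def prefill(model, prompt_tokens):
    """One large, parallel pass over the whole prompt.
    Compute-bound: big matrix multiplications over thousands of tokens at once."""
    hidden, kv_cache = model.forward(prompt_tokens, cache=None)
    return hidden, kv_cache

def decode(model, hidden, kv_cache, max_new_tokens):
    """Token-by-token generation.
    Memory-bandwidth bound: every step re-reads the weights and the growing
    KV cache just to produce a single new token."""
    output = []
    next_token = model.sample(hidden)
    for _ in range(max_new_tokens):
        output.append(next_token)
        hidden, kv_cache = model.forward([next_token], cache=kv_cache)
        next_token = model.sample(hidden)
    return output
```

Disaggregation routes the first function to compute-heavy silicon and the second to memory-bandwidth-heavy silicon, rather than asking one chip to excel at both.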

Nvidia has announced an upcoming Vera Rubin family of chips that it is architecting specifically to handle this split. The Rubin CPX part of this family is the designated “prefill” workhorse, optimized for massive context windows of 1 million tokens or more. To handle that scale affordably, it moves away from the eye-watering expense of high-bandwidth memory (HBM), Nvidia’s current gold-standard memory that sits right next to the GPU die, and instead uses 128GB of a newer type of memory, GDDR7. While HBM offers high speed (though not as fast as Groq’s static random-access memory (SRAM)), its supply on GPUs is limited and its cost is a barrier to scale; GDDR7 offers a cheaper way to ingest massive datasets.

Meanwhile, the “Groq-flavored” silicon, which Nvidia is integrating into its inference roadmap, will serve as the high-speed “decode” engine. This is about neutralizing a threat from alternative architectures like Google’s TPUs and sustaining the dominance of CUDA, Nvidia’s software ecosystem that has served as its primary moat for over a decade.

All of this was enough for Baker, the Groq investor, to predict that Nvidia’s move to license Groq will cause all other specialized AI chips to be canceled, outside of Google’s TPU, Tesla’s AI5, and AWS’s Trainium.

2. The differentiated power of SRAM

At the heart of Groq’s technology is SRAM. Unlike the DRAM found in your PC or the HBM on an Nvidia H100 GPU, SRAM is etched directly into the logic of the processor.

Michael Stewart, managing partner of Microsoft’s venture fund, M12, describes SRAM as the best option for moving data over short distances with minimal energy. “The energy to move a bit in SRAM is like 0.1 picojoules or less,” Stewart said. “To move it between DRAM and the processor is more like 20 to 100 times worse.”
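
As a back-of-the-envelope illustration, here is what Stewart’s numbers imply for a hypothetical working set; the 8GB size is invented for the example, and only the per-bit energy figures come from the quote above.

```python
# Back-of-the-envelope arithmetic using the figures in Stewart's quote:
# ~0.1 picojoules per bit moved within SRAM, and 20-100x worse when the bit
# travels between DRAM and the processor. The 8 GB working set is a made-up
# example, not a measured workload.

SRAM_PJ_PER_BIT = 0.1
DRAM_PENALTY_LOW, DRAM_PENALTY_HIGH = 20, 100

working_set_bytes = 8 * 1024**3          # hypothetical 8 GB of model state
bits_moved = working_set_bytes * 8

sram_joules = bits_moved * SRAM_PJ_PER_BIT * 1e-12
print(f"SRAM: {sram_joules:.3f} J per full pass over the working set")
print(f"DRAM: {sram_joules * DRAM_PENALTY_LOW:.2f} to "
      f"{sram_joules * DRAM_PENALTY_HIGH:.2f} J per pass")
# Repeated for every generated token, that gap compounds into a real
# power and latency budget.
```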

In the world of 2026, where agents must reason in real time, SRAM acts as the ultimate “scratchpad”: a high-speed workspace where the model can manipulate symbolic operations and complex reasoning processes without the “wasted cycles” of shuttling to external memory.

However, SRAM has a major drawback: It is physically bulky and expensive to manufacture, meaning its capacity is limited compared to DRAM. This is where Val Bercovici, chief AI officer at Weka, another company offering memory for GPUs, sees the market segmenting.

Groq-friendly AI workloads, where SRAM has the advantage, are those that use small models of 8 billion parameters and under, Bercovici said. This isn’t a small market, though. “It’s just a huge market segment that was not served by Nvidia, which was edge inference, low latency, robotics, voice, IoT devices: things we want running on our phones without the cloud for convenience, performance, or privacy,” he said.

This 8B “sweet spot” matters because 2025 saw an explosion in model distillation, with many enterprise companies shrinking large models into highly efficient smaller versions. While SRAM is impractical for the trillion-parameter “frontier” models, it is ideal for these smaller, high-velocity models.
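
For readers who want the mechanics, here is a minimal sketch of the distillation idea in PyTorch. It is the generic teacher-student recipe, not any particular company’s pipeline, and the shapes are made up.

```python
# Generic teacher-student distillation sketch: a small "student" model is
# trained to match the output distribution of a large "teacher," so an
# 8B-class model keeps much of the big model's behavior at a fraction of
# the memory footprint.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between softened teacher and student token distributions."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

# Hypothetical shapes: a batch of 4 token positions over a 32k-entry vocabulary.
loss = distillation_loss(torch.randn(4, 32_000), torch.randn(4, 32_000))
```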

3. The Anthropic threat: The rise of the ‘portable stack’

Perhaps the most under-appreciated driver of this deal is Anthropic’s success in making its stack portable across accelerators.

The company has pioneered a portable engineering approach for training and inference: essentially a software layer that allows its Claude models to run across multiple AI accelerator families, including Nvidia’s GPUs and Google’s Ironwood TPUs. Until recently, Nvidia’s dominance was protected because running high-performance models outside the Nvidia stack was a technical nightmare. “It’s Anthropic,” Weka’s Bercovici told me. “The fact that Anthropic was able to … build up a software stack that could work on TPUs as well as on GPUs, I don’t think that’s being appreciated enough in the market.”

(Disclosure: Weka has been a sponsor of VentureBeat occasions.)

Anthropic recently committed to accessing up to 1 million TPUs from Google, representing over a gigawatt of compute capacity. This multi-platform approach ensures the company is not held hostage by Nvidia’s pricing or supply constraints. So for Nvidia, the Groq deal is equally a defensive move. By integrating Groq’s ultra-fast inference IP, Nvidia is ensuring that the most performance-sensitive workloads, like those running small models or real-time agents, can be accommodated inside Nvidia’s CUDA ecosystem, even as competitors try to jump ship to Google’s Ironwood TPUs. CUDA is the software Nvidia provides to developers to integrate its GPUs.

4. The agentic ‘statehood’ battle: Manus and the KV Cache

The timing of this Groq deal coincides with Meta’s acquisition of the agent pioneer Manus just two days ago. The importance of Manus was partly its obsession with statefulness.

If an agent can’t remember what it did 10 steps ago, it is useless for real-world tasks like market research or software development. The KV cache (key-value cache) is the “short-term memory” that an LLM builds during the prefill phase.

Manus reported that for production-grade agents, the ratio of input tokens to output tokens can reach 100:1. That means for every word an agent says, it is “thinking about” and “remembering” 100 others. In this environment, the KV cache hit rate is the single most important metric for a production agent, Manus said. If that cache is “evicted” from memory, the agent loses its train of thought, and the model must burn enormous energy to recompute the prompt.
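
Here is a rough sketch of what that metric means in practice. The token counts are hypothetical, chosen only to mirror the roughly 100:1 skew Manus describes.

```python
# Illustrative sketch of the metric Manus describes: how much of an agent's
# context is served from an existing KV cache versus recomputed each turn.
# All numbers below are hypothetical.

def kv_cache_hit_rate(cached_prefix_tokens: int, total_context_tokens: int) -> float:
    """Fraction of the context whose keys/values were reused instead of recomputed."""
    return cached_prefix_tokens / total_context_tokens

context_tokens = 200_000       # accumulated tool outputs, plans, prior steps
new_tokens_this_turn = 2_000   # what actually changed since the last step

reused = context_tokens - new_tokens_this_turn
print(f"hit rate: {kv_cache_hit_rate(reused, context_tokens):.1%}")
# If the cache is evicted, the hit rate collapses toward 0% and the full
# 200k-token prefill must be recomputed before the agent can emit its next token.
```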

Groq’s SRAM can serve as a “scratchpad” for these agents (though, again, mostly for smaller models) because it allows near-instant retrieval of that state. Combined with Nvidia’s Dynamo framework and the KVBM, Nvidia is building an “inference operating system” that lets inference servers tier this state across SRAM, DRAM, HBM, and other flash-based options like those from Bercovici’s Weka.

Thomas Jorgensen, senior director of Technology Enablement at Supermicro, which specializes in building clusters of GPUs for large enterprise companies, told me in September that compute is no longer the main bottleneck for advanced clusters. Feeding data to the GPUs is the bottleneck, and breaking that bottleneck requires memory.

“The whole cluster is now the computer,” Jorgensen said. “Networking becomes an internal part of the beast … feeding the beast with data is becoming harder because the bandwidth between GPUs is growing faster than anything else.”

This is why Nvidia is pushing into disaggregated inference. By separating the workloads, enterprise applications can use specialized storage tiers to feed data at memory-class performance, while the specialized “Groq-inside” silicon handles high-speed token generation.

The decision for 2026

We are entering an era of extreme specialization. For decades, incumbents could win by shipping one dominant general-purpose architecture, and their blind spot was typically what they ignored at the edges. Intel’s long neglect of low-power is the classic example, M12’s Stewart told me. Nvidia is signaling it won’t repeat that mistake. “If even the leader, even the lion of the jungle, will buy expertise, will buy technology, it’s a sign that the whole market is just wanting more options,” Stewart said.

For technical leaders, the message is to stop architecting your stack like it’s one rack, one accelerator, one answer. In 2026, the advantage will go to the teams that label workloads explicitly and route them to the right tier (a minimal routing sketch follows this list):

  • prefill-heavy vs. decode-heavy

  • long-context vs. short-context

  • interactive vs. batch

  • small-model vs. large-model

  • edge constraints vs. data-center assumptions
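
Here is a minimal sketch of what that labeling and routing could look like. Every tier name and threshold is invented for illustration; the point is that the labels, not the chip brand, drive the decision.

```python
# A minimal sketch of "GPU strategy as a routing decision." Every label, tier
# name, and threshold below is invented for illustration.

from dataclasses import dataclass

@dataclass
class Workload:
    prefill_heavy: bool     # long prompt ingestion vs. mostly generation
    context_tokens: int     # long-context vs. short-context
    interactive: bool       # interactive vs. batch
    model_params_b: float   # small-model vs. large-model, in billions
    on_edge: bool           # edge constraints vs. data-center assumptions

def route(w: Workload) -> str:
    if w.on_edge and w.model_params_b <= 8:
        return "edge-sram-accelerator"     # small, latency-critical, local
    if w.prefill_heavy and w.context_tokens > 100_000:
        return "prefill-tier-gddr7"        # cheap capacity for long context
    if w.interactive and w.model_params_b <= 8:
        return "decode-tier-sram"          # fast token generation, small model
    if not w.interactive:
        return "batch-tier-hbm-gpu"        # throughput-oriented general GPUs
    return "general-gpu-pool"

print(route(Workload(True, 250_000, True, 70, False)))   # -> prefill-tier-gddr7
print(route(Workload(False, 4_000, True, 8, True)))      # -> edge-sram-accelerator
```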

Your architecture will follow these labels. In 2026, “GPU strategy” stops being a purchasing decision and becomes a routing decision. The winners won’t ask which chip they bought; they’ll ask where every token ran, and why.





