Processing 200,000 tokens with a large language model is expensive and slow: the longer the context, the faster the costs spiral. Researchers at Tsinghua University and Z.ai have built a technique called IndexCache that cuts up to 75% of the redundant computation in sparse attention models, delivering up to 1.82x faster time-to-first-token and 1.48x faster generation throughput at that context length.
The technique applies to models using the DeepSeek Sparse Attention architecture, including the latest DeepSeek and GLM families. It could help enterprises deliver faster user experiences for production-scale, long-context models, a capability already demonstrated in preliminary tests on the 744-billion-parameter GLM-5 model.
The DSA bottleneck
Large language models rely on the self-attention mechanism, a process in which the model computes the relationship between each token in its context and all the preceding ones to predict the next token.
However, self-attention has a severe limitation: its computational complexity scales quadratically with sequence length. For applications requiring extended context windows (e.g., large document processing, multi-step agentic workflows, or long chain-of-thought reasoning), this quadratic scaling leads to slow inference speeds and significant compute and memory costs.
Sparse attention offers a principled answer to this scaling problem. Instead of calculating the relationship between each token and all preceding ones, sparse attention streamlines the process by having each query select and attend to only the most relevant subset of tokens.
DeepSeek Sparse Attention (DSA) is a highly efficient implementation of this idea, first introduced in DeepSeek-V3.2. To determine which tokens matter most, DSA introduces a lightweight "lightning indexer" module at each layer of the model. This indexer scores all preceding tokens and selects a small subset for the core attention mechanism to process. By doing this, DSA cuts the heavy core attention computation from quadratic to linear, dramatically speeding up the model while preserving output quality.
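To make the idea concrete, here is a minimal NumPy sketch of the select-then-attend pattern. The function names, shapes, and scoring scheme are illustrative assumptions, not DSA's actual implementation: a cheap indexer scores every preceding token, keeps only the top-k indices, and the core attention then runs over just that subset, so the per-query attention cost depends on k rather than on the full context length n.

```python
import numpy as np

def lightning_indexer_select(q_idx, K_idx, k=64):
    """Score all preceding tokens with a cheap dot-product indexer
    and return the indices of the top-k highest-scoring ones."""
    scores = K_idx @ q_idx                 # (n_prev,) lightweight relevance scores
    top_k = min(k, scores.shape[0])
    # argpartition finds the top-k indices without a full O(n log n) sort
    return np.argpartition(scores, -top_k)[-top_k:]

def sparse_attention(q, K, V, selected):
    """Core attention restricted to the selected subset:
    O(k) work per query instead of O(n)."""
    Ks, Vs = K[selected], V[selected]
    logits = Ks @ q / np.sqrt(q.shape[0])
    w = np.exp(logits - logits.max())      # numerically stable softmax
    w /= w.sum()
    return w @ Vs
```

In a real DSA layer the indexer uses learned projections rather than the raw attention keys, but the computational shape is the same: linear-time scoring followed by attention over a fixed-size subset.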
But the researchers identified a lingering flaw: the DSA indexer itself still operates at quadratic complexity at every single layer. Though the indexer is computationally cheaper than the main attention process, as context lengths grow, the time the model spends running these indexers skyrockets. This severely slows down the model, especially during the initial "prefill" stage, when the prompt is first processed.
Caching attention with IndexCache
To resolve the indexer bottleneck, the research team discovered a crucial property of how DSA models process data: the subset of important tokens an indexer selects remains remarkably stable as data moves through consecutive transformer layers. Empirical tests on DSA models revealed that adjacent layers share between 70% and 100% of their chosen tokens.
To capitalize on this cross-layer redundancy, the researchers developed IndexCache. The technique partitions the model's layers into two categories. A small number of full (F) layers retain their indexers, actively scoring the tokens and caching the indices of the most important ones. The remaining layers become shared (S) layers, performing no indexing and reusing the cached indices from the nearest preceding F layer.
During inference, the model simply checks the layer type. If it reaches an F layer, it computes and caches fresh indices; if it is an S layer, it skips the computation and copies the cached indices.
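The per-layer dispatch described above can be sketched in a few lines of Python. The class and method names here are illustrative assumptions, not the authors' API; the point is only the control flow: F layers recompute and overwrite the cache, S layers reuse whatever the nearest preceding F layer stored.

```python
class IndexCacheRunner:
    """Toy dispatcher for IndexCache-style layer sharing.

    layer_types is a list like ["F", "S", "S", "F", "S"], where "F" layers
    run the lightning indexer and "S" layers reuse the cached indices.
    """

    def __init__(self, layer_types):
        assert layer_types[0] == "F", "first layer must be F so S layers have a cache to copy"
        self.layer_types = layer_types
        self.cached_indices = None

    def indices_for_layer(self, layer_id, run_indexer):
        if self.layer_types[layer_id] == "F":
            # Full layer: run the (expensive) indexer and cache its selection
            self.cached_indices = run_indexer(layer_id)
        # Shared layer: skip the indexer entirely and reuse the cache
        return self.cached_indices
```

With 75% of layers marked S, three out of four indexer passes disappear, which is exactly where the reported compute savings come from.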
A variety of optimization methods try to address the attention bottleneck by compressing the KV cache, where computed attention values are stored. Instead of shrinking the memory footprint like standard KV cache compression, IndexCache attacks the compute bottleneck.
"IndexCache is not a standard KV cache compression or sharing technique," Yushi Bai, co-author of the paper, told VentureBeat. "It eliminates this redundancy by reusing indices across layers, thereby reducing computation rather than just memory footprint. It is complementary to existing approaches and can be combined with them."
The researchers developed two deployment approaches for IndexCache. (It is worth noting that IndexCache only applies to models that use the DSA architecture, such as the latest DeepSeek models and the latest family of GLM models.)
For developers working with off-the-shelf DSA models, where retraining is unfeasible or too costly, they created a training-free method relying on a "greedy layer selection" algorithm. By running a small calibration dataset through the model, this algorithm automatically determines the optimal placement of F and S layers without any weight updates. Empirical evidence shows that the greedy algorithm can safely remove 75% of the indexers while matching the downstream performance of the original model.
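The paper's exact greedy criterion is not detailed here, but a plausible sketch of such an algorithm, under our own assumptions about the stopping rule and the `evaluate` interface, is: start with every layer as F, repeatedly demote the F layer whose removal hurts calibration quality the least, and stop when either the target indexer budget is reached or quality degrades beyond a tolerance. Layer 0 is never demoted, since an S layer needs a preceding F layer to copy indices from.

```python
def greedy_layer_selection(n_layers, evaluate, keep_ratio=0.25, tol=0.01):
    """Hypothetical greedy search for an F/S layer assignment.

    evaluate(types) -> calibration-set quality score (higher is better)
    for a given assignment; its exact form is an assumption, not the
    paper's published procedure.
    """
    types = ["F"] * n_layers
    baseline = evaluate(types)
    target_f = max(1, int(n_layers * keep_ratio))
    while types.count("F") > target_f:
        best, best_q = None, -float("inf")
        # Try demoting each remaining F layer except layer 0
        for i in range(1, n_layers):
            if types[i] != "F":
                continue
            trial = types[:]
            trial[i] = "S"
            q = evaluate(trial)
            if q > best_q:
                best, best_q = i, q
        if baseline - best_q > tol:
            break  # further sharing would degrade quality too much
        types[best] = "S"
    return types
```

Because `evaluate` only requires forward passes over a small calibration set, the whole search runs without touching model weights, matching the article's description of a training-free method.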
For teams pre-training or heavily fine-tuning their own foundation models, the researchers propose a training-aware version that optimizes the network parameters to natively support cross-layer sharing. This approach introduces a "multi-layer distillation loss" during training. It forces each retained indexer to learn to select a consensus subset of tokens that will be highly relevant for all the subsequent layers it serves.
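One plausible form for such a loss, offered here as a hedged reconstruction rather than the paper's published formula, is a divergence term pulling each retained indexer's selection distribution toward those of the shared layers it serves:

```latex
% Hedged reconstruction (notation is ours): p_f is the softmax over the
% retained (F) indexer's token scores, p_l the corresponding distribution
% of the original indexer at shared layer l, and S(f) the set of shared
% layers that reuse f's cached indices.
\mathcal{L}_{\text{distill}} \;=\; \sum_{f \in F} \; \sum_{l \in S(f)} \mathrm{KL}\!\left( p_l \,\Vert\, p_f \right)
```

Minimizing a term of this shape would push each surviving indexer toward the "consensus subset" behavior the article describes: its selections must remain useful not just for its own layer, but for every downstream layer that reuses them.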
Real-world speedups on production models
To test the impact of IndexCache, the researchers applied it to the 30-billion-parameter GLM-4.7 Flash model and compared it against the standard baseline.
At a 200K context length, removing 75% of the indexers cut prefill latency from 19.5 seconds down to just 10.7 seconds, a 1.82x speedup. The researchers note these speedups are expected to grow at even longer contexts.
During the decoding phase, when the model generates its response, IndexCache boosted per-request throughput from 58 tokens per second to 86 tokens per second at the 200K context mark, a 1.48x speedup. When the server's memory is fully saturated with requests, total decode throughput jumped by up to 51%.
For enterprise teams, these efficiency gains translate directly into cost savings. "In terms of ROI, IndexCache provides consistent benefits across scenarios, but the gains are most noticeable in long-context workloads such as RAG, document analysis, and agentic pipelines," Bai said. "In these cases, we observe at least an approximate 20% reduction in deployment cost and comparable improvements in user-perceived latency." He added that for very short-context tasks, the benefits hover around 5%.
Remarkably, these efficiency gains did not compromise reasoning capabilities. Using the training-free approach to remove 75% of indexers, the 30B model matched the original baseline's average score on long-context benchmarks, scoring 49.9 against the original's 50.2. On the highly challenging AIME 2025 math reasoning benchmark, the optimized model actually outperformed the original baseline, scoring 92.6 compared to 91.0.
The team also ran preliminary experiments on the production-scale 744-billion-parameter GLM-5 model. They found that eliminating 75% of its indexers with the training-free method yielded at least a 1.3x speedup on contexts over 100K tokens. At the same time, the model maintained a virtually identical quality average on long-context tasks.
Getting IndexCache into production
For development teams looking to implement the training-free approach today, the process is straightforward but requires careful setup. While the greedy search algorithm automatically finds the optimal layer configuration, the quality of that configuration depends on the data it processes.
"We recommend using domain-specific data as a calibration set so that the discovered layer-sharing pattern aligns with real workloads," Bai said.
Once calibrated, the optimization is highly accessible for production environments. Open-source patches are already available on GitHub for major serving engines. "Integration is relatively simple; developers can apply the patch to existing inference stacks, such as vLLM or SGLang, and enable IndexCache with minimal configuration changes," Bai said.
While IndexCache provides an immediate fix for today's compute bottlenecks, its underlying philosophy points to a broader shift in how the AI industry will approach model design.
"Future foundation models will be architected with downstream inference constraints in mind from the start," Bai concluded. "That means designs that are not only scalable in terms of model size, but also optimized for real-world throughput and latency, rather than treating these as post-hoc concerns."