
On Sunday, a team of nine researchers at Sina Weibo, the Chinese social media giant better known for its microblogging platform than for cutting-edge artificial intelligence, quietly posted a 14-page technical report to arXiv that sent shockwaves through the AI research community. Their claim: a language model with just 3 billion parameters can match or exceed the reasoning performance of flagship systems from Google DeepMind, OpenAI, Anthropic, and DeepSeek that are hundreds of times larger.
The model, called VibeThinker-3B, scored 94.3 on AIME 2026, the American Invitational Mathematics Examination, considered one of the most demanding standardized math competitions in the world. That figure places it alongside DeepSeek V3.2, a model with 671 billion parameters, and ahead of Gemini 3 Pro, Google’s high-performance flagship reasoning system, which scored 91.7. With a test-time scaling technique the team calls Claim-Level Reliability Assessment, the score climbs to 97.1, edging past nearly every system in the public record.
Within hours of publication, the paper had drawn 62 upvotes on Hugging Face’s daily papers feed, the model repository had collected 130 likes, and the GitHub repository had reached 685 stars. But the response on social media was not uniformly celebratory. It was, in many cases, deeply skeptical.
“WHAT THE HELL is happening in AI?” wrote the user @orcus108 on X, in a post that collected over 161,000 views. “A 3B parameter model just put up coding benchmark scores in the same league as Claude Opus 4.5… I genuinely do not know if this is a breakthrough or if the benchmarks are broken.”
That tension, between real scientific progress and the growing suspicion that AI benchmarks have become gameable to the point of meaninglessness, sits at the heart of the VibeThinker-3B story. And the answer matters enormously, not just for academic bragging rights, but for the multibillion-dollar question of whether the AI industry’s relentless push toward ever-larger models is the only path to intelligence.
Benchmark scores that defy the scaling laws of modern AI
The results reported in the technical report are, by any conventional standard, extraordinary.
On the mathematics side, VibeThinker-3B achieved 91.4 on AIME 2025, 94.3 on AIME 2026, 89.3 on HMMT 2025 (the Harvard-MIT Mathematics Tournament), 93.8 on BruMO 2025 (the Brown University Math Olympiad), and 76.4 on IMO-AnswerBench, a benchmark comprising 400 problems at the level of the International Mathematical Olympiad. In coding, it posted an 80.2 Pass@1 on LiveCodeBench v6, a benchmark designed to test executable code generation, and achieved a 96.1 percent acceptance rate on unseen LeetCode weekly and biweekly contests from late April through late May 2026. On instruction following, it scored 93.4 on IFEval.
To put the parameter disparity in perspective: DeepSeek V3.2 has 671 billion parameters, roughly 224 times the size of VibeThinker-3B. GLM-5, from Zhipu AI, has 744 billion parameters. Kimi K2.5, from Moonshot AI, exceeds 1 trillion. VibeThinker-3B’s 3 billion parameters could run on a consumer laptop.
The researchers frame this result not as an anomaly but as evidence for a broader theoretical claim. They introduce what they call the “Parametric Compression-Coverage Hypothesis,” which argues that different kinds of AI capability have fundamentally different relationships to model size. Verifiable reasoning (the kind tested by math competitions and coding challenges, where answers can be definitively checked) is what the paper calls a “parameter-dense” capability: one that can be compressed into a compact core. Open-domain knowledge, by contrast, is “parameter-expansive,” requiring broad coverage across facts, concepts, and edge cases that inherently demands more parameters.
The paper acknowledges this distinction directly. On GPQA-Diamond, a graduate-level science knowledge benchmark, VibeThinker-3B scored just 70.2, well behind the 91.9 achieved by Gemini 3 Pro and the 87.0 scored by Claude Opus 4.5. The authors write that this gap “is consistent with our claim rather than a contradiction to it: the important finding is not that a 3B model has fully replaced leading general-purpose models, but that a small model can reach first-tier performance on many verifiable reasoning tasks.”
Inside the four-stage training pipeline that powers a tiny reasoning engine
VibeThinker-3B is not built from scratch. It is post-trained on top of Qwen2.5-Coder-3B, a compact foundation model from Alibaba’s Qwen team, through what the Weibo AI researchers call the “Spectrum-to-Signal Principle,” a multi-stage pipeline first introduced in the team’s earlier VibeThinker-1.5B work in November 2025.
The training unfolds in four main phases. The first is a two-stage supervised fine-tuning process that uses curriculum learning: the model first trains on a broad mixture of math, code, STEM reasoning, general dialogue, and instruction-following data, then shifts to a curated subset of harder, longer-horizon reasoning problems. In the second stage, samples with reasoning traces shorter than 5,000 tokens are discarded, and problems that VibeThinker-1.5B can solve more than 75 percent of the time are filtered out, forcing the model to focus on genuinely difficult challenges.
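The second-stage filter described above can be pictured as a simple predicate over training samples. The sketch below is illustrative only: the two thresholds come from the article, while the `Sample` class, its field names, and the function names are hypothetical and not taken from the team's code.

```python
from dataclasses import dataclass

MIN_TRACE_TOKENS = 5_000           # threshold quoted in the article
MAX_SMALL_MODEL_SOLVE_RATE = 0.75  # threshold quoted in the article


@dataclass
class Sample:
    # Hypothetical record layout, assumed for illustration.
    problem_id: str
    trace_tokens: int              # length of the reasoning trace, in tokens
    small_model_solve_rate: float  # fraction of attempts the 1.5B model solves


def keep_for_stage_two(sample: Sample) -> bool:
    """Keep long traces on problems the smaller model does not solve reliably."""
    if sample.trace_tokens < MIN_TRACE_TOKENS:
        return False  # trace shorter than 5,000 tokens: discarded
    if sample.small_model_solve_rate > MAX_SMALL_MODEL_SOLVE_RATE:
        return False  # solved more than 75 percent of the time: filtered out
    return True


def filter_stage_two(samples):
    """Return only the samples that pass both stage-two criteria."""
    return [s for s in samples if keep_for_stage_two(s)]
```

How the report treats samples that sit exactly on either threshold is not stated in the article; this sketch keeps them.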
The second phase applies reinforcement learning across multiple domains (mathematics, code, and STEM) using the team’s MaxEnt-Guided Policy Optimization algorithm, or MGPO, which prioritizes training on problems at the model’s current capability boundary rather than problems it already solves easily or finds impossible. Notably, the team found that a strategy that worked well at the 1.5B scale, progressively expanding the context window during RL training, actually hurt performance at 3B. They hypothesize that the stronger starting checkpoint meant that truncating reasoning traces during warm-up was no longer removing noise but disrupting valid reasoning patterns. The solution was to train with a single 64,000-token context window throughout.
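The article does not give MGPO's formula, so the following is not the team's algorithm. It is a minimal sketch of one way to express the stated idea, weighting a problem most heavily when the model's empirical pass rate sits near 50 percent and least when the problem is almost always solved or almost never solved. The function names and the `sharpness` parameter are assumptions made for illustration.

```python
import math


def binary_kl_to_half(p: float, eps: float = 1e-6) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(0.5)."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return p * math.log(p / 0.5) + (1.0 - p) * math.log((1.0 - p) / 0.5)


def boundary_weight(pass_rate: float, sharpness: float = 4.0) -> float:
    """Weight in (0, 1]: equals 1 at a 50% pass rate, decays toward 0% or 100%."""
    return math.exp(-sharpness * binary_kl_to_half(pass_rate))


if __name__ == "__main__":
    for rate in (0.0, 0.1, 0.5, 0.9, 1.0):
        print(f"pass rate {rate:.1f} -> weight {boundary_weight(rate):.3f}")
```

Under this assumed weighting, problems the model solves about half the time contribute the most training signal, which matches the article's description of focusing on the capability boundary.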
Within the math RL phase, the team also introduces what it calls “Long2Short Math RL,” a secondary optimization stage that redistributes rewards to favor shorter correct solutions over longer ones, reducing verbosity without sacrificing accuracy. The technique uses a zero-sum reward redistribution that avoids biasing the overall reward signal while nudging the model toward more efficient reasoning.
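A zero-sum redistribution of this kind can be sketched as follows. This is an assumed formulation, not the one in the report: among the correct responses to a single prompt, each reward is shifted in proportion to how much shorter or longer the response is than the group mean, so the shifts cancel and the total reward is unchanged. The function name and the `strength` parameter are hypothetical.

```python
def long2short_rewards(is_correct, lengths, strength=0.2):
    """Base reward of 1.0/0.0 plus a zero-sum length shift among correct samples."""
    rewards = [1.0 if c else 0.0 for c in is_correct]
    correct_idx = [i for i, c in enumerate(is_correct) if c]
    if len(correct_idx) < 2:
        return rewards  # nothing to redistribute

    correct_lengths = [lengths[i] for i in correct_idx]
    mean_len = sum(correct_lengths) / len(correct_lengths)
    spread = max(correct_lengths) - min(correct_lengths)
    if spread == 0:
        return rewards  # all correct answers have the same length

    for i in correct_idx:
        # Positive for shorter-than-average answers, negative for longer ones.
        # Deviations from the mean sum to zero, so the total reward is preserved.
        rewards[i] += strength * (mean_len - lengths[i]) / spread
    return rewards
```

For example, with three correct answers of 1,000, 2,000, and 3,000 tokens, the shortest gains what the longest loses and the middle one is unchanged, while incorrect answers keep a reward of zero.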
The third phase extracts high-quality reasoning trajectories from the RL-trained checkpoints and distills them back into a unified model through supervised fine-tuning. The team uses a “learning-potential score” (essentially the student model’s perplexity on each teacher trajectory) to prioritize traces that are correct but that the student has not yet internalized. The final phase, called Instruct RL, applies reinforcement learning on instruction-following tasks using a combination of rule-based validators for format constraints and rubric-based reward models for open-ended quality assessment.
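Perplexity-based prioritization of this sort is straightforward to sketch. The example below assumes the student model's per-token log-probabilities for each teacher trajectory are already available; the dictionary keys and function names are hypothetical, and the report's exact scoring may differ.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def rank_by_learning_potential(trajectories):
    """Return ids of correct trajectories, highest student perplexity first.

    Each trajectory is a dict with the (assumed) keys:
      'id', 'correct' (bool), 'student_logprobs' (list of natural-log probs).
    """
    correct = [t for t in trajectories if t["correct"]]
    correct.sort(key=lambda t: perplexity(t["student_logprobs"]), reverse=True)
    return [t["id"] for t in correct]
```

A correct trace on which the student has high perplexity is one the student finds surprising, which is the sense in which the article says it has "not yet internalized" the reasoning.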
Francesco Bertolotti, an AI researcher who flagged the paper early on X, described the approach succinctly: “These results were achieved primarily through post-training refinements on Qwen2.5-Coder. The paper does not provide many details, but it seems they distill from RL ckpts and then do a final RL-based instruct RL.” His post drew over 161,000 views.
Real-world testing reveals the gap between benchmark scores and practical AI performance
For every enthusiastic response, the paper drew an equally forceful objection. The AI research community in mid-2026 has grown deeply wary of benchmark-driven claims, and VibeThinker-3B arrived in an environment primed for suspicion.
“The benchmarks are literal pattern matching single file coding,” wrote @BigMoonKR on X. “It has no relation to actual coding work. I do not know how people still do not get this.”
“Benchmaxxing,” declared @oflu_bedirhan, using a term that has become shorthand in the AI community for models that appear optimized specifically for benchmark performance at the expense of real-world utility.
The most pointed criticism came from users who actually downloaded and tested the model. “Just tried the full precision,” wrote @politilols. “It does not even know what a uv script (so the most popular Python dev tool) is. Have not seen that in a single LLM in at least a year now. Benchmaxxed.” When Bertolotti responded that the model seemed more focused on mathematical reasoning than practical coding, the user countered: “They include a livecodebench score. Zero chance that is reflective of the model.”
@Itsdotdev raised a structural criticism: “Look into the benchmarks themselves and it probably will not be so surprising. Why no DeepSWE? Why none of the standard benchmarks SOTA providers use?” The user @AvenirReym posed a more diagnostic question: “If it holds on a benchmark made after the model’s training cutoff, it is real. If it only wins on AIME-style sets that have been circulating for years, it is leakage.”
The paper’s authors appear to have anticipated these objections. The technical report states that training sets “have undergone strict benchmark decontamination,” including n-gram-based filtering to remove “n-gram overlaps with evaluation sets.”
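N-gram decontamination in general works by dropping any training example that shares a run of consecutive tokens with an evaluation item. The article does not describe the report's tokenizer, n-gram length, or matching rules, so the sketch below uses whitespace tokens, lowercase matching, and a default of 8-grams purely as assumptions; the function names are hypothetical.

```python
def ngrams(text, n):
    """Set of word-level n-grams, lowercased and split on whitespace."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(train_texts, eval_texts, n=8):
    """Drop every training text that shares at least one n-gram with any eval text."""
    eval_grams = set()
    for text in eval_texts:
        eval_grams |= ngrams(text, n)
    return [t for t in train_texts if ngrams(t, n).isdisjoint(eval_grams)]
```

A filter like this catches verbatim or near-verbatim copies of benchmark problems; it does not, by itself, catch paraphrased or translated versions, which is one reason critics in the thread asked for results on benchmarks created after the training cutoff.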
The LeetCode contest evaluation, which covers contests from April 25 to May 31, 2026 (dates that postdate any plausible training data cutoff), represents the most robust guard against data contamination concerns. On these contests, VibeThinker-3B passed 123 out of 128 first-attempt submissions, a 96.1 percent rate that exceeded GPT-5.2, Doubao Seed 2.0 Pro, Kimi K2.5, and Claude Opus 4.6 under identical evaluation conditions.
Still, real-world user reports suggest a significant gap between benchmark performance and practical utility, a phenomenon that has become familiar across the industry. “In LM Studio it only responds well to first question, subsequent questions respond to the first question,” reported @luismolinaab.
Why a social media company may have found a crack in the scaling hypothesis
Even the sharpest critics acknowledged that achieving these benchmark numbers at 3 billion parameters, regardless of how transferable they are to production use cases, is a major engineering achievement. “Even if it is benchmaxxing doing so with 3B parameters is interesting, goes to show how fast this field is progressing,” wrote @rohityin.
The observation cuts to a question that has consumed the AI industry since the advent of the scaling hypothesis: Is bigger always better? The conventional wisdom, articulated most famously in the Chinchilla scaling laws and reinforced by the commercial dominance of ever-larger foundation models, holds that more parameters and more training data reliably yield better performance. The economic corollary is stark: training and deploying frontier models costs tens or hundreds of millions of dollars, creating enormous barriers to entry.
VibeThinker-3B challenges that consensus, but only partially. The paper is careful to draw a boundary around its claims, distinguishing between tasks with “clear verification signals” and those that require broad factual knowledge. The Parametric Compression-Coverage Hypothesis explicitly argues that small models cannot replace large ones across the board.
“The true significance of VibeThinker-3B does not lie in proving that a 3B model can replace large-scale generalists,” the paper states, “but rather in providing a concrete empirical signal: the development of compact models is no longer merely a passive compromise for deployment efficiency or cost control; it emerges as a promising research trajectory that is fundamentally complementary to the conventional parameter scaling paradigm.”
Perhaps the most surprising element of the work is its provenance. Sina Weibo, publicly traded on Nasdaq and in Hong Kong, with a market capitalization that fluctuates in the single-digit billions, is not a company typically associated with frontier AI research. Yet the VibeThinker series is Weibo’s second major open-source AI contribution in seven months.
VibeThinker-1.5B, released in November 2025, demonstrated that a model with just 1.5 billion parameters could outperform the original DeepSeek R1 on several math benchmarks, a result the team achieved for what it claimed was a post-training cost of just $7,800, compared to the $294,000 estimated for DeepSeek R1.
The research team is compact: nine authors, all listed as Sina Weibo Inc. employees. The model is released under the MIT License, one of the most permissive open-source licenses available, and the weights are freely downloadable from both Hugging Face and ModelScope. Within the first day of release, community members had already created GGUF quantizations and derivative models.
Small models, big implications, and the question the AI industry can no longer avoid
The most honest assessment of VibeThinker-3B may be that it is simultaneously less and more than what the benchmarks suggest. Less, because a model that struggles with basic knowledge of common developer tools is unlikely to replace any production-grade coding assistant anytime soon. More, because the underlying insight (that reasoning ability and factual knowledge are partially decoupled, and that the former can be compressed far more aggressively than previously assumed) has profound implications for how the industry thinks about model design, deployment economics, and the accessibility of advanced AI capabilities.
If the Parametric Compression-Coverage Hypothesis holds, it suggests a future in which small, specialized reasoning engines operate alongside large knowledge-rich models in hybrid architectures: a vision where a 3-billion-parameter model handles the logical heavy lifting while a larger system supplies the factual grounding. Such an architecture could dramatically reduce the cost of deploying AI reasoning capabilities, potentially bringing competition-level mathematical and coding performance to devices with modest hardware.
“The interesting part is that we’re starting to separate knowledge from reasoning,” wrote @RealLambdaFlux on X. “A small model with strong post-training can punch way above its size on tasks with clear feedback.”
@cmitsakis suggested the practical endgame: “I think small models are the future for agents because they can use tools to get the knowledge and they can run fast and cheap.”
Whether that future arrives through VibeThinker-3B specifically, or through the dozens of teams now racing to reproduce and extend these results, the paper has already accomplished something that no benchmark score can fully capture.
It has forced the AI community to confront an uncomfortable possibility: that for years, the industry may have been spending billions of dollars scaling up parameters to improve a kind of intelligence that could have fit, all along, on a laptop. The weights are public. The code is open. And the most important test is not on any leaderboard: it is whether anyone can make a model this small actually useful in the real world.