Qwen3-Max Considering beats Gemini 3 Professional and GPT-5.2 on Humanity’s Final Examination (with search)


Chinese language AI and tech corporations proceed to impress with their growth of cutting-edge, state-of-the-art AI language fashions.

As we speak, the one drawing eyeballs is Alibaba Cloud’s Qwen Group of AI researchers and its unveiling of a brand new proprietary language reasoning mannequin, Qwen3-Max-Thinking.

You could recall, as VentureBeat lined final 12 months, that Qwen has made a reputation for itself in the fast-moving world AI market by delivery a wide range of highly effective, open supply fashions in varied modalities, from textual content to picture to spoken audio. The corporate even earned an endorsement from U.S. tech lodgings big Airbnb, whose CEO and co-founder Brian Chesky said the company was relying on Qwen’s free, open source models as a extra inexpensive different to U.S. choices like these of OpenAI.

Now, with the proprietary Qwen3-Max-Considering, the Qwen Group is aiming to match and, in some instances, outpace the reasoning capabilities of GPT-5.2 and Gemini 3 Professional by way of architectural effectivity and agentic autonomy.

The discharge comes at a important juncture. Western labs have largely outlined the “reasoning” class (usually dubbed “System 2” logic), however Qwen’s newest benchmarks counsel the hole has closed.

As well as, the firm’s comparatively inexpensive API pricing strategy aggressively targets enterprise adoption. Nevertheless, because it is a Chinese language mannequin, some U.S. corporations with strict nationwide safety necessities and concerns could also be cautious of adopting it.

The Structure: “Take a look at-Time Scaling” Redefined

The core innovation driving Qwen3-Max-Considering is a departure from customary inference strategies. Whereas most fashions generate tokens linearly, Qwen3 makes use of a “heavy mode” pushed by a method often known as “Take a look at-time scaling.”

In easy phrases, this method permits the mannequin to commerce compute for intelligence. However not like naive “best-of-N” sampling—the place a mannequin may generate 100 solutions and choose the greatest one — Qwen3-Max-Considering employs an experience-cumulative, multi-round technique.

This strategy mimics human problem-solving. When the mannequin encounters a fancy question, it would not simply guess; it engages in iterative self-reflection. It makes use of a proprietary “take-experience” mechanism to distill insights from earlier reasoning steps. This permits the mannequin to:

  1. Determine Useless Ends: Acknowledge when a line of reasoning is failing while not having to totally traverse it.

  2. Focus Compute: Redirect processing energy towards “unresolved uncertainties” moderately than re-deriving identified conclusions.

The effectivity positive aspects are tangible. By avoiding redundant reasoning, the mannequin integrates richer historic context into the identical window. The Qwen workforce studies that this methodology drove huge efficiency jumps with out exploding token prices:

Past Pure Thought: Adaptive Tooling

Whereas “pondering” fashions are highly effective, they’ve traditionally been siloed — nice at math, however poor at shopping the internet or working code. Qwen3-Max-Considering bridges this hole by successfully integrating “pondering and non-thinking modes”.

The mannequin options adaptive tool-use capabilities, that means it autonomously selects the proper device for the job with out handbook consumer prompting. It will possibly seamlessly toggle between:

  • Internet Search & Extraction: For real-time factual queries.

  • Reminiscence: To retailer and recall user-specific context.

  • Code Interpreter: To jot down and execute Python snippets for computational duties.

In “Considering Mode,” the mannequin helps these instruments concurrently. This functionality is important for enterprise functions the place a mannequin may want to verify a reality (Search), calculate a projection (Code Interpreter), after which cause about the strategic implication (Considering) multi function flip.

Empirically, the workforce notes that this mix “successfully mitigates hallucinations,” as the mannequin can floor its reasoning in verifiable external knowledge moderately than relying solely on its coaching weights.

Benchmark Evaluation: The Knowledge Story

Qwen is not shy about direct comparisons.

On HMMT Feb 25, a rigorous reasoning benchmark, Qwen3-Max-Considering scored 98.0, edging out Gemini 3 Professional (97.5) and considerably main DeepSeek V3.2 (92.5).

Nevertheless, the most vital sign for builders is arguably Agentic Search. On “Humanity’s Final Examination” (HLE) — the benchmark that measures efficiency on 3,000 “Google-proof” graduate-level questions throughout math, science, laptop science, humanities and engineering — Qwen3-Max-Considering, geared up with internet search instruments, scored 49.8, beating each Gemini 3 Professional (45.8) and GPT-5.2-Considering (45.5) .

Qwen3-Max key benchmarks

Qwen3-Max key benchmarks. Credit score: Alibaba Cloud Qwen Group on X

This means that Qwen3-Max-Considering’s structure is uniquely suited to complicated, multi-step agentic workflows the place external knowledge retrieval is vital.

In coding duties, the mannequin additionally shines. On Enviornment-Arduous v2, it posted a rating of 90.2, leaving opponents like Claude-Opus-4.5 (76.7) far behind.

The Economics of Reasoning: Pricing Breakdown

For the first time, we have now a transparent take a look at the economics of Qwen’s top-tier reasoning mannequin. Alibaba Cloud has positioned qwen3-max-2026-01-23 as a premium however accessible providing on its API.

On a base stage, this is how Qwen3-Max-Considering stacks up:

Mannequin

Enter (/1M)

Output (/1M)

Whole Price

Supply

Qwen 3 Turbo

$0.05

$0.20

$0.25

Alibaba Cloud

Grok 4.1 Quick (reasoning)

$0.20

$0.50

$0.70

xAI

Grok 4.1 Quick (non-reasoning)

$0.20

$0.50

$0.70

xAI

deepseek-chat (V3.2-Exp)

$0.28

$0.42

$0.70

DeepSeek

deepseek-reasoner (V3.2-Exp)

$0.28

$0.42

$0.70

DeepSeek

Qwen 3 Plus

$0.40

$1.20

$1.60

Alibaba Cloud

ERNIE 5.0

$0.85

$3.40

$4.25

Qianfan

Gemini 3 Flash Preview

$0.50

$3.00

$3.50

Google

Claude Haiku 4.5

$1.00

$5.00

$6.00

Anthropic

Qwen3-Max Considering (2026-01-23)

$1.20

$6.00

$7.20

Alibaba Cloud

Gemini 3 Professional (≤200K)

$2.00

$12.00

$14.00

Google

GPT-5.2

$1.75

$14.00

$15.75

OpenAI

Claude Sonnet 4.5

$3.00

$15.00

$18.00

Anthropic

Gemini 3 Professional (>200K)

$4.00

$18.00

$22.00

Google

Claude Opus 4.5

$5.00

$25.00

$30.00

Anthropic

GPT-5.2 Professional

$21.00

$168.00

$189.00

OpenAI

This pricing construction is aggressive, undercutting many legacy flagship fashions whereas providing state-of-the-art efficiency.

Nevertheless, builders ought to notice the granular pricing for the new agentic capabilities, as Qwen separates the value of “pondering” (tokens) from the value of “doing” (device use).

  • Agent Search Technique: Each customary search_strategy:agent and the extra superior search_strategy:agent_max are priced at $10 per 1,000 calls.

  • Internet Search: Priced at $10 per 1,000 calls by way of the Responses API.

Promotional Free Tier:To encourage adoption of its most superior options, Alibaba Cloud is at present providing two key instruments at no cost for a restricted time:

This pricing mannequin (low token value + à la carte device pricing) permits builders to construct complicated brokers that are cost-effective for textual content processing, whereas paying a premium solely when external actions—like a stay internet search—are explicitly triggered.

Developer Ecosystem

Recognizing that efficiency is ineffective with out integration, Alibaba Cloud has ensured Qwen3-Max-Considering is drop-in prepared.

  • OpenAI Compatibility: The API helps the customary OpenAI format, permitting groups to change fashions by merely altering the base_url and mannequin title.

  • Anthropic Compatibility: In a savvy transfer to seize the coding market, the API additionally helps the Anthropic protocol. This makes Qwen3-Max-Considering suitable with Claude Code, a well-liked agentic coding setting.

The Verdict

Qwen3-Max-Considering represents a maturation of the AI market in 2026. It strikes the dialog past “who has the smartest chatbot” to “who has the most succesful agent.”

By combining high-efficiency reasoning with adaptive, autonomous device use—and pricing it to transfer—Qwen has firmly established itself as a top-tier contender for the enterprise AI throne.

For builders and enterprises, the “Restricted Time Free” home windows on Code Interpreter and Internet Extractor counsel now is the time to experiment. The reasoning wars are far from over, however Qwen has simply deployed a really heavy hitter.




Disclaimer: This article is sourced from external platforms. OverBeta has not independently verified the information. Readers are advised to verify details before relying on them.

0
Show Comments (0) Hide Comments (0)
0 0 votes
Article Rating
Subscribe
Notify of
guest
0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments

Stay Updated!

Subscribe to get the latest blog posts, news, and updates delivered straight to your inbox.