Goodbye, Llama? Meta launches new proprietary AI model Muse Spark, the first since Superintelligence Labs’ formation


Meta has been one of the most fascinating companies of the generative AI era. It initially gained a large and loyal following of users with the launch of its mostly open source Llama family of large language models (LLMs) beginning in early 2023, but came to a screeching halt last year after Llama 4 debuted to mixed reviews and, eventually, admissions of gaming benchmarks.

That bumpy rollout of Llama 4 apparently spurred Meta founder and CEO Mark Zuckerberg to completely overhaul Meta’s AI operations in the summer of 2025, forming a new internal division, Meta Superintelligence Labs (MSL), which he recruited 29-year-old former Scale AI co-founder and CEO Alexandr Wang to lead as Chief AI Officer.

Today, Meta is showing us the fruits of that effort: Muse Spark, a new proprietary model that Wang says (posting on rival social network X, used more often by the machine learning community) is “the strongest model that meta has launched,” and has “support for tool-use, visual chain of thought, & multi-agent orchestration.” He also says it will be the start of a new Muse family of models, raising questions about what will become of Meta’s popular lineup and ongoing development of the Llama family.

It arrives not as a generic chatbot, but as the foundation for what Wang calls “personal superintelligence”: an AI that doesn’t just process text but “sees and understands the world around you” to act as a digital extension of the self, echoing Zuckerberg’s public manifesto for a vision of personal superintelligence published in summer 2025.

However, it is proprietary only, confined for now to the Meta AI app and website, as well as a “private API preview to select users,” according to Meta’s blog post announcing it. The move is likely to rankle the literally billions of users of Llama models and the thousands of developers who relied upon them (some of whom are active participants in rival social network Reddit’s r/LocalLLaMA subreddit). In addition, no pricing information for the model has yet been announced.

It is unclear if Meta has ended development on the Llama family entirely. When asked directly by VentureBeat, a Meta spokesperson said in an email: “Our current Llama models will continue to be available as open source,” which does not address the question of development of future Llama models.

Visual chain-of-thought

At its core, Muse Spark is a natively multimodal reasoning model. Unlike earlier iterations that “stitched” vision and text together, Muse Spark was rebuilt from the ground up to integrate visual information across its internal logic. This architectural shift enables “visual chain of thought,” allowing the model to annotate dynamic environments, such as identifying the components of a complex espresso machine or correcting a user’s yoga form via side-by-side video analysis.

The most significant technical leap, however, is a new “Considering” mode. This feature orchestrates multiple sub-agents to reason in parallel, allowing Meta to compete with extreme reasoning models like Google’s Gemini Deep Think and OpenAI’s GPT-5.4 Pro.

In benchmarks, this mode achieved 58% on “Humanity’s Last Exam” and 38% on “FrontierScience Research,” figures that Meta claims validate its new scaling trajectory.

Perhaps more impressive for the company’s bottom line is the model’s efficiency. Meta reports that Muse Spark achieves its reasoning capabilities using over an order of magnitude less compute than Llama 4 Maverick, its previous mid-size flagship. This efficiency is driven by a process called “thought compression”. During reinforcement learning, the model is penalized for excessive “thinking time,” forcing it to solve complex problems with fewer reasoning tokens without sacrificing accuracy.
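Meta has not published the reward formula behind this penalty, so the following is only a minimal sketch of how a length penalty of this kind could work in principle. The function name, the token budget, and the per-token penalty are all hypothetical values chosen for illustration, not Meta's actual method.

```python
def length_penalized_reward(
    is_correct: bool,
    reasoning_tokens: int,
    token_budget: int = 2000,          # hypothetical budget, not a Meta figure
    penalty_per_token: float = 0.0001,  # hypothetical penalty rate
) -> float:
    """Illustrative reward: 1.0 for a correct answer, minus a small
    penalty for every reasoning token spent beyond the budget."""
    base = 1.0 if is_correct else 0.0
    overage = max(0, reasoning_tokens - token_budget)
    return base - penalty_per_token * overage


# A correct answer reached within budget keeps the full reward.
concise = length_penalized_reward(True, reasoning_tokens=1500)
# A correct answer that "thinks" for twice the budget is rewarded less.
verbose = length_penalized_reward(True, reasoning_tokens=4000)
# A wrong answer earns nothing, however short the reasoning.
wrong = length_penalized_reward(False, reasoning_tokens=1500)

print(concise, verbose, wrong)
```

Under a reward shaped like this, two correct answers are no longer equal: the shorter chain of reasoning scores higher, which is the pressure the article describes, while a wrong answer still scores below both.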

Benchmarks reveal a return to form

The launch of Muse Spark is framed as a statistical “quantum leap,” ending Meta’s year-long absence from the absolute frontier of AI performance.

Meta Muse Spark benchmark chart. Credit: Meta

By reconciling Meta’s official internal data with independent auditing from third-party LLM tracking firm Artificial Analysis, a clear picture emerges: Muse Spark is not just a marginal improvement over the Llama series; it is a fundamental re-entry into the “Top 5” global models.

Artificial Analysis Intelligence Index graph with Meta Muse Spark. Credit: Artificial Analysis/X

According to the Artificial Analysis Intelligence Index v4.0, Muse Spark achieved a score of 52. For context, Meta’s previous flagship, Llama 4 Maverick, debuted in 2025 with an Index score of just 18.

By nearly tripling that score, Muse Spark now sits within striking distance of the industry’s most elite systems, trailing only Gemini 3.1 Pro Preview (57), GPT-5.4 (57), and Claude Opus 4.6 (53).
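As a quick sanity check on "nearly tripling", the arithmetic can be run directly on the Index scores quoted above. The variable names below are purely illustrative.

```python
# Artificial Analysis Intelligence Index v4.0 scores as quoted in this article.
scores = {
    "Llama 4 Maverick": 18,
    "Muse Spark": 52,
    "Claude Opus 4.6": 53,
    "GPT-5.4": 57,
    "Gemini 3.1 Pro Preview": 57,
}

# How many times higher Muse Spark scores than its predecessor.
improvement = scores["Muse Spark"] / scores["Llama 4 Maverick"]
# How many Index points separate Muse Spark from the top score.
gap_to_leader = max(scores.values()) - scores["Muse Spark"]

print(f"Muse Spark vs Llama 4 Maverick: {improvement:.2f}x")  # 2.89x
print(f"Points behind the top score: {gap_to_leader}")        # 5
```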

Meta’s official benchmarks suggest that Muse Spark is particularly dominant in multimodal reasoning, especially where visual figures and logic intersect.

  • CharXiv Reasoning: In “figure understanding,” Muse Spark achieved a score of 86.4, significantly outperforming Claude Opus 4.6 (65.3), Gemini 3.1 Pro (80.2), and GPT-5.4 (82.8).

  • MMMU Pro: Official reports place the model at 80.4, while Artificial Analysis’s independent audit measured it at 80.5%. This makes it the second-most capable vision model on the market, surpassed only by Gemini 3.1 Pro Preview (83.9% official; 82.4% independent).

  • Visual Factuality (SimpleVQA): Muse Spark scored 71.3, placing it ahead of GPT-5.4 (61.1) and Grok 4.2 (57.4), though it narrowly trails Gemini 3.1 Pro (72.4).

These scores validate Meta’s focus on “visual chain of thought,” enabling the model to not just recognize objects, but to reason through complex spatial problems and dynamic annotations.

The “Considering” gear of Muse Spark was put to the test against specialized benchmarks designed to break non-reasoning models.

  • Humanity’s Last Exam (HLE): On this multidisciplinary evaluation, Meta reports a score of 42.8 (No Tools) and 50.4 (With Tools). Independent audits by Artificial Analysis tracked the model at 39.9%, trailing Gemini 3.1 Pro Preview (44.7%) and GPT-5.4 (41.6%).

  • GPQA Diamond (PhD-Level Reasoning): Muse Spark achieved an impressive 89.5, surpassing Grok 4.2 (88.5) but trailing the specialized “max reasoning” outputs of Opus 4.6 (92.7) and Gemini 3.1 Pro (94.3).

  • ARC AGI 2: This remains a notable weak point. Muse Spark scored 42.5 on these abstract reasoning puzzles, far behind Gemini 3.1 Pro (76.5) and GPT-5.4 (76.1).

  • CritPT (Physics Research): Independent auditing found Muse Spark achieved the fifth-highest score at 11%. That puts it ahead of Gemini 3 Flash (9%) and well ahead of Claude 4.6 Sonnet (3%).

One of the most striking results from the official data is Muse Spark’s performance in the health sector, likely a result of Meta’s collaboration with over 1,000 physicians.

  • HealthBench Hard: Muse Spark achieved 42.8, a large lead over Claude Opus 4.6 (14.8) and Gemini 3.1 Pro (20.6), and ahead of even GPT-5.4 (40.1).

  • MedXpertQA (Multimodal): It scored 78.4, comfortably ahead of Opus 4.6 (64.8) and Grok 4.2 (65.8), though it still trails Gemini 3.1 Pro’s top-tier score of 81.3.

Agentic Techniques and Efficiency: The “Thought Compression” Effect

While Muse Spark excels at reasoning, its “agentic” performance (executing real-world work tasks) presents a more nuanced picture.

  • SWE-Bench Verified: Muse Spark scored 77.4, trailing Claude Opus 4.6 (80.8) and Gemini 3.1 Pro (80.6).

  • GDPval-AA Elo: Meta’s official score of 1444 differs slightly from Artificial Analysis’s recorded 1427. In both cases, Muse Spark trails GPT-5.4 (1672) and Opus 4.6 (1606), suggesting that while the model “thinks” well, it is still refining its ability to “act” in long-horizon software and office workflows.

  • Token Efficiency: This is where Muse Spark distinguishes itself. To run the Intelligence Index, it used 58 million output tokens. In contrast, Claude Opus 4.6 required 157 million tokens and GPT-5.4 required 120 million. This supports Meta’s claim of “thought compression”: delivering frontier-class intelligence while using less than half the “thinking time” of its closest competitors.
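The "less than half" claim in the token efficiency figures above is easy to check. The short sketch below uses only the token counts quoted in this article and treats output tokens as a rough proxy for "thinking time", which is an assumption on our part rather than a definition from Meta or Artificial Analysis.

```python
# Output tokens (in millions) used to run the Intelligence Index,
# as quoted in this article.
output_tokens_millions = {
    "Muse Spark": 58,
    "GPT-5.4": 120,
    "Claude Opus 4.6": 157,
}

muse = output_tokens_millions["Muse Spark"]

# Fraction of each competitor's token usage that Muse Spark needed.
ratios = {
    name: muse / tokens
    for name, tokens in output_tokens_millions.items()
    if name != "Muse Spark"
}

for name, ratio in ratios.items():
    print(f"Muse Spark used {ratio:.0%} of the output tokens {name} needed")
```

On these numbers Muse Spark used roughly 48% of GPT-5.4's output tokens and 37% of Claude Opus 4.6's, so the claim holds against both, though only narrowly in the first case.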

Benchmark                | Llama 4 Maverick (2025) | Muse Spark (Official) | Gemini 3.1 Pro (Official)
Intelligence Index Score | 18                      | 52                    | 57
MMMU Pro                 | -                       | 80.4                  | 83.9
CharXiv Reasoning        | -                       | 86.4                  | 80.2
HealthBench Hard         | -                       | 42.8                  | 20.6
License                  | Open-Weights            | Proprietary           | Proprietary

With Muse Spark, Meta has successfully transitioned from being the “LAMP stack for AI” to a direct challenger for the title of “Personal Superintelligence”. While agentic workflows remain a hurdle, its dominance in vision, health, and token efficiency places Meta back at the center of the frontier race.

Personal wellness and Instagram shopping

Meta is immediately deploying Muse Spark to power specialized experiences across its app family.

  • Shopping Mode: A new feature that leverages Meta’s vast creator ecosystem. The AI picks up on brands, styling choices, and content across Instagram and Threads to provide personalized recommendations, effectively turning every post into a shoppable interaction.

  • Health Reasoning: In a move toward medical utility, Meta collaborated with over 1,000 physicians to curate training data. Muse Spark can now analyze nutritional content from photos of food or provide “health scores” tailored to a user’s situation, such as a pescatarian diet for someone with high cholesterol.

  • Interactive UI: The model can generate web-based minigames or tutorials on the fly. For example, a user can prompt the AI to turn a photo into a playable Sudoku game or a highlights-based tutorial for home appliances.

Evaluation awareness

While Muse Spark demonstrates strong refusal behaviors regarding biological and chemical weapons, its safety profile includes a startling new discovery. Third-party testing by Apollo Research found that the model possesses a high degree of “evaluation awareness”.

The model frequently recognized when it was being tested in “alignment traps” and reasoned that it should behave honestly specifically because it was under evaluation.

While Meta concluded this was not a “blocking concern” for release, the finding suggests that frontier models are becoming increasingly “aware” of the testing environment, potentially rendering traditional safety benchmarks less reliable as models learn to “game” the exam.

What happens to Llama?

In February 2023, Meta released Llama 1 to demonstrate that smaller, compute-optimal models could match larger counterparts like GPT-3 in efficiency. Although access was initially restricted to researchers, the model weights were leaked via 4chan on March 3, 2023, an event that inadvertently democratized top-tier research and catalyzed a global movement for running models on consumer-grade hardware.

This shift was solidified in July 2023 with the release of Llama 2, which introduced a commercial license that permitted self-hosting for most organizations. This approach saw rapid adoption, with the Llama family exceeding 100 million downloads and supporting over 1,000 commercial applications by the third quarter of 2023.

Through 2024 and 2025, Meta scaled the Llama family to establish it as essential infrastructure for global enterprise AI, frequently referred to as the LAMP stack for AI. Following the release of Llama 3 in April 2024 and the landmark Llama 3.1 405B in July, Meta achieved performance parity with the world’s leading proprietary systems.

The subsequent release of Llama 4 in April 2025 introduced a Mixture-of-Experts architecture, allowing for massive parameter scaling while maintaining fast inference speeds. By early 2026, the Llama ecosystem reached a staggering scale, totaling 1.2 billion downloads and averaging roughly one million downloads per day.

This widespread adoption provided businesses with significant economic sovereignty, as self-hosting Llama models offered an 88% cost reduction compared to using proprietary API providers.

As of April 2026, Meta’s role as the undisputed leader of the open-weight movement has given way to a highly contested multi-polar landscape characterized by the rise of international competitors.

While the United States accounts for 35% of global Llama deployments, Chinese models from labs like Alibaba and DeepSeek began accounting for 41% of downloads on platforms like Hugging Face by late 2025. Throughout early 2026, new entrants such as Zhipu AI’s GLM-5 and Alibaba’s Qwen 3.6 Plus have outpaced Llama 4 Maverick on general knowledge and coding benchmarks.

In response to this global pressure, Meta’s Muse Spark arrives with hefty expectations and an open source legacy that will be tough to live up to.

Proprietary only (for now)

The launch marks a controversial departure from Meta AI’s “open science” roots. While the Llama series was famously accessible to developers, Muse Spark is launching as a proprietary model.

Wang addressed the shift on X, stating: “9 months ago we rebuilt our ai stack from scratch. New infrastructure, new architecture, new data pipelines… This is the first step. Bigger models are already in development with plans to open-source future versions.”

However, the developer community remains skeptical. Some see this as a necessary pivot after the Llama 4 series failed to gain expected developer traction; others view it as Meta “closing the gates” now that it has a competitive reasoning model.

Wang himself acknowledged the transition’s difficulty, noting there are “rough edges we’ll polish over time”.

For the 3 billion people using Meta’s apps, the change will be felt almost immediately. The AI they interact with is no longer just a library of information, but an agent with a $27 billion brain and a mandate to understand their world as intimately as they do.




Disclaimer: This article is sourced from external platforms. OverBeta has not independently verified the information. Readers are advised to verify details before relying on them.
