
There is no shortage of AI benchmarks on the market today, with popular options like Humanity’s Last Exam (HLE), ARC-AGI-2 and GDPval, among numerous others.
AI agents excel at solving abstract math problems and passing the PhD-level exams that most benchmarks are based on, but Databricks has a question for the enterprise: Can they actually handle the document-heavy work most enterprises need them to do?
The answer, according to new research from the data and AI platform company, is sobering. Even the best-performing AI agents achieve less than 45% accuracy on tasks that mirror real enterprise workloads, exposing a critical gap between academic benchmarks and business reality.
“If we focus our research efforts on getting better at [existing benchmarks], then we’re probably not solving the right problems to make Databricks a better platform,” Erich Elsen, principal research scientist at Databricks, explained to VentureBeat. “So that’s why we were looking around. How do we create a benchmark that, if we get better at it, we’re actually getting better at solving the problems that our customers have?”
The result is OfficeQA, a benchmark designed to test AI agents on grounded reasoning: Answering questions based on complex proprietary datasets containing unstructured documents and tabular data. Unlike existing benchmarks that focus on abstract capabilities, OfficeQA proxies for the economically valuable tasks enterprises actually perform.
Why academic benchmarks miss the enterprise mark
Popular AI benchmarks have numerous shortcomings from an enterprise perspective, according to Elsen.
HLE features questions requiring PhD-level expertise across diverse fields. ARC-AGI evaluates abstract reasoning through visual manipulation of colored grids. Both push the frontiers of AI capabilities, but do not reflect daily enterprise work. Even GDPval, which was specifically created to evaluate economically useful tasks, misses the target.
“We come from a pretty heavy science or engineering background, and sometimes we create evals that reflect that,” Elsen said. “So they’re either extremely math-heavy, which is a great, useful task, but advancing the frontiers of human mathematics is not what customers are trying to do with Databricks.”
While AI is commonly used for customer support and coding apps, Databricks’ customer base has a broader set of requirements. Elsen noted that answering questions about documents or corpora of documents is a common enterprise task. These tasks require parsing complex tables with nested headers, retrieving information across dozens or hundreds of documents and performing calculations where a single-digit error can cascade into organizations making incorrect business decisions.
Building a benchmark that mirrors enterprise document complexity
To create a meaningful test of grounded reasoning capabilities, Databricks needed a dataset that approximates the messy reality of proprietary enterprise document corpora, while remaining freely available for research. The team landed on U.S. Treasury Bulletins, published monthly for five decades beginning in 1939 and quarterly thereafter.
The Treasury Bulletins check every box for enterprise document complexity. Each bulletin runs 100 to 200 pages and consists of prose, complex tables, charts and figures describing Treasury operations: Where federal money came from, where it went and how it financed government operations. The corpus spans approximately 89,000 pages across eight decades. Until 1996, the bulletins were scans of physical documents; afterwards, they were digitally produced PDFs. USAFacts, an organization whose mission is “to make government data easier to access and understand,” partnered with Databricks to develop the benchmark, identifying Treasury Bulletins as ideal and ensuring questions reflected realistic use cases.
The 246 questions require agents to handle messy, real-world document challenges: Scanned images, hierarchical table structures, temporal data spanning multiple reports and the need for external knowledge like inflation adjustments. Questions range from simple value lookups to multi-step analysis requiring statistical calculations and cross-year comparisons.
To ensure the benchmark requires actual document-grounded retrieval, Databricks filtered out questions that LLMs could answer using parametric knowledge or web search alone. This removed simpler questions and some surprisingly complex ones where models leveraged historical financial data memorized during pre-training.
Every question has a validated ground truth answer (typically a number, sometimes dates or small lists), enabling automated evaluation without human judging. This design choice matters: It enables reinforcement learning (RL) approaches that require verifiable rewards, similar to how models train on coding problems.
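To make the idea of automated, judge-free scoring concrete, here is a minimal sketch of how an answer could be checked against a ground truth value. It is not Databricks’ published scoring code: the function names, the 1% relative tolerance and the exact-match fallback for dates and other text are all assumptions made for illustration.

```python
import math
import re

# A string counts as numeric only if the whole answer is a number
# once separators and currency/percent symbols are stripped.
_NUMERIC = re.compile(r"^-?\d+(\.\d+)?$")


def parse_number(text):
    """Return the answer as a float, or None if it is not purely numeric."""
    cleaned = text.strip().replace(",", "").replace("$", "").rstrip("%").strip()
    if _NUMERIC.match(cleaned):
        return float(cleaned)
    return None


def is_correct(predicted, truth, rel_tol=0.01):
    """Compare a predicted answer with the ground truth.

    Numeric answers use a relative tolerance (1% here, an assumed value);
    everything else falls back to a case-insensitive exact match.
    """
    truth_value = parse_number(truth)
    predicted_value = parse_number(predicted)
    if truth_value is not None and predicted_value is not None:
        return math.isclose(predicted_value, truth_value, rel_tol=rel_tol)
    return predicted.strip().lower() == truth.strip().lower()


def accuracy(pairs):
    """Fraction of (predicted, truth) pairs judged correct."""
    if not pairs:
        return 0.0
    return sum(is_correct(p, t) for p, t in pairs) / len(pairs)
```

Because `is_correct` returns a plain true or false, the same function could in principle serve as a verifiable reward signal, which is the property the RL comparison relies on.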
Current performance exposes fundamental gaps
Databricks tested Claude Opus 4.5 Agent (using Claude’s SDK) and GPT-5.1 Agent (using OpenAI’s File Search API). The results should give pause to any enterprise betting heavily on current agent capabilities.
When provided with raw PDF documents:
However, performance improved noticeably when the agents were given pre-parsed versions of pages produced with Databricks’ ai_parse_document, indicating that the poor raw PDF performance stems from LLM APIs struggling with parsing rather than reasoning. Even with parsed documents, the experiments show room for improvement.
When provided with documents parsed using Databricks’ ai_parse_document:
Three findings that matter for enterprise deployments
The testing identified critical insights for practitioners:
Parsing remains the fundamental blocker: Complex tables with nested headers, merged cells and unusual formatting frequently produce misaligned values. Even when given exact oracle pages, agents struggled primarily due to parsing errors, although performance roughly doubled with pre-parsed documents.
Document versioning creates ambiguity: Financial and regulatory documents get revised and reissued, meaning multiple valid answers exist depending on the publication date. Agents often stop searching once they find a plausible answer, missing more authoritative sources.
Visual reasoning is a gap: About 3% of questions require chart or graph interpretation, where current agents consistently fail. For enterprises where data visualizations communicate critical insights, this represents a meaningful capability limitation.
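The versioning finding above can be illustrated with a short sketch. Instead of stopping at the first plausible figure, an agent could collect every candidate value along with the publication date of the bulletin it came from, then prefer the latest one. The `Candidate` type and `most_authoritative` helper are hypothetical names, and treating the most recent publication as the most authoritative is an assumption; a question that asks for a figure as originally reported would need the opposite choice.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Candidate:
    """One value found for a question, with the bulletin it came from."""
    value: float
    published: date
    source: str


def most_authoritative(candidates):
    """Pick the value from the latest publication.

    Assumes later bulletins carry revised figures that supersede earlier ones.
    Returns None when no candidates were found.
    """
    if not candidates:
        return None
    return max(candidates, key=lambda c: c.published)
```

The point of the design is that the search step and the selection step are separate: retrieval gathers all versions first, and only then is one chosen.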
How enterprises can use OfficeQA
The benchmark’s design enables specific improvement paths beyond simple scoring.
“Since you’re able to look at the right answer, it’s easy to tell if the error is coming from parsing,” Elsen explained.
This automated evaluation enables rapid iteration on parsing pipelines. The verified ground truth answers also enable RL training similar to coding benchmarks, since there’s no human judgment required.
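One way to act on that observation, for simple lookup questions where the expected figure should appear verbatim on the page, is to check whether the ground truth value survives parsing at all. The sketch below is illustrative only: the function names are hypothetical, it ignores signs and units, and it does not apply to answers that must be computed from several values.

```python
import re

# Matches figures such as 1,234.5 or 42; signs and units are ignored.
_NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def numbers_in(text):
    """Collect every numeric token in the text, dropping thousands separators."""
    return {float(token.replace(",", "")) for token in _NUMBER.findall(text)}


def likely_parsing_error(expected_value, parsed_page_text):
    """True when the expected figure is missing from the parsed page.

    If the value is absent, the parser probably garbled it; if it is present
    and the agent still answered wrongly, the fault lies elsewhere
    (retrieval or reasoning).
    """
    return abs(expected_value) not in numbers_in(parsed_page_text)
```

For instance, a parser that splits "1,234.5" into "1,2 34.5" would trip this check, flagging the page for parser work rather than prompt or retrieval changes.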
Elsen said the benchmark provides “a really strong feedback signal” for developers working on search solutions. However, he cautioned against treating it as training data.
“At least in my imagination, the goal of releasing this is more as an eval and not as a source of raw training data,” he said. “If you tune too specifically into this environment, then it’s not clear how generalizable your agent results would be.”
What this implies for enterprise AI deployments
For enterprises currently deploying or planning document-heavy AI agent systems, OfficeQA provides a sobering reality check. Even the latest frontier models achieve only 43% accuracy on unprocessed PDFs and fall short of 70% accuracy even with optimal document parsing. Performance on the hardest questions plateaus at 40%, indicating substantial room for improvement.
Three immediate implications:
Evaluate your document complexity: If your documents resemble the complexity profile of Treasury Bulletins (scanned images, nested table structures, cross-document references), expect accuracy well below vendor marketing claims. Test on your actual documents before production deployment.
Plan for the parsing bottleneck: The test results indicate that parsing remains a fundamental blocker. Budget time and resources for custom parsing solutions rather than assuming off-the-shelf OCR will suffice.
Plan for hard question failure modes: Even with optimal parsing, agents plateau at 40% on complex multi-step questions. For mission-critical document workflows that require multi-document analysis, statistical calculations or visual reasoning, current agent capabilities may not be ready without significant human oversight.
For enterprises looking to lead in AI-powered document intelligence, this benchmark provides a concrete evaluation framework and identifies specific capability gaps that need solving.