Business Operations Real Eval (BORE)
A benchmark for the document work that businesses actually hand to AI. Rankings come from code, not opinion — no LLM judge, no fuzzy similarity matching, no quiet manual review. Every score traces back to an exact check against ground truth, and you can reproduce all of it offline from the model outputs we publish.
The short version: as of May 2026, GPT-5.5 is on top at 85.1% task completion, ahead of Claude Opus 4.8 (83.8%) and Gemini 2.5 Pro (83.7%). That's across 13 models and 5 business-operations workflows.
Most leaderboards measure something other than the job. LMSYS captures chatbot feel, MMLU tests textbook knowledge, AutomationBench wires up SaaS APIs. None of them answer the question a business actually asks: can this model read my messy documents, apply my rules, and give me output I can check? That's what BORE measures. It's narrower than most leaderboards, and a lot harder to argue with.
Models are ranked on Score: the share of deterministic checks each model passes (its completion_rate). We also track reasoning quality, schema strictness, and factual fidelity, but those are diagnostics and never move the ranking. bank_reconciliation is held back from the public board until an accountant signs off on the ground truth.
| # | Model | Provider | Score |
|---|---|---|---|
| 1 | GPT-5.5 | OpenAI | 85.1% |
| 2 | Claude Opus 4.8 | Anthropic | 83.8% |
| 3 | Gemini 2.5 Pro | Google | 83.7% |
| 4 | o4-mini | OpenAI | 83.2% |
| 5 | Claude Opus 4.7 | Anthropic | 82.9% |
| 6 | o3 | OpenAI | 82.7% |
| 7 | Gemini 2.5 Flash | Google | 82.0% |
| 8 | Claude Sonnet 4.6 | Anthropic | 81.0% |
| 9 | GPT-4.1 | OpenAI | 80.6% |
| 10 | Claude Haiku 4.5 | Anthropic | 80.5% |
| 11 | GPT-4o | OpenAI | 79.7% |
| 12 | GPT-4.1 Mini | OpenAI | 79.5% |
| 13 | GPT-4.1 Nano | OpenAI | 66.5% |
Five document workflows that companies are already running on AI — the unglamorous middle-office work, not API plumbing or chat polish. Each one ends in structured output you can check, so each one gets scored by code instead of opinion.
Classify each inbound message (complaint, question, cancellation, escalation), score urgency, and route it to the right team. Tier 1 is a clean one-line request; Tier 3 is a forwarded chain where the real ask is buried three replies deep and contradicts the subject line.
e.g. "FW: RE: RE: Quick question" — actually a legal threat buried in paragraph four
Scored by: Exact enum match on classification, urgency, and route. Action narratives are diagnostic only.
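As a rough illustration of what exact enum matching means in practice, here is a minimal sketch. The field names, labels, and the `score_triage` helper are assumptions made up for this example, not the benchmark's actual schema or scorer.

```python
# Hypothetical field names; the benchmark's actual schema is not shown here.
SCORED_FIELDS = ("classification", "urgency", "route")

def score_triage(predicted: dict, expected: dict) -> dict:
    """One pass/fail check per scored field, using strict string equality."""
    return {field: predicted.get(field) == expected[field] for field in SCORED_FIELDS}

expected = {"classification": "escalation", "urgency": "high", "route": "legal"}
predicted = {
    "classification": "escalation",
    "urgency": "High",  # wrong casing: not an exact match, so it fails
    "route": "legal",
    "action_narrative": "Forward to counsel today.",  # diagnostic only, never scored
}

checks = score_triage(predicted, expected)
print(checks)  # {'classification': True, 'urgency': False, 'route': True}
```

Strict equality is the point of the sketch: a label that is almost right earns nothing, and the narrative field is never read by the scorer.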
Pull structured fields from messy multi-page documents — patient intake, vendor applications, insurance submissions. Names, dates, policy numbers, and diagnoses are scattered across narrative text and inconsistent formatting. Higher tiers add missing fields, contradictory entries, and data split across pages.
e.g. A 6-page patient intake where the allergies on page 2 contradict page 5
Scored by: Field-level exact match and ID/set F1 against ground-truth records.
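Field-level exact match could be computed along these lines. This is a sketch under assumptions: the field names and sample record are invented, and treating an absent field as a miss is one plausible convention rather than the documented one.

```python
def field_exact_match(predicted: dict, expected: dict) -> float:
    """Share of ground-truth fields reproduced exactly; absent fields count as misses."""
    if not expected:
        return 1.0
    hits = sum(
        1 for key, value in expected.items()
        if key in predicted and predicted[key] == value
    )
    return hits / len(expected)

# Made-up intake record for illustration only.
expected = {
    "patient_name": "Ana Ruiz",
    "date_of_birth": "1984-03-07",
    "policy_number": "PX-20417",
    "allergies": ["penicillin"],
}
predicted = dict(expected, allergies=["penicillin", "latex"])  # one field differs

print(field_exact_match(predicted, expected))  # 0.75
```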
Given extracted data and a set of business rules, make the right call — reimbursement amounts, compliance flags, eligibility. The model has to cite which rule IDs it applied, resolve conflicting rules, and get the arithmetic right. Higher tiers require inferring which rules are relevant; they are not handed to the model.
e.g. Qualifies under Rule 3.2a but is excluded by the age override in Appendix C
Scored by: Cited rule-ID set F1 plus cent-level numeric checks on computed amounts.
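To make the two checks concrete, here is a minimal sketch of set F1 over cited rule IDs and a cent-level amount comparison. The function names and sample values are assumptions, and comparing amounts as exact `Decimal` values with no tolerance is this example's choice, not a confirmed detail of the scorer.

```python
from decimal import Decimal, InvalidOperation

def set_f1(predicted: set, expected: set) -> float:
    """Harmonic mean of precision and recall over two sets of IDs."""
    if not predicted and not expected:
        return 1.0
    true_positives = len(predicted & expected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)

def amount_matches(predicted: str, expected: str) -> bool:
    """Exact decimal comparison: no tolerance, so a one-cent error fails."""
    try:
        return Decimal(predicted) == Decimal(expected)
    except InvalidOperation:
        return False  # unparseable output cannot pass a numeric check

cited = {"3.2a", "C-1"}
truth = {"3.2a", "C-1", "4.1"}  # the model missed one rule
print(round(set_f1(cited, truth), 3))       # 0.8
print(amount_matches("412.50", "412.5"))    # True: same value
print(amount_matches("412.49", "412.50"))   # False: off by a penny
```

Amounts are parsed from strings into `Decimal` rather than `float` so that binary rounding error never decides whether a check passes.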
Translate financial events into journal entries against a provided chart of accounts: select the right accounts, set debit/credit directions, compute amounts, assign the posting period, and flag issues. Higher tiers move from simple cash receipts to accruals, capitalization policy, reversing entries, and intercompany entries.
e.g. A prepaid-expense recognition that must balance to the cent and post to the correct period
Scored by: Account match, debit=credit balance check, cent-level amount checks, exact period match, issue F1.
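The balance gate and tuple matching might look roughly like the sketch below. The `Line` type, the account codes, and the way partial credit is computed are all assumptions for illustration; the source only states that entries are compared as (account, direction, amount, period) tuples and that debits must equal credits.

```python
from decimal import Decimal
from typing import NamedTuple

class Line(NamedTuple):
    account: str      # hypothetical chart-of-accounts code
    direction: str    # "debit" or "credit"
    amount: Decimal
    period: str       # e.g. "2026-01"

def is_balanced(lines: list) -> bool:
    """Hard gate: total debits must equal total credits exactly."""
    debits = sum((ln.amount for ln in lines if ln.direction == "debit"), Decimal("0"))
    credits = sum((ln.amount for ln in lines if ln.direction == "credit"), Decimal("0"))
    return debits == credits

def score_entry(predicted: list, expected: list) -> float:
    """Share of expected lines matched exactly, or 0.0 if the gate fails."""
    if not is_balanced(predicted):
        return 0.0
    return len(set(predicted) & set(expected)) / len(set(expected))

expected = [
    Line("6100", "debit", Decimal("1000.00"), "2026-01"),   # insurance expense
    Line("1400", "credit", Decimal("1000.00"), "2026-01"),  # prepaid insurance
]
wrong_period = [expected[0], expected[1]._replace(period="2026-02")]
unbalanced = [expected[0], expected[1]._replace(amount=Decimal("999.99"))]

print(score_entry(expected, expected))      # 1.0
print(score_entry(wrong_period, expected))  # 0.5
print(score_entry(unbalanced, expected))    # 0.0
```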
Verify month-end close items against supporting data, separate genuine blockers from non-issues, and make the overall readiness call. Higher tiers add contradictory evidence, materiality thresholds, stale data, and circular dependencies.
e.g. Preparer marks an item done, but the supporting numbers don't reconcile
Scored by: Per-item status match, blocker recall/precision, and exact overall-readiness match.
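Blocker precision and recall can be sketched as plain set arithmetic. The item IDs and readiness labels below are invented, and returning 1.0 when a denominator is empty is an assumed convention that the source does not specify.

```python
def blocker_precision_recall(flagged: set, true_blockers: set) -> tuple:
    """Precision: how many flagged items are real. Recall: how many real ones were flagged."""
    true_positives = len(flagged & true_blockers)
    precision = true_positives / len(flagged) if flagged else 1.0
    recall = true_positives / len(true_blockers) if true_blockers else 1.0
    return precision, recall

true_blockers = {"ap-accrual", "bank-rec"}
flagged = {"bank-rec", "fx-reval"}  # one real blocker found, one false alarm

precision, recall = blocker_precision_recall(flagged, true_blockers)
readiness_ok = "not_ready" == "not_ready"  # overall call is a plain exact match

print(precision, recall, readiness_ok)  # 0.5 0.5 True
```

Tracking both numbers separates a model that flags everything (high recall, low precision) from one that misses genuine blockers (the reverse).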
Every task runs across the same difficulty scale. Plenty of models breeze through Tier 1 and then fall apart at Tier 3 and up — that is where the real differences show up. The per-tier curves let you see exactly where each model starts to slip.
Nothing here rides on vibes or preference votes, and no LLM sits anywhere in the ranking path. Every score comes out of scoring code run against ground truth, and anyone can re-run it offline and land on the same numbers.
Public rankings are computed without LLM judges, semantic similarity, or hidden manual review. Every ranked metric is derived from exact enums, IDs, tuples, or cent-level numeric checks, and every published score can be reproduced offline from released model outputs.
Classification, urgency, route, status, and readiness calls are matched against a fixed label set. No partial credit for close-enough wording.
Extracted fields and cited rule IDs are scored as set membership — precision and recall against the ground-truth record.
Journal entries are scored as (account, direction, amount, period) tuples that must line up with the expected posting.
Reimbursements, amounts, and balances are checked to the cent. Off-by-a-penny is wrong, not rounded away.
Hard binary gates (e.g. debits must equal credits) are applied before an entry can earn credit.
Every result carries scorer, schema, prompt, and git hashes. The board refuses to mix incompatible scorer versions within a task.
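A guard against mixing scorer versions could be as small as the sketch below. The record fields (`task`, `scorer_hash`) and the exception name are hypothetical; the source only says each result carries a scorer hash and that versions are never mixed within a task.

```python
from collections import defaultdict

class MixedScorerVersions(ValueError):
    """Raised when one task has results from more than one scorer version."""

def assert_single_scorer_version(results: list) -> None:
    versions = defaultdict(set)
    for result in results:
        versions[result["task"]].add(result["scorer_hash"])
    mixed = {task: sorted(hashes) for task, hashes in versions.items() if len(hashes) > 1}
    if mixed:
        raise MixedScorerVersions(f"incompatible scorer versions: {mixed}")

consistent = [
    {"task": "triage", "scorer_hash": "a1b2"},
    {"task": "triage", "scorer_hash": "a1b2"},
    {"task": "journal", "scorer_hash": "c3d4"},  # a different task may differ
]
mixed = consistent + [{"task": "triage", "scorer_hash": "ffff"}]

assert_single_scorer_version(consistent)  # passes silently
try:
    assert_single_scorer_version(mixed)
except MixedScorerVersions as error:
    print(error)
```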
We do not rank freeform reasoning or narrative quality when the ground truth lacks a defensible ontology. The benchmark rewards checkable work: routing decisions, extracted fields, cited rule IDs, journal-entry tuples, checklist statuses, and arithmetic.
Raw model outputs, parsed outputs, parse/schema flags, and provenance hashes ship as predictions.jsonl. Anyone can re-score without API keys or model calls and land on the same numbers.
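Offline re-scoring from a JSONL file might follow this shape. It is a sketch, not the published tooling: the record keys (`model`, `task_id`, `parsed_output`), the toy scorer, and the in-memory sample standing in for predictions.jsonl are all assumptions.

```python
import io
import json

def rescore(lines, ground_truth: dict, scorer) -> dict:
    """Recompute each model's share of passed checks from saved predictions only."""
    totals = {}
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        checks = scorer(record["parsed_output"], ground_truth[record["task_id"]])
        passed, total = totals.get(record["model"], (0, 0))
        totals[record["model"]] = (passed + sum(checks), total + len(checks))
    return {model: passed / total for model, (passed, total) in totals.items() if total}

def toy_scorer(predicted: dict, expected: dict) -> list:
    """One exact-match check per ground-truth field."""
    return [predicted.get(key) == value for key, value in expected.items()]

ground_truth = {
    "t1": {"route": "legal", "urgency": "high"},
    "t2": {"route": "billing", "urgency": "low"},
}
# Stand-in for open("predictions.jsonl"); any iterable of JSON lines works.
sample = io.StringIO(
    '{"model": "model-a", "task_id": "t1", "parsed_output": {"route": "legal", "urgency": "high"}}\n'
    '{"model": "model-a", "task_id": "t2", "parsed_output": {"route": "billing", "urgency": "high"}}\n'
)

scores = rescore(sample, ground_truth, toy_scorer)
print(scores)  # {'model-a': 0.75}
```

Nothing in the loop calls a model or the network, which is what makes the result repeatable from the released file alone.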
When a scoring bug surfaced (a wrapper-recovery edge case), we fixed the scorer deterministically and re-scored from saved predictions — zero re-runs, zero new model calls.
This run is deterministically pinned. Results that don't share a scorer version within a task are never mixed.
BORE is narrow on purpose. Here's what these numbers actually tell you, and what they don't.
Putting AI agents into real business operations is what we do for a living. The team that built BORE can help you choose a model and actually get it running in production.