BORE

Business Operations Real Eval

A benchmark for the document work that businesses actually hand to AI. Rankings come from code, not opinion — no LLM judge, no fuzzy similarity matching, no quiet manual review. Every score traces back to an exact check against ground truth, and you can reproduce all of it offline from the model outputs we publish.

The short version: as of May 2026, GPT-5.5 is on top at 85.1% task completion, ahead of Claude Opus 4.8 (83.8%) and Gemini 2.5 Pro (83.7%). That's across 13 models and 5 business-operations workflows.

Why we built this

Most leaderboards measure something other than the job. LMSYS captures chatbot feel, MMLU tests textbook knowledge, AutomationBench wires up SaaS APIs. None of them answer the question a business actually asks: can this model read my messy documents, apply my rules, and give me output I can check? That's what BORE measures. It's narrower than most leaderboards, and a lot harder to argue with.

Judge-free rankings · Updated May 2026 · Dataset v1.1 · 13 models · 5 tasks · 13,650 scored units

Leaderboard

Ranked on Score — the share of deterministic checks each model passes (its completion_rate). We also track reasoning quality, schema strictness, and factual fidelity, but those are diagnostics and never move the ranking. bank_reconciliation is held back from the public board until an accountant signs off on the ground truth.

#    Model               Provider    Score
1    GPT-5.5             OpenAI      85.1%
2    Claude Opus 4.8     Anthropic   83.8%
3    Gemini 2.5 Pro      Google      83.7%
4    o4-mini             OpenAI      83.2%
5    Claude Opus 4.7     Anthropic   82.9%
6    o3                  OpenAI      82.7%
7    Gemini 2.5 Flash    Google      82.0%
8    Claude Sonnet 4.6   Anthropic   81.0%
9    GPT-4.1             OpenAI      80.6%
10   Claude Haiku 4.5    Anthropic   80.5%
11   GPT-4o              OpenAI      79.7%
12   GPT-4.1 Mini        OpenAI      79.5%
13   GPT-4.1 Nano        OpenAI      66.5%
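
For readers who want to see what "share of deterministic checks passed" could look like in code, here is a minimal sketch. The CheckResult record and both function names are hypothetical, and pooling every check into one average per model is an assumption; the published scorer may weight tasks or replicates differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    """Hypothetical record: one deterministic check on one model output."""
    model: str
    task: str
    passed: bool


def completion_rate(results, model):
    """Share of checks the given model passed, as a fraction in [0, 1]."""
    own = [r for r in results if r.model == model]
    if not own:
        raise ValueError(f"no results for model {model!r}")
    return sum(r.passed for r in own) / len(own)


def leaderboard(results):
    """Return (model, completion_rate) pairs, best score first."""
    models = sorted({r.model for r in results})
    rows = [(m, completion_rate(results, m)) for m in models]
    # sorted() is stable, so tied models stay in alphabetical order.
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Because the ranking is a plain sort over pass fractions, two people running it on the same check results get the same order.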

What We Test

Five document workflows that companies are already running on AI — the unglamorous middle-office work, not API plumbing or chat polish. Each one ends in structured output you can check, so each one gets scored by code instead of opinion.

Email Triage

Classify each inbound message (complaint, question, cancellation, escalation), score urgency, and route it to the right team. Tier 1 is a clean one-line request; Tier 3 is a forwarded chain where the real ask is buried three replies deep and contradicts the subject line.

e.g. "FW: RE: RE: Quick question" — actually a legal threat buried in paragraph four

Scored by: Exact enum match on classification, urgency, and route. Action narratives are diagnostic only.

Intake Parsing

Pull structured fields from messy multi-page documents — patient intake, vendor applications, insurance submissions. Names, dates, policy numbers, and diagnoses are scattered across narrative text and inconsistent formatting. Higher tiers add missing fields, contradictory entries, and data split across pages.

e.g. A 6-page patient intake where the allergies on page 2 contradict page 5

Scored by: Field-level exact match and ID/set F1 against ground-truth records.

Rule Application

Given extracted data and a set of business rules, make the right call — reimbursement amounts, compliance flags, eligibility. The model has to cite which rule IDs it applied, resolve conflicting rules, and get the arithmetic right. Higher tiers require inferring which rules are relevant; they are not handed to the model.

e.g. Qualifies under Rule 3.2a but is excluded by the age override in Appendix C

Scored by: Cited rule-ID set F1 plus cent-level numeric checks on computed amounts.

Journal Entry

Translate financial events into journal entries against a provided chart of accounts: select the right accounts, set debit/credit directions, compute amounts, assign the posting period, and flag issues. Higher tiers move from simple cash receipts to accruals, capitalization policy, reversing and intercompany entries.

e.g. A prepaid-expense recognition that must balance to the cent and post to the correct period

Scored by: Account match, debit=credit balance check, cent-level amount checks, exact period match, issue F1.

Close Checklist

Verify month-end close items against supporting data, separate genuine blockers from non-issues, and make the overall readiness call. Higher tiers add contradictory evidence, materiality thresholds, stale data, and circular dependencies.

e.g. Preparer marks an item done, but the supporting numbers don't reconcile

Scored by: Per-item status match, blocker recall/precision, and exact overall-readiness match.

5-Tier Difficulty System

Every task runs across the same difficulty scale. Plenty of models breeze through Tier 1 and then fall apart at Tier 3 and up — that is where the real differences show up. The per-tier curves let you see exactly where each model starts to slip.

Tier 1: Clean baseline
Tier 2: Minor ambiguity
Tier 3: Realistic messiness
Tier 4: Implicit context
Tier 5: Adversarial
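
A per-tier curve is just a pass rate grouped by tier. The sketch below assumes each scored check arrives as a (tier, passed) pair, which is a hypothetical shape chosen for illustration and not the released file format.

```python
from collections import defaultdict


def pass_rate_by_tier(results):
    """Map each tier (1-5) to the fraction of its checks that passed.

    `results` is an iterable of (tier, passed) pairs.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for tier, passed in results:
        if tier not in (1, 2, 3, 4, 5):
            raise ValueError(f"unknown tier: {tier!r}")
        totals[tier] += 1
        passes[tier] += int(bool(passed))
    # Tiers with no checks are simply absent from the result.
    return {tier: passes[tier] / totals[tier] for tier in sorted(totals)}
```

Plotting the returned values in tier order for each model gives the kind of curve described above.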

Methodology

No LLM judge, semantic-similarity metric, preference vote, or hidden manual review sits anywhere in the ranking path. Every ranked metric is derived from exact enums, IDs, tuples, or cent-level numeric checks run against ground truth, and every published score can be reproduced offline from the released model outputs.

How ranked scores are computed

Exact enums

Classification, urgency, route, status, and readiness calls are matched against a fixed label set. No partial credit for close-enough wording.
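
A minimal sketch of an exact enum check, using the four email classifications named earlier as the label set. The function name is hypothetical, and the choice to apply no case folding or whitespace trimming is an assumption about how strict "exact" is.

```python
CLASSIFICATIONS = frozenset({"complaint", "question", "cancellation", "escalation"})


def enum_match(predicted, expected, allowed=CLASSIFICATIONS):
    """True only when the prediction is exactly the expected label."""
    if expected not in allowed:
        # A ground-truth label outside the fixed set is a dataset bug.
        raise ValueError(f"ground truth {expected!r} is not a known label")
    # No normalization (assumption): "Complaint" does not match "complaint".
    return predicted == expected
```

Under this reading, a prediction such as "billing complaint" earns nothing even though a human would call it close.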

ID & set F1

Extracted fields and cited rule IDs are scored as set membership — precision and recall against the ground-truth record.
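
Set-based precision, recall, and F1 can be sketched in a few lines. The function name is hypothetical, and scoring an empty prediction against an empty ground truth as perfect is an assumption; the released scorer may treat that case differently.

```python
def set_prf(predicted, expected):
    """Return (precision, recall, f1) for two collections of IDs."""
    predicted, expected = set(predicted), set(expected)
    if not predicted and not expected:
        # Assumption: nothing expected and nothing predicted is a perfect match.
        return 1.0, 1.0, 1.0
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Citing one correct rule ID and one wrong one against two expected IDs gives precision 0.5 and recall 0.5, so an F1 of 0.5.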

Tuple matching

Journal entries are scored as (account, direction, amount, period) tuples that must line up with the expected posting.
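
One way to picture tuple matching is an order-insensitive comparison of journal lines. The Line type, the account codes, and the period format below are hypothetical, and treating the match as all-or-nothing across the whole entry is an assumption.

```python
from collections import Counter
from decimal import Decimal
from typing import NamedTuple


class Line(NamedTuple):
    """Hypothetical journal line: (account, direction, amount, period)."""
    account: str
    direction: str  # "debit" or "credit"
    amount: Decimal
    period: str     # e.g. "2026-05" (format is an assumption)


def lines_match(predicted, expected):
    """True when both entries contain the same lines, in any order.

    Counter keeps duplicates, so two identical lines must appear twice.
    """
    return Counter(predicted) == Counter(expected)
```

Comparing multisets means line order does not matter, while a wrong period or a flipped direction on any single line fails the entry.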

Cent-level numerics

Reimbursements, amounts, and balances are checked to the cent. Off-by-a-penny is wrong, not rounded away.
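
A cent-level comparison is easiest to get right with integer cents instead of floats. This sketch uses Python's decimal module; the function names are hypothetical, and rejecting any value with sub-cent precision (instead of rounding it) is an assumption in the spirit of the rule above.

```python
from decimal import Decimal, InvalidOperation


def to_cents(value):
    """Convert a str, int, float, or Decimal amount to integer cents.

    Raises ValueError for non-numeric, non-finite, or sub-cent values.
    """
    try:
        amount = Decimal(str(value))
    except InvalidOperation:
        raise ValueError(f"not a number: {value!r}") from None
    if not amount.is_finite():
        raise ValueError(f"not a finite amount: {value!r}")
    cents = amount * 100
    if cents != cents.to_integral_value():
        raise ValueError(f"sub-cent precision: {value!r}")
    return int(cents)


def cents_equal(predicted, expected):
    """True only when both amounts are the same number of cents."""
    try:
        predicted_cents = to_cents(predicted)
    except ValueError:
        return False  # an unusable prediction is simply wrong
    return predicted_cents == to_cents(expected)
```

Float artifacts fail this check by design: the float sum 0.1 + 0.2 carries sub-cent noise and does not match "0.30".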

Balance checks

Hard binary gates — e.g. debits must equal credits — applied before an entry can earn credit.
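
The gate idea can be sketched as a check that runs before any content score counts. Function names are hypothetical, and returning zero for an unbalanced entry is an assumption about what "cannot earn credit" means in practice.

```python
from decimal import Decimal


def is_balanced(lines):
    """True when total debits equal total credits.

    `lines` is an iterable of (direction, amount) pairs with Decimal amounts.
    """
    lines = list(lines)  # allow generators; we iterate twice
    debits = sum((amt for side, amt in lines if side == "debit"), Decimal("0"))
    credits = sum((amt for side, amt in lines if side == "credit"), Decimal("0"))
    return debits == credits


def gated_score(lines, content_score):
    """Pass the content score through only if the entry balances."""
    return content_score if is_balanced(lines) else 0.0
```

Putting the gate first keeps an entry with the right accounts but mismatched totals from collecting partial credit.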

Provenance-gated

Every result carries scorer, schema, prompt, and git hashes. The board refuses to mix incompatible scorer versions within a task.
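
A provenance gate of this kind might look like the following sketch. The Provenance record, the function name, and the task identifier used in the usage note are hypothetical; only the four hash kinds come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    """Hypothetical provenance record attached to each result."""
    git: str
    scorer: str
    schema: str
    prompt: str


def assert_single_scorer(rows):
    """Raise ValueError if any task mixes results from different scorers.

    `rows` is an iterable of (task, Provenance) pairs. Returns a dict
    mapping each task to its single scorer hash.
    """
    scorer_by_task = {}
    for task, prov in rows:
        first_seen = scorer_by_task.setdefault(task, prov.scorer)
        if first_seen != prov.scorer:
            raise ValueError(
                f"task {task!r} mixes scorer versions {first_seen} and {prov.scorer}"
            )
    return scorer_by_task
```

Called before aggregation, for example on rows tagged with a task name such as "email_triage", this fails loudly instead of averaging numbers produced by two different scorers.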

What we deliberately do not rank

We do not rank freeform reasoning or narrative quality when the ground truth lacks a defensible ontology. The benchmark rewards checkable work: routing decisions, extracted fields, cited rule IDs, journal-entry tuples, checklist statuses, and arithmetic.

  • Reasoning & narrative fields are diagnostic only. With 97 unique email actions and 87 unique accounting-issue narratives, we don't pretend they form a clean ontology — so they never enter the ranking.
  • Schema strictness (schema_ok) is reported separately, not ranked. Parse failures score zero, but schema-invalid-yet-parseable JSON is still scored on its content.
  • Factual fidelity & graceful degradation are tracked as signals to read alongside the score — never baked into it.
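
The split between parse failures and schema problems can be sketched as two separate flags. The required keys and the example values are hypothetical, and scoring content as the fraction of expected fields matched is an assumption made only to keep the example short.

```python
import json

# Hypothetical schema: exactly these top-level keys, nothing more.
REQUIRED_KEYS = frozenset({"classification", "urgency", "route"})


def parse_output(raw):
    """Return (parsed, parse_ok, schema_ok) for a raw model output string."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None, False, False
    schema_ok = isinstance(parsed, dict) and set(parsed) == REQUIRED_KEYS
    return parsed, True, schema_ok


def score_output(raw, expected):
    """Return (content_score, schema_ok); schema_ok never changes the score."""
    parsed, parse_ok, schema_ok = parse_output(raw)
    if not parse_ok or not isinstance(parsed, dict):
        return 0.0, schema_ok  # unparseable output scores zero
    hits = sum(parsed.get(key) == value for key, value in expected.items())
    return hits / len(expected), schema_ok
```

An output with one extra key still earns full content credit here while schema_ok comes back False, which is the separation the bullets describe.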

Reproducible offline

Raw model outputs, parsed outputs, parse/schema flags, and provenance hashes ship as predictions.jsonl. Anyone can re-score without API keys or model calls and land on the same numbers.

When a scoring bug surfaced (a wrapper-recovery edge case), we fixed the scorer deterministically and re-scored from saved predictions — zero re-runs, zero new model calls.
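
Re-scoring from saved predictions needs nothing beyond the file and a scorer function. In this sketch the row fields "model", "parsed", and "expected" are hypothetical; the real predictions.jsonl layout, and where ground truth lives, may differ.

```python
import json


def rescore(jsonl_lines, scorer):
    """Recompute each model's pass rate from saved prediction rows.

    `jsonl_lines` is any iterable of JSON strings (an open file works).
    `scorer(parsed, expected)` returns True or False and makes no network calls.
    """
    totals = {}
    for line in jsonl_lines:
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)
        passed, seen = totals.get(row["model"], (0, 0))
        outcome = bool(scorer(row["parsed"], row["expected"]))
        totals[row["model"]] = (passed + int(outcome), seen + 1)
    return {model: passed / seen for model, (passed, seen) in totals.items()}
```

With a file on disk this could be called as rescore(open("predictions.jsonl", encoding="utf-8"), my_scorer), where my_scorer is whatever deterministic check applies to the task. Swapping in a fixed scorer and calling it again is the whole re-scoring workflow.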

Provenance

This run, deterministically pinned. Results that don't share a scorer version within a task are never mixed.

git 710dcf4 · scorer a95fbb7 · schema b80d040 · prompt f959139 · dataset v1.1 · replicates 5 per case, temp 0

Scope & honest limits

BORE is narrow on purpose. Here's what these numbers actually tell you, and what they don't.

  • v1.1 covers five business-document tasks.
  • The ranking identifies the best model on this deterministic v1.1 benchmark, and nothing broader than that.
  • Synthetic data, authored from deployment knowledge — no real-client data ever touches the board.
  • Does not rank reasoning quality (it is tracked as a diagnostic only), and is not fully representative of live enterprise performance.
  • bank_reconciliation is held pending audit; schema strictness is diagnostic, not ranked.

Need help choosing the right model for your workflow?

Putting AI agents into real business operations is what we do for a living. The team that built BORE can help you choose a model and actually get it running in production.