How much of a frontier model's benchmark score is the model, and how much is the harness around it? To find out, we collected 64 same-model, different-harness pairs from public leaderboards across 9 agentic benchmarks. The median gap is ~16 percentage points.
Synopticon Research · May 11, 2026
On CORE-Bench Hard, Claude Opus 4.5 scores 42% with Princeton's baseline CORE-Agent and 78% with Anthropic's Claude Code. The model and the tasks are identical; the harness around the model (its tools, its agent loop, its system prompt) is not. We collected 63 more same-model pairs from public leaderboards across 9 agentic benchmarks. The gap belongs to the harness.
Most pairs come from HAL, Princeton's Holistic Agent Leaderboard, which runs each model under multiple frameworks on standardised infrastructure across six benchmarks: GAIA (web research), SWE-bench Verified Mini (coding), CORE-Bench Hard (scientific reproducibility), TAU-bench Airline (customer service), Online Mind2Web (web tasks), and ScienceAgentBench. The rest come from swe-bench/experiments (Verified, Lite, Multimodal) and the ARC Prize board.
The aggregate median is 15.6 pp. The per-benchmark spread is wider.
Each bar is the median of |delta_pp| across that benchmark's same-model pairs. SWE-bench Verified Mini has the most pairs (n = 14) and the widest spread; Online Mind2Web (n = 6) the narrowest.
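As a rough sketch of how each bar could be computed, the snippet below groups same-model pairs by benchmark and takes the median of the absolute score difference. The benchmark names and scores are made-up placeholders, not rows from the dataset, and `median_abs_delta` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import median

# Hypothetical rows: (benchmark, score_a_pct, score_b_pct) for one same-model pair.
# These values are invented for illustration only.
pairs = [
    ("bench_x", 40.0, 62.0),
    ("bench_x", 55.0, 45.0),
    ("bench_x", 30.0, 61.0),
    ("bench_y", 28.0, 34.0),
    ("bench_y", 41.0, 33.0),
]

def median_abs_delta(rows):
    """Group pairs by benchmark and return the median |score_b - score_a| in pp."""
    by_bench = defaultdict(list)
    for bench, score_a, score_b in rows:
        by_bench[bench].append(abs(score_b - score_a))
    return {bench: median(deltas) for bench, deltas in by_bench.items()}

bars = median_abs_delta(pairs)
```

Taking the absolute value means the bar ignores which harness came out ahead; it only measures how far the score moves when the harness changes.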
The harness Δ is biggest where the leaderboard is still being figured out and smallest where it has settled on a winning design. The top benchmarks (SWE-bench Verified Mini, GAIA, TAU-bench Airline) host both purpose-built scaffolds and generic agents competing on the same model; different design choices produce different scores. The bottom benchmarks (Online Mind2Web, SWE-bench Verified) host scaffolds that have converged on similar designs, so swapping the harness barely moves the score.
A purpose-built scaffold encodes three things a generic agent does not. A tool surface fitted to the benchmark: file-edit and pytest tools for code, DOM and click tools for web. An agent loop tuned to that task class's failure modes: re-plan after a failed test, re-locate after a navigation change. And a system prompt that frames the task (role, output format, how to call the tools) so the model spends compute on solving rather than discovering the harness.
A generic agent has none of that. It has to discover the right tool, the right loop, and the right framing inside the eval. The harness Δ is the cost of that discovery. Web tasks have converged on Browser-Use and SeeAct, so the gap is small. Code tasks have not: SWE-Agent, Agentless, OpenHands, ACoder, and Refact.ai encode different bets, and those bets produce different scores.
If specialised scaffolds reliably beat generic agents, every dot should sit above the y = x diagonal.
The subset is 36 pairs where one side is benchmark-specific (SWE-Agent, TAU-bench Tool Calling, Claude Code, CORE-Agent, SAB Self-Debug) and the other is generic (HAL Generalist, HF Open Deep Research, RAG, direct prompting). Pairs where both sides are specialised or both generic are excluded; they don't test the question.
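The filter described above can be sketched as follows. The two harness sets use the names listed in the text; the `orient` helper, the tuple layout, and the first two example scores are assumptions made for illustration.

```python
# Harness labels as named in the text; the classification itself is the authors'.
SPECIALISED = {"SWE-Agent", "TAU-bench Tool Calling", "Claude Code",
               "CORE-Agent", "SAB Self-Debug"}
GENERIC = {"HAL Generalist", "HF Open Deep Research", "RAG", "direct prompting"}

def orient(pair):
    """Return (generic_score, specialised_score), or None if the pair is not mixed."""
    (harness_1, score_1), (harness_2, score_2) = pair
    if harness_1 in SPECIALISED and harness_2 in GENERIC:
        return (score_2, score_1)
    if harness_1 in GENERIC and harness_2 in SPECIALISED:
        return (score_1, score_2)
    return None  # both specialised or both generic: excluded

# Hypothetical pairs; the first two score pairs are invented.
pairs = [
    (("SWE-Agent", 50.0), ("HAL Generalist", 30.0)),
    (("RAG", 20.0), ("SAB Self-Debug", 35.0)),
    (("Claude Code", 78.0), ("CORE-Agent", 42.0)),  # both specialised, dropped
]

points = [p for p in map(orient, pairs) if p is not None]
# A dot sits above y = x when the specialised score exceeds the generic one.
above_diagonal = sum(1 for generic, specialised in points if specialised > generic)
```

Orienting every pair as (generic, specialised) is what makes "above the diagonal" mean the same thing for every dot.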
On the benchmarks where the comparison is possible, the harness layer accounts for 30–50% of the score.
Both on-diagonal dots are Online Mind2Web, the most converged benchmark. Where design space is still open (code, customer service, scientific reproducibility), the specialised harness does 30–50% of the work. A model-only baseline does not represent what the same model can do under a real scaffold.
Anthropic builds both Claude and Claude Code. If integration is a real moat, the same Claude should beat itself under a third-party harness. Anthropic is the only frontier lab whose same-model harness premium is in the public record in a format we can test; the OpenAI and Google equivalents come later.
CORE-Bench Hard (scientific code reproducibility, 45 tasks) is the public benchmark that runs the same Claude under both Claude Code and a third-party baseline (CORE-Agent). HAL has reported this for three Claude generations. Claude Code wins by 18–36 pp in all three.
The premium grows with the model. Opus 4.5 gains 35.6 pp, Sonnet 4 gains 17.8 pp. Newer Claudes get more from Claude Code, not less. That is an 18–36 pp swing from one harness choice, invisible in raw model benchmarks. OpenAI has not published a Codex-CLI-vs-other-harness comparison on the same OpenAI model; Google has published informal Jules-vs-Gemini-CLI numbers but nothing in this format. The integration-moat hypothesis is tested on one lab so far; the other two await data.
Harness cost spans $1 to $1,600 per task on HAL. If you pay 1,000× more, do you get 1,000× more Δ?
HAL publishes a COST (USD) column; 43 of 64 pairs have published cost for both harnesses. If price predicted performance, we would see a clear upward trend. We see scatter. The most efficient pair is DeepSeek V3 on ScienceAgentBench: +14.7 pp for $2.09 per task, or 7.0 pp per dollar. The least efficient is Browser-Use on Online Mind2Web with Claude models: $1,150–$1,577 per task for 2–10 pp.
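The efficiency figures are simple ratios, and a short check makes the spread concrete. This sketch assumes pp-per-dollar means Δ divided by the quoted per-task cost, which is the reading that reproduces the 7.0 figure. Which Δ goes with which Browser-Use cost is not stated, so the second ratio is an upper bound (largest Δ over smallest cost), not a reported number.

```python
def pp_per_dollar(delta_pp, cost_usd_per_task):
    """Percentage points gained per dollar spent per task (assumed definition)."""
    return delta_pp / cost_usd_per_task

# Most efficient pair quoted in the text: +14.7 pp at $2.09 per task.
best = pp_per_dollar(14.7, 2.09)

# Least efficient pair: only ranges are quoted (2-10 pp, $1,150-$1,577),
# so take the most favourable combination as an upper bound.
worst_upper_bound = pp_per_dollar(10, 1150)

# How many times more efficient the best pair is than that upper bound.
ratio = best / worst_upper_bound
```

Even on the most generous reading of the Browser-Use numbers, the two ends of the range differ by several hundred times in pp per dollar.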
Five of the top six pairs by pp-per-dollar are TAU-bench Tool Calling vs HAL Generalist Agent on customer-service tasks. Design choices that score higher also cost less, the opposite of what catalog pricing would predict. Specialisation wins twice: quality and price.
Two natural objections both predict a shrinking harness Δ. (a) Smarter models internalise the work scaffolds do. (b) Scaffold design matures and closes the headroom on base models. Both predict the wrong sign.
AAII is Artificial Analysis's composite index across reasoning, coding, math, and tool-use evals: a single number scoring frontier models from ~10 (GPT-4o-mini class) to ~60 (GPT-5.5 Pro class). We matched 38 of 64 pairs directly. The 26 unmatched are mostly older Claude 4 generations, now superseded.
If smarter models needed less help, we would see a negative slope. We see a faintly positive one. Over the range we sampled, the harness premium does not shrink as model capability rises. A GPT-5.5-class model under the right scaffold beats itself under the wrong one by roughly the same margin a Claude 3-class model would.
Each dot is anchored to its base model's release date, not the harness submission date. Data spans Nov 2023 (GPT-4 1106) to Oct 2025 (Claude Sonnet 4.5). Newer models are over-represented because they have more leaderboard submissions.
The convergence story (scaffolds hit diminishing returns as base models improve) predicts a falling slope. We get a rising one. Scaffold research is outpacing model research on agentic tasks. Caveat: most new scaffolds in our dataset (SWE-Agent, OpenHands, Refact.ai, EPAM AI/Run, ACoder) are code-specific, where the design space is widest. Whether the trend holds outside code is the open question.
The previous section showed the gap is not closing. The pooled view quantifies how much it has grown, and the per-benchmark view shows where. Smoothed-median trends across 64 pairs and nine benchmarks put the gap between vanilla and best-harness scores at +20 pp in late 2023, widening to +23 pp by late 2025.
The vanilla scaffold's median rose from ~5% (best a 2023 model could do in a bare agent loop) to ~42% (best from a late-2025 Claude in the same loop). The best-harness median rose from ~26% to ~76%. The harness kept adding more on top, not less.
That is the pooled view. Decompose per benchmark and the picture sharpens: for each eval, fit two OLS slopes over release date and plot each benchmark as one dot in slope-vs-slope space.
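A minimal sketch of the two-slope fit for a single benchmark, using a hand-rolled least-squares slope so it needs no third-party libraries. The rows are invented, and measuring time in months since the earliest release is an assumption; confidence intervals are not computed here.

```python
def ols_slope(xs, ys):
    """Ordinary least-squares slope of y on x (intercept fitted implicitly)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Hypothetical rows for one benchmark:
# (months since earliest model release, vanilla score %, best-harness score %)
rows = [
    (0, 10.0, 30.0),
    (6, 14.0, 42.0),
    (12, 16.0, 54.0),
    (18, 20.0, 66.0),
]

months = [r[0] for r in rows]
vanilla_slope = ols_slope(months, [r[1] for r in rows])   # pp per month
harness_slope = ols_slope(months, [r[2] for r in rows])   # pp per month

# One dot per benchmark: x = vanilla slope, y = best-harness slope.
dot = (vanilla_slope, harness_slope)
harness_is_pulling_ahead = harness_slope > vanilla_slope
```

A dot above the diagonal means the best-harness score is improving faster than the vanilla score on that benchmark; with n = 4 as in this toy example, the uncertainty on either slope would be large.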
Three benchmarks have CIs that don't cross the diagonal. CORE-Bench Hard (harness +6.0 pp/mo, vanilla +3.2) and GAIA (harness +4.2, vanilla +0.3) sit above the diagonal; the harness is winning on both. SWE-bench Lite sits below; the model is. The other six cluster on the diagonal with overlapping CIs; their gaps are indistinguishable from zero at n = 4–14.
Where the scaffold's design space is largely solved, base-model gains dominate; where it isn't, the scaffold keeps pulling ahead. SWE-bench is the most-studied agentic eval, with more than two years of harness iteration since its 2023 release. CORE-Bench is newer, less crowded, and its best entry is Claude Code, which Anthropic redesigned for research replication six months ago. GAIA's top scaffold is the HAL Generalist Agent, similarly young. Whether the pattern survives more releases is the open question.
Two caveats. First, N is small (4 to 14 per benchmark); the wide error crosses on most dots make this visible. Second, CORE-Bench's "harness winning" result is on Anthropic's home turf. The same lab makes both the model and the harness, so the slope can be read as scaffold research outpacing model research, or as one vertically integrated lab outpacing the field.
Three recent papers measure the same phenomenon at different layers of the agent stack. Together they corroborate our finding, sharpen one of the implications, and quantify a limitation.
A skill is not a harness. A harness wraps the model and runs the agent loop. A skill is a composable module (a markdown file plus optional scripts) that the harness loads at runtime to specialise for a task. The harness sits one layer above the model; the skill sits one layer below the harness.
SkillsBench (Feb 2026) measures the lift from curated skills across seven model × harness configurations on 84 tasks across 11 domains. Their average lift is +16.2 pp. Our dataset reports a median +15.6 pp at the layer above. Two independent groups, one layer apart, the same magnitude.
| Layer | What it adds | Paper | Effect |
|---|---|---|---|
| Harness | Agent loop that wraps the model | This piece (64 pairs) | +15.6 pp median |
| Skill | Composable module the harness loads at runtime | SkillsBench (7 configs × 84 tasks) | +16.2 pp average |
The agent stack compounds. Each layer contributes roughly 15 pp on top of the model on the kinds of tasks both papers measure.
SkillsBench reports that Claude Code shows the most consistent skill uplift (+13.9 to +23.3 pp), and that Codex CLI "frequently neglects provided Skills". Gemini CLI also uses skills reliably (+13.6 to +17.4 pp) in their tests. So the cleaner reading is: Claude Code and Gemini CLI consume the skill layer; Codex CLI does not. The Anthropic premium claim is narrower than "best on every front". It is best in the public same-model harness comparison we have.
SoK: Agentic Skills (Feb 2026) adds two pieces of nuance. First, self-generated skills (skills an agent writes for itself) degrade performance by an average −1.3 pp. Curation matters. Second, per-domain variance is large: healthcare skills add +51.9 pp, manufacturing +41.9 pp, software engineering only +4.5 pp. The harness Δ also varies by benchmark (our Online Mind2Web median is 8 pp vs SWE-bench Verified Mini's 31 pp). Both layers reward domain-specific design effort.
SWE-Skills-Bench (Mar 2026) tests whether skills help on real-world software engineering rather than agentic benchmarks. The answer is mostly no. Of 49 curated skills tested on Claude Haiku 4.5 + Claude Code, 39 produced zero pass-rate improvement and the average gain was +1.2%. Three skills hurt performance. The leaderboard Δ shrinks roughly an order of magnitude on in-the-wild tasks.
OAgents (ICML 2025) dissects agent design components on GAIA and BrowseComp. Their finding: "the lack of a standard evaluation protocol makes previous works, even open-sourced ones, non-reproducible, with significant variance between random runs." Our dataset is a single snapshot per pair. Individual entries carry run-to-run noise. The medians and IQRs in this piece are less noisy than any single row.
Five claims follow from the data above. Each is anchored to a specific pair. None is a recommendation.
Model labs have raised more than $200B cumulatively. Harness-layer companies have raised a small fraction of that. The harness accounts for 30–50% of score on agentic benchmarks where the comparison is possible.
On CORE-Bench Hard, Claude Code beats the third-party CORE-Agent baseline by 18–36 pp across the three Claude generations HAL has tested with Claude Code. OpenAI and Google have not published equivalent same-model harness comparisons in this format. Model-only leaderboards miss this layer entirely.
The market has priced the harness layer in code. Cursor's last round was ~$29B (Nov 2025), with April 2026 reporting putting it in talks at ~$50B; Cognition closed at ~$10B (Sept 2025) and is reportedly in talks near $25B. The pricing reflects the same asymmetry the data shows: a public leaderboard with a measurable Δ on every release. Legal, clinical, finance, and accounting agents have no such leaderboard, so the Δ is unobserved and the corresponding pricing isn't there.
If a third-party harness explains 30%+ of the score and anyone can buy it, the model API is a commodity input. Cohere, Together, and Mistral La Plateforme are the most exposed. OpenRouter and Portkey are picks-and-shovels on the same trend.
SWE-Agent, OpenHands, Aider, and Browser-Use are open-source frameworks that have repeatedly anchored top SWE-bench submissions. The proprietary scaffolds at the top of the same boards (Live-SWE-agent, Augment, and others) share open-source design heritage. The frontier design lineage is in public repos, which means the data and the code behind a top score are both observable.
Four caveats, in order of how much they undercut the thesis.
A pair is two entries on the same public leaderboard where (1) the base model is identical after normalising version strings, (2) the benchmark is identical, and (3) the agentic framework differs. Reasoning-effort, sample-count, and skill-toggle changes are excluded; they are not framework changes.
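The three conditions can be sketched as a grouping step. Everything here is illustrative: `normalise_model` is a hypothetical stand-in for the real version-string normalisation, the entry layout is assumed, and emitting every combination within a group is one possible choice rather than the authors' documented procedure.

```python
import re
from itertools import combinations

def normalise_model(name):
    """Hypothetical normaliser: lowercase and collapse spaces, underscores, hyphens."""
    return re.sub(r"[\s_\-]+", "-", name.strip().lower())

# Assumed entry layout; the third row is invented for illustration.
entries = [
    {"benchmark": "CORE-Bench Hard", "model": "Claude Opus 4.5",
     "harness": "CORE-Agent", "score": 42.0},
    {"benchmark": "CORE-Bench Hard", "model": "claude-opus-4.5",
     "harness": "Claude Code", "score": 78.0},
    {"benchmark": "CORE-Bench Hard", "model": "Hypothetical Model X",
     "harness": "CORE-Agent", "score": 30.0},
]

def same_model_pairs(rows):
    """Pair entries that share benchmark and normalised model but differ in harness."""
    groups = {}
    for row in rows:
        key = (row["benchmark"], normalise_model(row["model"]))
        groups.setdefault(key, []).append(row)
    pairs = []
    for key, group in groups.items():
        for a, b in combinations(group, 2):
            if a["harness"] != b["harness"]:
                delta_pp = abs(a["score"] - b["score"])
                pairs.append((key, a["harness"], b["harness"], delta_pp))
    return pairs

pairs = same_model_pairs(entries)
```

The exclusions in the definition (reasoning effort, sample count, skill toggles) would need extra fields on each entry and are not modelled in this sketch.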
Sources: HAL leaderboards (Princeton, six benchmarks, scraped via Playwright); swe-bench/experiments repository (Verified, Lite, Multimodal splits, scored as n_resolved / n_total, then lowest- vs highest-scoring system per model); ARC Prize leaderboard. Cost-efficiency: 43 / 64 pairs where HAL publishes COST (USD) for both harnesses. AAII source: Artificial Analysis snapshot 2026-05-09 (38 / 64 pairs matched).
Headline numbers are over all 64 pairs: median |Δ| 15.6 pp; mean 18.6 pp; p90 37.3 pp; max 48 pp. 70% of pairs see ≥10 pp, 39% see ≥20 pp.
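For readers who want to recompute statistics of this kind from the CSV, here is a sketch using only the standard library. The `deltas` list is invented, and the percentile method (`method="inclusive"`) is an assumption; the text does not say which interpolation was used, and other methods give slightly different p90 values.

```python
from statistics import mean, median, quantiles

# Hypothetical |delta| values in pp; not the real 64-pair dataset.
deltas = [2.0, 5.0, 8.0, 12.0, 15.0, 16.0, 20.0, 25.0, 36.0, 48.0]

def share_at_least(values, threshold):
    """Fraction of values greater than or equal to the threshold."""
    return sum(v >= threshold for v in values) / len(values)

summary = {
    "median": median(deltas),
    "mean": mean(deltas),
    # Last of the nine decile cut points is the 90th percentile.
    "p90": quantiles(deltas, n=10, method="inclusive")[-1],
    "max": max(deltas),
    "share_ge_10": share_at_least(deltas, 10),
    "share_ge_20": share_at_least(deltas, 20),
}
```

Reporting the median alongside the mean and p90 is useful here because the distribution is right-skewed: a few very large gaps pull the mean above the median.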
Run-to-run noise. Each pair is a single snapshot. OAgents (ICML 2025) reports significant variance on the same model + agent + benchmark combination across re-runs. Medians and IQRs in this piece are less noisy than any single row.
Excluded from the dataset. OSWorld (113 entries parsed but no qualifying pairs, since variation comes from step-budget changes); Cybench (single entry per model); Anthropic / OpenAI internal evals (not public); ARC-AGI-2 (only one same-model pair available, dropped to avoid distorting per-benchmark statistics).
Skill vs harness. We measure framework-vs-framework effects at the harness layer. SkillsBench measures skill-vs-no-skill effects one layer below.
Scripts: research/harness-effect/scripts/. Dataset: research/harness-effect/data/harness_pairs.csv.