May 2026 Benchmarks

The Harness Moves the Score

Median harness Δ 15.6 pp, max 48 pp, on the same model

The gap varies by benchmark and is biggest where a purpose-built scaffold faces a generic agent

y > x: the specialised scaffold wins, on the same model

Source: HAL leaderboards (GAIA, SWE-bench Verified Mini, TAU-bench Airline, CORE-Bench Hard, ScienceAgentBench).

Claude Code adds 18–36 pp over a generic agent on the same Claude model

Source: HAL CORE-Bench Hard. Same model + Claude Code (Anthropic) vs same model + CORE-Agent (Princeton).

Cost efficiency: harness Δ vs USD per task, log-scale x

Source: HAL leaderboards, COST (USD) column. 43 of 64 pairs have published cost for both harnesses.

Smarter models do not get a smaller harness lift

AAII source: Artificial Analysis snapshot 2026-05-09. Harness Δ from this dataset (38 of 64 pairs matched).

The gap is widening, not narrowing

Release dates from Artificial Analysis and model release notes.

The premium grew from +20 pp to +23 pp

Sources: HAL leaderboards (Princeton) · swe-bench/experiments. Smoothing: LOWESS, 60% span over day-precise release dates.

Two benchmarks where the harness is winning, one where the model is

Sources: HAL leaderboards (Princeton) · swe-bench/experiments. Bootstrap CIs computed on this dataset (n = 4 to 14 per benchmark).
April 2026 Markets

How Public Markets Already Own the AI Frontier

Public exposure to Anthropic

Source: Synopticon ownership cards. Stake values gross of tax leakage.

Public exposure to OpenAI

Source: Synopticon ownership cards. Microsoft fair value; carrying value materially lower (see next section).
April 2026 Benchmarks

How Long Can Claude Mythos Work Alone?

METR horizon forecast

Reasoning-era models: Mythos predicted at ~16-hour task complexity

Data: METR-Horizon-v1.1 × Self-reported IRT (Ho et al.). Band = 90% bootstrap CI (400 resamples). Y-axis = human-equivalent task duration.

Full dataset (n=19): linear and quadratic diverge on older models

At the 80% reliability bar, Mythos predicted at ~2–3 task-hours

Six benchmarks predict 5–18 task-hours (median 10.5h)

March 2026 Adoption

Moonshot Built the Engine. Cursor Sold the Car.

API pricing across coding models

Competitive on coding benchmarks, despite the price gap

K2.5: HuggingFace, arXiv 2602.02276 · Composer 2: VentureBeat · SWE-bench: swebench.com · Terminal-Bench: tbench.ai · LiveCodeBench: livecodebench.github.io
Caveats: K2.5 benchmarks are self-reported from model card/paper, not yet on public SWE-bench or Aider leaderboards. SWE-bench Verified excluded: OpenAI retired it in February 2026 citing contamination.

Each step in the pipeline adds measurable performance

K2 Instruct: HuggingFace model card, arXiv 2507.20534, tbench.ai leaderboard · K2.5: arXiv 2602.02276 · Composer 2: Cursor, VentureBeat
All scores confirmed from model cards, tech reports, or public leaderboards. K2 Terminal-Bench 2.0 score (27.8%) from tbench.ai (Terminus 2 agent). SWE-bench Multilingual K2 score (47.3%) from K2 tech report (agentic mode).

How model mix would affect margin (thought experiment)

Cursor ARR: TechCrunch, SaaStr · Pricing: Cursor, Anthropic · SaaS benchmark: Bessemer
All figures illustrative. Actual cost structure not publicly disclosed.