Charts — Synopticon

May 2026 Benchmarks

The Harness Moves the Score


Median harness Δ 15.6 pp, max 48 pp, on the same model

Sources: HAL leaderboards · swe-bench/experiments.
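
As a rough illustration of how a summary like "median harness Δ, max Δ" could be computed, the sketch below takes the same model scored under two harnesses, forms the gap in percentage points, and summarises the gaps. The record layout, the field names, and every score are hypothetical placeholders, not values from HAL or swe-bench/experiments, and the pairing rules used for the chart may differ.

```python
from statistics import median

# Hypothetical paired results: the same model scored under two harnesses.
# Accuracies are percentages; none of these numbers come from the real dataset.
pairs = [
    {"model": "model-a", "benchmark": "bench-1", "generic": 30.0, "specialised": 52.0},
    {"model": "model-b", "benchmark": "bench-1", "generic": 41.5, "specialised": 48.0},
    {"model": "model-a", "benchmark": "bench-2", "generic": 20.0, "specialised": 35.5},
]

def harness_deltas(pairs):
    """Return specialised-minus-generic gaps in percentage points (pp)."""
    return [p["specialised"] - p["generic"] for p in pairs]

deltas = harness_deltas(pairs)
summary = {"median_pp": median(deltas), "max_pp": max(deltas)}
print(summary)
```

The gap is a difference of two percentages, so it is reported in percentage points (pp) rather than as a relative percent change.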


The gap varies by benchmark, biggest where a purpose-built scaffold faces a generic agent

Sources: HAL leaderboards · swe-bench/experiments.


y > x: the specialised scaffold wins, on the same model

Source: HAL leaderboards (GAIA, SWE-bench Verified Mini, TAU-bench Airline, CORE-Bench Hard, ScienceAgentBench).


Claude Code adds 18–36 pp over a generic agent on the same Claude model

Source: HAL CORE-Bench Hard. Same model + Claude Code (Anthropic) vs same model + CORE-Agent (Princeton).


Cost efficiency: harness Δ vs USD per task, log-scale x

Source: HAL leaderboards, COST (USD) column. 43 of 64 pairs have published cost for both harnesses.


Smarter models do not get a smaller harness lift

AAII source: Artificial Analysis snapshot 2026-05-09. Harness Δ from this dataset (38 of 64 pairs matched).


The gap is widening, not narrowing

Release dates from Artificial Analysis and model release notes.


The premium grew from +20 pp to +23 pp

Sources: HAL leaderboards (Princeton) · swe-bench/experiments. Smoothing: LOWESS, 60% span over day-precise release dates.


Two benchmarks where the harness is winning, one where the model is

Sources: HAL leaderboards (Princeton) · swe-bench/experiments. Bootstrap CIs computed on this dataset (n = 4 to 14 per benchmark).
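
For readers unfamiliar with bootstrap intervals on samples this small, here is a minimal percentile-bootstrap sketch. The choice of the mean as the statistic, the 90% level, the 400 resamples, and the sample values are all assumptions made for illustration; the source does not state which settings were used for this chart.

```python
import random
import statistics

def bootstrap_ci(values, stat=statistics.mean, n_resamples=400, level=0.90, seed=0):
    """Percentile bootstrap confidence interval for stat(values)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(stat(rng.choices(values, k=n)) for _ in range(n_resamples))
    alpha = (1.0 - level) / 2.0
    lo_idx = int(alpha * (n_resamples - 1))
    hi_idx = int((1.0 - alpha) * (n_resamples - 1))
    return estimates[lo_idx], estimates[hi_idx]

# Hypothetical harness deltas (pp) for one benchmark, NOT taken from the dataset.
deltas_pp = [4.0, 9.5, 12.0, 15.0, 18.0, 22.5, 31.0]
low, high = bootstrap_ci(deltas_pp)
print(f"mean = {statistics.mean(deltas_pp):.1f} pp, 90% CI = [{low:.1f}, {high:.1f}] pp")
```

With only 4 to 14 points per benchmark, intervals of this kind tend to be wide, so overlapping bands should be expected.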


April 2026 Markets

How Public Markets Already Own the AI Frontier


Public exposure to Anthropic

Source: Synopticon ownership cards. Stake values gross of tax leakage.


Public exposure to OpenAI

Source: Synopticon ownership cards. Microsoft fair value; carrying value materially lower (see next section).


April 2026 Benchmarks

How Long Can Claude Mythos Work Alone?


METR horizon forecast


Reasoning-era models: Mythos predicted at ~16-hour task complexity

Data: METR-Horizon-v1.1 × Self-reported IRT (Ho et al.). Band = 90% bootstrap CI (400 resamples). Y-axis = human-equivalent task duration.


Full dataset (n=19): linear and quadratic diverge on older models


At the 80% reliability bar, Mythos predicted at ~2–3 task-hours


Six benchmarks predict 5–18 task-hours (median 10.5h)


March 2026 Adoption

Moonshot Built the Engine. Cursor Sold the Car.


API pricing across coding models

OpenAI: openai.com/api/pricing · Anthropic: docs.anthropic.com · DeepSeek: api-docs.deepseek.com · Fireworks: fireworks.ai/kimi · Cursor: cursor.com/docs/models-and-pricing · Google: ai.google.dev


Competitive on coding benchmarks, despite the price gap

K2.5: HuggingFace, arXiv 2602.02276 · Composer 2: VentureBeat · SWE-bench: swebench.com · Terminal-Bench: tbench.ai · LiveCodeBench: livecodebench.github.io
Caveats: K2.5 benchmarks are self-reported from model card/paper, not yet on public SWE-bench or Aider leaderboards. SWE-bench Verified excluded: OpenAI retired it in February 2026 citing contamination.


Each step in the pipeline adds measurable performance

K2 Instruct: HuggingFace model card, arXiv 2507.20534, tbench.ai leaderboard · K2.5: arXiv 2602.02276 · Composer 2: Cursor, VentureBeat
All scores confirmed from model cards, tech reports, or public leaderboards. K2 Terminal-Bench 2.0 score (27.8%) from tbench.ai (Terminus 2 agent). SWE-bench Multilingual K2 score (47.3%) from K2 tech report (agentic mode).


How model mix would affect margin (thought experiment)

Cursor ARR: TechCrunch, SaaStr · Pricing: Cursor, Anthropic · SaaS benchmark: Bessemer
All figures illustrative. Actual cost structure not publicly disclosed.
