Charts
All charts
Daily-tracked signals from across the AI capability race — model trajectories, enterprise routing shifts, benchmark events, training compute. New charts added as the data moves.
The Harness Moves the Score
Read the article →
Median harness Δ 15.6 pp, max 48 pp, on the same model
Sources: HAL leaderboards · swe-bench/experiments.
View in article →
The gap varies by benchmark, biggest where a purpose-built scaffold faces a generic agent
Sources: HAL leaderboards · swe-bench/experiments.
View in article →
y > x: the specialised scaffold wins, on the same model
Source: HAL leaderboards (GAIA, SWE-bench Verified Mini, TAU-bench Airline, CORE-Bench Hard, ScienceAgentBench).
View in article →
Claude Code adds 18–36 pp over a generic agent on the same Claude model
Source: HAL CORE-Bench Hard. Same model + Claude Code (Anthropic) vs same model + CORE-Agent (Princeton).
View in article →
Cost efficiency: harness Δ vs USD per task, log-scale x
Source: HAL leaderboards, COST (USD) column. 43 of 64 pairs have published cost for both harnesses.
View in article →
Smarter models do not get a smaller harness lift
AAII source: Artificial Analysis snapshot 2026-05-09. Harness Δ from this dataset (38 of 64 pairs matched).
View in article →
The gap is widening, not narrowing
Release dates from Artificial Analysis and model release notes.
View in article →
The premium grew from +20 pp to +23 pp
Sources: HAL leaderboards (Princeton) · swe-bench/experiments. Smoothing: LOWESS, 60% span over day-precise release dates.
View in article →
Two benchmarks where the harness is winning, one where the model is
Sources: HAL leaderboards (Princeton) · swe-bench/experiments. Bootstrap CIs computed on this dataset (n = 4 to 14 per benchmark).
View in article →
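The per-benchmark bootstrap CIs cited above can be illustrated with a minimal percentile-bootstrap sketch. The source does not describe the exact procedure, so everything here is an assumption for illustration: the function name `bootstrap_ci`, the mean as the statistic, the 90% level and 400 resamples as defaults, and the sample values, which are made-up placeholders rather than HAL data.

```python
import random
import statistics


def bootstrap_ci(deltas, n_resamples=400, level=0.90, seed=0):
    """Percentile bootstrap CI for the mean of harness deltas (in pp).

    Hypothetical helper: resamples the deltas with replacement,
    recomputes the mean each time, and reads the interval off the
    sorted resampled means.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(deltas)
    means = sorted(
        statistics.fmean(deltas[rng.randrange(n)] for _ in range(n))
        for _ in range(n_resamples)
    )
    alpha = (1.0 - level) / 2.0
    lo_idx = int(alpha * (n_resamples - 1))
    hi_idx = int((1.0 - alpha) * (n_resamples - 1))
    return means[lo_idx], means[hi_idx]


# Placeholder deltas for one benchmark (illustrative values, not real data).
example_deltas = [2.0, 5.0, 8.0, 11.0, 16.0, 18.0]
low, high = bootstrap_ci(example_deltas)
print(f"mean = {statistics.fmean(example_deltas):.1f} pp, 90% CI = [{low:.1f}, {high:.1f}] pp")
```

With only 4 to 14 pairs per benchmark, intervals produced this way are wide, which is worth keeping in mind when comparing benchmarks.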
How Public Markets Already Own the AI Frontier
Read the article →
Public exposure to Anthropic
Source: Synopticon ownership cards. Stake values gross of tax leakage.
View in article →
Public exposure to OpenAI
Source: Synopticon ownership cards. Microsoft fair value; carrying value materially lower (see next section).
View in article →
How Long Can Claude Mythos Work Alone?
Read the article →
METR horizon forecast
View in article →
Reasoning-era models: Mythos predicted at ~16-hour task complexity
Data: METR-Horizon-v1.1 × Self-reported IRT (Ho et al.). Band = 90% bootstrap CI (400 resamples). Y-axis = human-equivalent task duration.
View in article →
Full dataset (n=19): linear and quadratic diverge on older models
View in article →
At the 80% reliability bar, Mythos predicted at ~2–3 task-hours
View in article →
Six benchmarks predict 5–18 task-hours (median 10.5h)
View in article →
Moonshot Built the Engine. Cursor Sold the Car.
Read the article →
API pricing across coding models
Sources: OpenAI: openai.com/api/pricing · Anthropic: docs.anthropic.com · DeepSeek: api-docs.deepseek.com · Fireworks: fireworks.ai/kimi · Cursor: cursor.com/docs/models-and-pricing · Google: ai.google.dev
View in article →
Competitive on coding benchmarks, despite the price gap
Sources: K2.5: HuggingFace, arXiv 2602.02276 · Composer 2: VentureBeat · SWE-bench: swebench.com · Terminal-Bench: tbench.ai · LiveCodeBench: livecodebench.github.io
Caveats: K2.5 benchmarks are self-reported from model card/paper, not yet on public SWE-bench or Aider leaderboards. SWE-bench Verified excluded: OpenAI retired it in February 2026 citing contamination.
View in article →
Each step in the pipeline adds measurable performance
Sources: K2 Instruct: HuggingFace model card, arXiv 2507.20534, tbench.ai leaderboard · K2.5: arXiv 2602.02276 · Composer 2: Cursor, VentureBeat
All scores confirmed from model cards, tech reports, or public leaderboards. K2 Terminal-Bench 2.0 score (27.8%) from tbench.ai (Terminus 2 agent). SWE-bench Multilingual K2 score (47.3%) from K2 tech report (agentic mode).
View in article →
How model mix would affect margin (thought experiment)
Sources: Cursor ARR: TechCrunch, SaaStr · Pricing: Cursor, Anthropic · SaaS benchmark: Bessemer
All figures illustrative. Actual cost structure not publicly disclosed.
View in article →