On March 19, Cursor shipped its new coding model on top of a Chinese open-weight base. The pricing gap — and what it says about the API business model — is the real story.
A $29 billion company built its flagship model on a Chinese open-weight base — then forgot to mention it.
Cursor launched Composer 2 on March 19, 2026 to its 1M+ daily active users. The blog post credited "continued pre-training of a base model, combined with reinforcement learning." It did not name the base model.
Within 24 hours, a developer intercepted the model ID in Cursor's API responses: kimi-k2p5-rl-0317-s515-fast. The base was Kimi K2.5, an open-weight model from Moonshot AI in Beijing.
The pricing drew scrutiny just as fast. Composer 2 charges $1.50/$7.50 per million tokens in its default "fast" mode; Claude Sonnet costs $3/$15. The gap was real, but narrower than it first appeared.
Composer 2 is cheaper than Claude — but it's not the cheapest option. The full landscape tells a more nuanced story.
Composer 2's standard tier ($0.50/$2.50) is not the default — the "fast" variant at $1.50/$7.50 is. At that price point, it's 2x cheaper than Sonnet, not 10x. And models like DeepSeek V3.2 ($0.28/$0.42) are already cheaper than Composer 2 with comparable coding scores.
The more relevant question isn't "is Composer 2 cheaper than Claude?" — it's "what else can you get at the same price point, and how does it compare?"
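To make that question concrete, here is the per-request arithmetic across the pricing tiers quoted above. The token counts (30K in / 3K out) are illustrative assumptions meant to resemble an agentic IDE request, not Cursor usage data:

```python
# $ per million tokens (input, output), from the prices cited in the article
PRICES = {
    "Composer 2 fast (default)": (1.50, 7.50),
    "Composer 2 standard":       (0.50, 2.50),
    "Claude Sonnet":             (3.00, 15.00),
    "DeepSeek V3.2":             (0.28, 0.42),
}

def request_cost(model, input_tokens, output_tokens):
    """Dollar cost of one request at a given model's per-token pricing."""
    p_in, p_out = PRICES[model]
    return (input_tokens * p_in + output_tokens * p_out) / 1e6

for model in PRICES:
    print(f"{model:28s} ${request_cost(model, 30_000, 3_000):.4f} per request")
```

At these (hypothetical) token counts, Composer 2 fast is exactly half of Sonnet's cost, the standard tier is 6× cheaper, and DeepSeek V3.2 undercuts them all.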
And yet, on coding benchmarks — the ones that actually predict IDE agent performance — Composer 2 trades blows with the expensive models. But first, a guide to which benchmarks actually matter:
| Benchmark | Independence | Contamination risk | Task realism | Importance |
|---|---|---|---|---|
| SWE-bench Verified | High | HIGH — OpenAI retired it Feb 2026 citing contamination | Low (median 4 lines) | Declining |
| SWE-bench Pro | High (Scale AI) | Low | High (107 lines, 4.1 files) | Rising — recommended replacement |
| SWE-bench Multilingual | High | Medium | Medium | Medium |
| Terminal-Bench 2.0 | High | Low (refreshed) | High (agentic) | High |
| LiveCodeBench | High | Low (fresh problems) | Medium | High |
| Next.js Evals | High (OSS, Vercel) | Low | High (framework migration) | Medium — narrow but practical |
| CursorBench | None (proprietary) | Low (internal codebase) | Very high (352 lines, 8 files, real user sessions) | High methodology, low verifiability |
CursorBench is arguably the best-designed benchmark on this list — real user sessions, agentic graders, refreshed quarterly, cross-validated against live traffic. The flaw is that only Cursor can run it.
Cursor explicitly chose not to report SWE-bench Verified, citing contamination — a position OpenAI now shares. But their absence from SWE-bench Pro (built by Scale AI specifically to fix Verified's problems) is harder to explain.
A quick guide to what each benchmark measures:
| Benchmark | What it tests | Link |
|---|---|---|
| SWE-bench Verified | Resolving real GitHub issues from 12 Python repos. 500 human-verified problems. | swebench.com |
| SWE-bench Multilingual | Same task, extended to JavaScript, TypeScript, Java, Go, Rust, C++. | swebench.com |
| Terminal-Bench 2.0 | Agentic terminal tasks — shell commands, file ops, system admin. Tests tool use. | tbench.ai |
| LiveCodeBench v6 | Fresh competitive programming problems. Tests raw code generation, not tooling. | livecodebench.github.io |
| Aider Polyglot | Multi-language code editing via chat. Tests edit-apply loop used by IDE agents. | aider.chat |
Composer 2 beats Claude Opus 4.6 on Terminal-Bench 2.0 (61.7 vs 58.0). It loses on SWE-bench Multilingual (73.7 vs 77.8). The raw K2.5 base — before Cursor's fine-tuning — leads on LiveCodeBench v6 (85.0 vs 82.2).
Cursor also publishes scores on CursorBench, their proprietary internal evaluation suite. It uses real user sessions sourced via "Cursor Blame" (tracing committed code to agent requests), with tasks averaging 352 lines across 8 files — substantially larger than SWE-bench tasks.
| Model | CursorBench | Terminal-Bench 2.0 | SWE-bench ML |
|---|---|---|---|
| Composer 2 | 61.3 | 61.7 | 73.7 |
| Claude Opus 4.6 | 58.2 | 58.0 | 77.8 |
| Composer 1.5 | 44.2 | 47.9 | 65.9 |
| Composer 1 | 38.0 | 40.0 | 56.9 |
Grading uses agentic judges and is cross-validated against live traffic metrics. The methodology is arguably best-in-class for measuring whether a model actually helps developers in an IDE; the problem is purely verifiability, since only Cursor can run it.
Source: cursor.com/blog/cursorbench.
CursorBench is not public. A Cursor team member confirmed: "Unfortunately not, as we used our own internal code for the benchmark." Different evaluation harnesses were used per model, so cross-model comparisons are not apples-to-apples.
Academic research warns that proprietary benchmarks "shift epistemic authority to the curator."
Vercel's Next.js Evals (OSS on GitHub) test framework migration tasks — a practical, real-world workload. The results reveal how much the agent wrapper contributes on top of the raw model:
| Model | Agent | Baseline | With AGENTS.md |
|---|---|---|---|
| GPT 5.3 Codex | Codex | 86% | 100% |
| GPT 5.4 | Codex | 86% | 95% |
| Composer 2 | Cursor | 76% | 95% |
| Gemini 3.1 Pro | Gemini CLI | 76% | 100% |
| Claude Opus 4.6 | Claude Code | 71% | 100% |
| Claude Sonnet 4.6 | Claude Code | 67% | 100% |
| Kimi K2.5 | OpenCode | 19% | 52% |
Source: nextjs.org/evals, github.com/vercel/next-evals-oss
K2.5 scores 19% baseline; Composer 2 scores 76%. That's +57 points from Cursor's RL — far larger than the +0.7 on SWE-bench ML. But with documentation (AGENTS.md), Claude reaches 100% while Composer 2 reaches 95%. The agent harness and context retrieval may matter as much as the model.
For investors: For boilerplate CRUD work, a 2x cost reduction at 95% quality is compelling. For architecting payment systems or safety-critical code, developers consistently choose the best model regardless of price. The build-vs-buy decision is task-dependent, not universal.
Cursor didn't just wrap Kimi K2.5. They invested 4× the base model's compute in continued pre-training and RL.
Aman Sanger's tweet disclosed the key details: Cursor evaluated multiple base models on perplexity, chose K2.5, then applied "continued pre-training and high-compute RL (a 4× scale-up)."
The jump from K2 to K2.5 is not incremental. K2.5 is a full re-pretrain on 15 trillion mixed visual and text tokens (K2 tech report; K2.5 tech report). Cross-modal transfer from visual training boosted text-only coding benchmarks (MMLU-Pro 84.7 to 86.4). The 128K to 262K context expansion enables whole-codebase understanding.
What does "4× scale-up" mean in dollar terms? The phrasing is ambiguous. Here are both plausible readings:
Two readings of "4× scale-up"
Reading A — "4× on top": Cursor's CPT+RL = 4× the base cost. Total = $8.8M + $35.2M = ~$44M. Base is ~20% of total.
Reading B — "4× total": Total compute = 4× the base. Total = 4 × $8.8M = ~$35M. Base is ~25% of total. Matches "about a quarter" press reports better.
We present both; the press reporting of "about a quarter" for the base favors Reading B.
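The arithmetic behind the two readings, using the ~$8.8M base-cost estimate derived in the next section:

```python
base = 8.8  # $M, estimated K2.5 pre-training cost (derived below)

# Reading A: Cursor's CPT+RL compute = 4x the base, stacked on top of it
total_a = base + 4 * base      # ~$44M total
share_a = base / total_a       # base is ~20% of total

# Reading B: total compute = 4x the base (Cursor's share is 3x)
total_b = 4 * base             # ~$35M total
share_b = base / total_b       # base is 25% of total

print(f"Reading A: total ~${total_a:.0f}M, base share {share_a:.0%}")
print(f"Reading B: total ~${total_b:.0f}M, base share {share_b:.0%}")
```

Reading B's 25% base share is what lines up with the "about a quarter" press reports.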
How we estimate training cost
Step 1 — Count the FLOPs. We extract architecture details directly from the K2.5 tech report: 32B active parameters, 15T training tokens, activation checkpointing enabled. Using the Epoch AI operation-counting approach with C = 8 × N_active × D, we get 3.84×10²⁴ FLOPs. The 8× multiplier (vs the standard 6×) accounts for activation-checkpointing recomputation overhead.
Step 2 — Convert to GPU-hours. Dividing by the H800's effective throughput (693 TFLOP/s peak BF16 × 35% MFU) gives ~4.4M GPU-hours. MFU range for large MoE models on H800: 30–45%. DeepSeek V3 achieved ~40%.
Step 3 — Price it. At $2.00/GPU-hour — the same rate DeepSeek V3 used in their $5.576M cost disclosure — we get ~$8.8M for K2.5 pre-training. This rate represents amortized owned hardware, not cloud rental. (Tom Goldstein questions whether $2/hr is realistic.)
Cross-check: Applying this method to DeepSeek V3 yields ~$6.8M vs their self-reported $5.576M — within 22%. Sensitivity range at 25–45% MFU: $6.8M – $12.3M.
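The three steps reduce to a one-line formula. This sketch reproduces the estimate and the MFU sensitivity range using only the assumptions stated above (8×N×D operation counting, 693 TFLOP/s peak, $2.00/GPU-hour):

```python
def training_cost_musd(n_active, tokens, flops_per_param_token=8,
                       peak_tflops=693, mfu=0.35, usd_per_gpu_hr=2.00):
    """Final-run training cost estimate in $M."""
    flops = flops_per_param_token * n_active * tokens        # Step 1: count FLOPs
    gpu_hours = flops / (peak_tflops * 1e12 * mfu) / 3600    # Step 2: GPU-hours
    return gpu_hours * usd_per_gpu_hr / 1e6                  # Step 3: price it

# K2.5: 32B active parameters, 15T training tokens
print(f"K2.5 estimate: ~${training_cost_musd(32e9, 15e12):.1f}M")

# Sensitivity at the 25-45% MFU bounds quoted above
hi = training_cost_musd(32e9, 15e12, mfu=0.25)
lo = training_cost_musd(32e9, 15e12, mfu=0.45)
print(f"MFU sensitivity range: ${lo:.1f}M - ${hi:.1f}M")
```

At 35% MFU this returns ~$8.8M; the 25–45% MFU bounds reproduce the $6.8M–$12.3M sensitivity range above.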
| Stage | Compute | Est. cost | How derived | Who paid |
|---|---|---|---|---|
| K2.5 pre-training | 3.84×10²⁴ FLOPs | ~$8.8M | Operation counting from tech report: 8 × 32B × 15T | Moonshot AI |
| Cursor CPT + RL (4× scale-up) | ~1.0–1.5×10²⁵ FLOPs | ~$26–35M | Inferred from Sanger's "4×" claim × $8.8M base (see two readings above) | Cursor via Fireworks AI |
| Total Composer 2 | ~1.4–1.9×10²⁵ FLOPs | ~$35–44M | Base + 4× scale-up (Reading B: ~$35M; Reading A: ~$44M) | Moonshot ~20–25% / Cursor ~75–80% |
| Model | Reported / Est. cost | Source |
|---|---|---|
| DeepSeek V3 | $5.6M (reported) | arXiv 2412.19437 |
| Composer 2 (total) | ~$35–44M (est.) | Our estimate |
| Llama 3.1 405B | ~$53M (est.) | Our estimate |
| GPT-4.5 | ~$340M (est.) | Our estimate |
| Grok-4 | ~$388M (est.) | Our estimate |
Frontier model cost estimates are order-of-magnitude approximations based on publicly available architecture details and training compute. DeepSeek V3 is self-reported.
These are final-run costs only. They exclude research experiments, data curation, failed runs, and engineering staff. Moonshot's CEO has stated: "It is hard to quantify the training cost because a major part is research and experiments." Sanger notes Cursor has trained about 50 models — total R&D spend is likely multiples higher.
The base is substitutable in theory — but switching means re-running the $26-35M RL pipeline on a new foundation, with no guarantee the recipe transfers. Interchangeability is real but expensive.
With confirmed benchmarks for all three stages — K2, K2.5, and Composer 2 — we can measure what each step contributed. The capability attribution table appears in the Moonshot-vs-Cursor comparison below.
At $2B+ ARR, even a partial model switch reshapes Cursor's economics — but the savings are smaller than they first appear.
Cursor reportedly surpassed $2B in ARR in early 2026. If inference costs consume 50% of revenue — a common ratio for AI-native products — that's $1B/year on model serving.
Using the fast pricing ($1.50/$7.50) — what users actually pay by default — the savings are real but more modest than the "standard" tier suggests:
The scenario that matters: At 80% Composer 2 fast traffic, gross margin improves from 50% to ~70% — a ~$400M annual improvement. Meaningful, but not the $622M figure you'd get using the standard tier price that most users don't pay.
We don't know Cursor's actual cost structure. These estimates assume 50% of $2B ARR goes to inference — a common ratio for AI-native products but unverified for Cursor specifically. Fireworks' margin (estimated 30-50% gross) is also embedded in Composer 2's pricing.
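The scenario arithmetic, with every input an assumption rather than a disclosed figure (50% of ARR on inference, Composer 2 fast at half Sonnet's per-token price):

```python
arr = 2_000e6              # $2B ARR (reported)
inference = 0.50 * arr     # assumed $1B/yr on model serving
shift = 0.80               # assumed share of traffic moved to Composer 2 fast
relative_cost = 0.5        # fast tier ($1.50/$7.50) vs Sonnet ($3/$15)

savings = inference * shift * (1 - relative_cost)
new_margin = (arr - (inference - savings)) / arr
print(f"Savings: ${savings/1e6:.0f}M/yr, gross margin -> {new_margin:.0%}")
```

Under these assumptions the shift saves ~$400M/yr and lifts gross margin from 50% to 70%, matching the scenario above.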
The margin stack
Each layer takes margin: Moonshot trained K2.5 (~$8.8M). Fireworks serves it (estimated 30-50% gross margin on hosting). Cursor sells to developers ($1.50/$7.50 fast, marked up from Fireworks' wholesale rate). The end-user price reflects three companies' economics, not one.
MoE models are cheap to run per token — but expensive to load into memory.
Kimi K2.5 activates only 32 billion of its 1.04 trillion parameters per token, routing each token to 8 of 384 experts. That 48× expert sparsity makes each token cheap to compute. But every one of those 1.04 trillion parameters must sit in GPU memory, ready for the router to select.
Imagine a hospital with 384 specialist doctors, but each patient only sees 8. You still need all 384 on staff — their salaries (GPU memory) are fixed — even though each consultation (token) only involves a fraction of them. The per-patient cost is low; the facility cost is high.
On H100s, you need 16–32 GPUs just to fit the model — and throughput is single-digit tokens per second. On next-gen Blackwell B300s, it's viable (1,876 tok/s at 64 users). But Hopper-era self-hosting is a losing proposition.
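A back-of-envelope sketch of the weight-memory floor behind those GPU counts. This covers weights only; KV cache, activations, and runtime overhead push real deployments higher, which is roughly where the 16–32 range lands:

```python
import math

TOTAL_PARAMS = 1.04e12   # K2.5 total parameters, all resident in memory
H100_MEM_GB = 80         # per-GPU HBM

for fmt, bytes_per_param in [("FP8", 1), ("BF16", 2)]:
    weights_gb = TOTAL_PARAMS * bytes_per_param / 1e9
    gpus = math.ceil(weights_gb / H100_MEM_GB)
    print(f"{fmt}: {weights_gb:,.0f} GB of weights -> at least {gpus} H100s")
```

FP8 weights alone need 13 H100s; BF16 needs 26 — before any serving overhead.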
This is why Cursor went through Fireworks AI rather than self-hosting. Fireworks amortizes the GPU fleet across customers, making the economics work at a price point ($0.90–1.50/M tokens) that neither Cursor nor Moonshot could match alone.
Moonshot built the model. Cursor built the product. The gap in value capture is striking.
The Composer 2 story is a case study in open-source value dynamics. Moonshot invested ~$8.8M to pre-train K2.5 and open-sourced it. Cursor invested ~$26-35M in RL on top, and is generating $2B+ in ARR. Compare the two companies side by side:
| | Moonshot AI (built Kimi K2.5) | Cursor (built Composer 2 on top) |
|---|---|---|
| Valuation | ~$18B (target, Series D) | $29.3B |
| ARR / Revenue | ~$500M (est.); 20 days post-K2.5 > all of 2025 | $2B+; doubling every 2–3 months |
| K2.5 investment | ~$8.8M (pre-trained the base model) | ~$26–35M (CPT + RL on top of K2.5) |
| Total raised | ~$2.5B | ~$2.5B |
| Team | ~300 (80 core tech) | ~100 |
| Founded | 2023, Beijing | 2022, San Francisco |
| Status | Private | Private |
Sources: Moonshot — 36Kr, KR-Asia, TechCrunch · Cursor — TechCrunch, Stripe, DevGraphiq. Moonshot revenue is approximate. Cursor headcount estimated.
At first glance, this looks like Cursor is winning. But look at who actually moves the needle on capabilities:
| Benchmark | Moonshot: K2 → K2.5 ($8.8M) | Cursor: K2.5 → Composer 2 ($26–35M) |
|---|---|---|
| Terminal-Bench 2.0 | +23.0 pts | +10.9 pts |
| SWE-bench Multilingual | +25.7 pts | +0.7 pts |
| LiveCodeBench v6 | +31.3 pts | — (no data) |
| Avg gain per benchmark | +26.7 pts | +5.8 pts |
| Cost per point gained | $330K / pt | $4.5–6.0M / pt |
Moonshot is 14–18× more efficient at producing benchmark gains. For $8.8M, they moved every benchmark by 23–31 points. Cursor spent $26-35M and moved them by 0.7–10.9 points. This is partly diminishing returns (it's harder to go from 73 to 74 than from 47 to 73), but it also reflects the fundamental asymmetry: base model training is the hard, underpaid work.
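The efficiency figures above reduce to simple arithmetic over the per-benchmark deltas (LiveCodeBench has no Composer 2 data, so Cursor's average uses two benchmarks):

```python
moonshot_gains = [23.0, 25.7, 31.3]   # K2 -> K2.5, three benchmarks
cursor_gains = [10.9, 0.7]            # K2.5 -> Composer 2, two benchmarks

moonshot_cpp = 8.8 / (sum(moonshot_gains) / len(moonshot_gains))   # $M per point
cursor_avg = sum(cursor_gains) / len(cursor_gains)
cursor_cpp_lo, cursor_cpp_hi = 26 / cursor_avg, 35 / cursor_avg

print(f"Moonshot: ${moonshot_cpp * 1000:.0f}K per benchmark point")
print(f"Cursor:   ${cursor_cpp_lo:.1f}M - ${cursor_cpp_hi:.1f}M per point")
print(f"Efficiency ratio: {cursor_cpp_lo / moonshot_cpp:.0f}-{cursor_cpp_hi / moonshot_cpp:.0f}x")
```

This recovers the $330K vs $4.5–6.0M per point figures and the 14–18× efficiency ratio.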
The open-source value paradox: Moonshot does the heavy lifting — 14–18× more efficient at improving capabilities. But Cursor generates 4× the revenue. The market values the last mile (product, distribution, UX) over the first mile (model, research, training). This is the same dynamic that made AWS bigger than the Linux Foundation, and it's why open-source AI labs face a structural monetization challenge.
This maps directly to Yann LeCun's cake metaphor: pre-training is the bulk of the cake, supervised fine-tuning is the icing, and RL is the cherry on top. Nathan Lambert (Interconnects) adds context: "Post-training got more popular because there was more low-hanging fruit. A lot of that potential has been realized." The diminishing returns are showing up in Cursor's numbers — $4.5-6.0M per benchmark point vs Moonshot's $330K.
This doesn't mean Moonshot is losing. K2.5's open release triggered an overseas revenue explosion — 20 days of post-K2.5 revenue exceeded all of 2025. Open-sourcing is Moonshot's distribution strategy, not charity. And the value capture gap may narrow: if the base model is where the capability lives, then Moonshot controls the thing that matters. Cursor's RL is the cherry on top — and cherries are replaceable.
The model layer is becoming a commodity. The question is who captures the value that used to sit there.
| Who | Implication | Signal |
|---|---|---|
| Cursor / AI-native apps | Margin expansion, reduced vendor lock-in, model optionality | Positive |
| Inference providers (Fireworks, Together) | Growing demand as apps shift from API to hosted open-weight | Positive |
| Anthropic / OpenAI API revenue | Revenue concentration risk if top customers can switch at will | Watch |
| Open-weight labs (Moonshot, DeepSeek) | Ecosystem adoption, but limited direct monetization | Mixed |
The Cursor switch is not an isolated event. It's the first high-profile instance of a pattern that will repeat: AI-native companies evaluating open-weight bases, applying proprietary fine-tuning, and serving through specialized inference providers — cutting out the frontier lab's API entirely.
This framework extends beyond coding. Any company with (a) >$10M/year in API inference spend, (b) a proprietary data moat, and (c) access to ML talent could evaluate the same build-over-buy calculus. Legal AI firms fine-tuning on case law, medical AI companies training on clinical notes, financial AI platforms with proprietary trading data — all face the same question Cursor answered: is the frontier API model worth 2-5x the cost of a fine-tuned open-weight alternative?
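One way to frame that calculus is a payback-period sketch. Every number below is hypothetical, chosen only to illustrate the shape of the decision, not drawn from any company's disclosed figures:

```python
def payback_months(train_cost_musd, annual_api_spend_musd, cost_ratio):
    """Months until inference savings repay a one-off training investment.

    cost_ratio: fine-tuned serving cost as a fraction of the API price.
    """
    monthly_savings = annual_api_spend_musd * (1 - cost_ratio) / 12
    return train_cost_musd / monthly_savings

# Hypothetical: $30M training run, $120M/yr API spend,
# fine-tuned model served at half the API price
print(f"Payback: {payback_months(30, 120, 0.5):.0f} months")
```

At these illustrative inputs the training run pays for itself in six months; below roughly $10M/year of API spend, the payback stretches past the useful life of the model.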
The question for Anthropic and OpenAI is not whether their models are better. On most benchmarks, they still are. The question is whether "better" justifies 2-5× the price — and for how much longer.
Most of this story relies on claims from interested parties. Here's what third-party data actually shows.
| Signal | Data | Source | Date |
|---|---|---|---|
| Developer adoption | Cursor at 18% usage (vs Copilot ~42%) | Stack Overflow 2025, JetBrains 2025 | Survey 2025 |
| Revenue trajectory | $100M ARR → $500M → $1B → $2B+ | Stripe case study, TechCrunch | Jan '25, May '25, Nov '25, Feb '26 |
| Code volume | ~1B lines of accepted code/day | Aman Sanger (X) | 2025 |
| Enterprise signal | Salesforce: 20K engineers, >90% usage rate | Pragmatic Engineer | 2026 |
| Fireworks K2.5 pricing | $0.60/M input, $3.00/M output (serverless) | Fireworks AI | Mar 2026 |
| Composer 2 pricing | $1.50/M input, $7.50/M output (fast/default); $0.50/$2.50 (standard) | Cursor docs | Mar 2026 |
| Web traffic | cursor.com: #14 in AI tools, #3,004 globally | SimilarWeb | Oct 2025 |
Key gap: There is no public method to independently measure what percentage of Cursor requests go to Composer 2 vs Claude vs GPT. Cursor does not publish per-model usage breakdowns. The best proxies are developer surveys and community sentiment — which suggest developers use Composer 2 as the "default fast implementer" and switch to Claude/GPT for complex architectural work.