AI Market Intelligence · March 2026

Kimi, Composer, and the Build vs Buy Dilemma

On March 19, Cursor shipped its new coding model on top of a Chinese open-weight base. The pricing gap — and what it says about the API business model — is the real story.

2× – Cheaper than Claude Sonnet 4.6 at the default "fast" tier
$8.8M – Est. cost to pre-train Kimi K2.5, the open-weight base model
4× – Scale-up in compute Cursor applied on top of K2.5

What happened

A $29 billion company built its flagship model on a Chinese open-weight base — then forgot to mention it.

Cursor launched Composer 2 on March 19, 2026, to its 1M+ daily active users. The blog post credited "continued pre-training of a base model, combined with reinforcement learning." It did not name the base model.

Within 24 hours, a developer intercepted the model ID in Cursor's API responses: kimi-k2p5-rl-0317-s515-fast. The base was Kimi K2.5, an open-weight model from Moonshot AI in Beijing.

Within 72 hours, attention had shifted from the model's identity to its pricing. Composer 2 charges $1.50/$7.50 per million tokens in its default "fast" mode; Claude Sonnet 4.6 costs $3/$15. The gap was real, but narrower than it first appeared.


The competitive pricing landscape

Composer 2 is cheaper than Claude — but it's not the cheapest option. The full landscape tells a more nuanced story.

API pricing across the coding model landscape
Per million tokens, grouped by input & output. Sorted by input price. Log scale. Composer 2 / Kimi in blue.

Composer 2's standard tier ($0.50/$2.50) is not the default — the "fast" variant at $1.50/$7.50 is. At that price point, it's 2× cheaper than Sonnet, not the 6× the standard tier would imply. And models like DeepSeek V3.2 ($0.28/$0.42) are already cheaper than Composer 2, with comparable coding scores.

The more relevant question isn't "is Composer 2 cheaper than Claude?" — it's "what else can you get at the same price point, and how does it compare?"
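The per-request arithmetic behind these comparisons is easy to sketch. A minimal, illustrative cost calculator using the prices quoted above; the 40K-input / 4K-output request shape is a hypothetical agent turn, not a measured workload:

```python
# Illustrative per-request cost comparison at the quoted list prices.
# Prices are (input, output) dollars per million tokens.
PRICES = {
    "Composer 2 (fast)":     (1.50, 7.50),
    "Composer 2 (standard)": (0.50, 2.50),
    "Claude Sonnet 4.6":     (3.00, 15.00),
    "DeepSeek V3.2":         (0.28, 0.42),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the listed per-million-token rates."""
    p_in, p_out = PRICES[model]
    return (input_tokens * p_in + output_tokens * p_out) / 1_000_000

# Hypothetical agent turn: 40K tokens of context in, 4K tokens of code out.
for model in PRICES:
    print(f"{model:24s} ${request_cost(model, 40_000, 4_000):.4f}")
```

Because Sonnet's rates are exactly double Composer 2's fast tier on both input and output, the 2× ratio holds for any token mix; DeepSeek V3.2 undercuts even the standard tier at every mix.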

And yet, on coding benchmarks — the ones that actually predict IDE agent performance — Composer 2 trades blows with the expensive models. But first, a guide to which benchmarks actually matter:

Benchmark | Independence | Contamination risk | Task realism | Importance
SWE-bench Verified | High | HIGH — OpenAI retired it Feb 2026 citing contamination | Low (median 4 lines) | Declining
SWE-bench Pro | High (Scale AI) | Low | High (107 lines, 4.1 files) | Rising — recommended replacement
SWE-bench Multilingual | High | Medium | Medium | Medium
Terminal-Bench 2.0 | High | Low (refreshed) | High (agentic) | High
LiveCodeBench | High | Low (fresh problems) | Medium | High
Next.js Evals | High (OSS, Vercel) | Low | High (framework migration) | Medium — narrow but practical
CursorBench | None (proprietary) | Low (internal codebase) | Very high (352 lines, 8 files, real user sessions) | High methodology, low verifiability

CursorBench is arguably the best-designed benchmark on this list — real user sessions, agentic graders, refreshed quarterly, cross-validated against live traffic. The flaw is that only Cursor can run it.

Cursor explicitly chose not to report SWE-bench Verified, citing contamination — a position OpenAI now shares. But their absence from SWE-bench Pro (built by Scale AI specifically to fix Verified's problems) is harder to explain.

A quick guide to what each benchmark measures:

Benchmark | What it tests | Link
SWE-bench Verified | Resolving real GitHub issues from 12 Python repos. 500 human-verified problems. | swebench.com
SWE-bench Multilingual | Same task, extended to JavaScript, TypeScript, Java, Go, Rust, C++. | swebench.com
Terminal-Bench 2.0 | Agentic terminal tasks — shell commands, file ops, system admin. Tests tool use. | tbench.ai
LiveCodeBench v6 | Fresh competitive programming problems. Tests raw code generation, not tooling. | livecodebench.github.io
Aider Polyglot | Multi-language code editing via chat. Tests the edit-apply loop used by IDE agents. | aider.chat

Competitive on coding benchmarks, despite the price gap
Higher is better. Sorted by score within each panel. Composer 2 and Kimi models in blue.
K2.5: HuggingFace, arXiv 2602.02276 · Composer 2: VentureBeat · SWE-bench: swebench.com · Terminal-Bench: tbench.ai · LiveCodeBench: livecodebench.github.io
Caveat: K2.5 benchmarks are self-reported from model card/paper. Not yet on public SWE-bench or Aider leaderboards.

Composer 2 beats Claude Opus 4.6 on Terminal-Bench 2.0 (61.7 vs 58.0). It loses on SWE-bench Multilingual (73.7 vs 77.8). The raw K2.5 base — before Cursor's fine-tuning — leads on LiveCodeBench v6 (85.0 vs 82.2).

Cursor also publishes scores on CursorBench, their proprietary internal evaluation suite. It uses real user sessions sourced via "Cursor Blame" (tracing committed code to agent requests), with tasks averaging 352 lines across 8 files — substantially larger than SWE-bench tasks.

Model | CursorBench | Terminal-Bench 2.0 | SWE-bench ML
Composer 2 | 61.3 | 61.7 | 73.7
Claude Opus 4.6 | 58.2 | 58.0 | 77.8
Composer 1.5 | 44.2 | 47.9 | 65.9
Composer 1 | 38.0 | 40.0 | 56.9

CursorBench grading uses agentic judges and is cross-validated against live traffic metrics. The methodology is arguably best-in-class for measuring whether a model actually helps developers in an IDE; the problem is purely verifiability — only Cursor can run it.
Source: cursor.com/blog/cursorbench. CursorBench is not public. A Cursor team member confirmed: "Unfortunately not, as we used our own internal code for the benchmark." Different evaluation harnesses were used per model, so cross-model comparisons are not apples-to-apples. Academic research warns that proprietary benchmarks "shift epistemic authority to the curator."

Next.js Evals: where the agent harness matters

Vercel's Next.js Evals (OSS on GitHub) test framework migration tasks — a practical, real-world workload. The results reveal how much the agent wrapper contributes on top of the raw model:

Model | Agent | Baseline | With AGENTS.md
GPT 5.3 Codex | Codex | 86% | 100%
GPT 5.4 | Codex | 86% | 95%
Composer 2 | Cursor | 76% | 95%
Gemini 3.1 Pro | Gemini CLI | 76% | 100%
Claude Opus 4.6 | Claude Code | 71% | 100%
Claude Sonnet 4.6 | Claude Code | 67% | 100%
Kimi K2.5 | OpenCode | 19% | 52%

Source: nextjs.org/evals, github.com/vercel/next-evals-oss

K2.5 scores 19% baseline; Composer 2 scores 76%. That's +57 points from Cursor's RL — far larger than the +0.7 on SWE-bench ML. But with documentation (AGENTS.md), Claude reaches 100% while Composer 2 reaches 95%. The agent harness and context retrieval may matter as much as the model.

For investors: For boilerplate CRUD work, a 2x cost reduction at 95% quality is compelling. For architecting payment systems or safety-critical code, developers consistently choose the best model regardless of price. The build-vs-buy decision is task-dependent, not universal.


Three steps from open-weight base to production agent

Cursor didn't just wrap Kimi K2.5. They invested 4× the base model's compute in continued pre-training and RL.

Aman Sanger's tweet disclosed the key details: Cursor evaluated multiple base models on perplexity, chose K2.5, then applied "continued pre-training and high-compute RL (a 4× scale-up)."

The jump from K2 to K2.5 is not incremental. K2.5 is a full re-pretrain on 15 trillion mixed visual and text tokens (K2 tech report; K2.5 tech report). Cross-modal transfer from visual training boosted text-only coding benchmarks (MMLU-Pro 84.7 to 86.4). The 128K to 262K context expansion enables whole-codebase understanding.

What does "4× scale-up" mean in dollar terms? The phrasing is ambiguous. Here are both plausible readings:

Two readings of "4× scale-up"

Reading A — "4× on top": Cursor's CPT+RL = 4× the base cost. Total = $8.8M + $35.2M = ~$44M. Base is ~20% of total.

Reading B — "4× total": Total compute = 4× the base. Total = 4 × $8.8M = ~$35M. Base is ~25% of total. Matches "about a quarter" press reports better.

We present both; the press reporting of "about a quarter" for the base favors Reading B.
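The arithmetic behind the two readings is simple enough to write down. This sketch just restates the article's numbers, taking the $8.8M base estimate as input:

```python
BASE = 8.8  # $M, estimated cost of K2.5 pre-training

# Reading A ("4x on top"): Cursor's CPT+RL alone costs 4x the base.
total_a = BASE + 4 * BASE   # 44.0 -> base is 20% of total
# Reading B ("4x total"): all compute combined is 4x the base.
total_b = 4 * BASE          # 35.2 -> base is 25% of total

share_a = BASE / total_a    # 0.20
share_b = BASE / total_b    # 0.25
print(f"Reading A: total ~${total_a:.0f}M, base share {share_a:.0%}")
print(f"Reading B: total ~${total_b:.0f}M, base share {share_b:.0%}")
```

Only Reading B yields a base share near the "about a quarter" figure in press reports.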

How we estimate training cost

Step 1 — Count the FLOPs. We extract architecture details directly from the K2.5 tech report: 32B active parameters, 15T training tokens, activation checkpointing enabled. Using the standard Epoch AI operation-counting formula C = 8 × N_active × D, we get 3.84×10^24 FLOPs. The 8× multiplier (vs the standard 6×) accounts for activation checkpointing overhead.

Step 2 — Convert to GPU-hours. Dividing by the H800's effective throughput (693 TFLOP/s peak BF16 × 35% MFU) gives ~4.4M GPU-hours. MFU range for large MoE models on H800: 30–45%. DeepSeek V3 achieved ~40%.

Step 3 — Price it. At $2.00/GPU-hour — the same rate DeepSeek V3 used in their $5.576M cost disclosure — we get ~$8.8M for K2.5 pre-training. This rate represents amortized owned hardware, not cloud rental. (Tom Goldstein questions whether $2/hr is realistic.)

Cross-check: Applying this method to DeepSeek V3 yields ~$6.8M vs their self-reported $5.576M — within 22%. Sensitivity range at 25–45% MFU: $6.8M – $12.3M.
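The three steps compose into a few lines of arithmetic. The constants are the ones stated above; the 8× multiplier, the 35% MFU, and the $2/GPU-hour rate are the assumptions the estimate hinges on:

```python
# Step 1: count FLOPs with the operation-counting formula C = 8 * N_active * D.
# (8x rather than the standard 6x to account for activation checkpointing.)
N_ACTIVE = 32e9        # active parameters per token (K2.5 tech report)
TOKENS   = 15e12       # training tokens
flops = 8 * N_ACTIVE * TOKENS                      # ~3.84e24 FLOPs

# Step 2: convert to GPU-hours on H800s.
PEAK_BF16 = 693e12     # H800 peak BF16 FLOP/s
MFU       = 0.35       # assumed model FLOPs utilization (plausible range 30-45%)
gpu_hours = flops / (PEAK_BF16 * MFU) / 3600       # ~4.4M GPU-hours

# Step 3: price at an amortized owned-hardware rate.
RATE = 2.00            # $/GPU-hour (same rate as DeepSeek V3's disclosure)
cost_usd = gpu_hours * RATE                        # ~$8.8M

print(f"{flops:.2e} FLOPs, {gpu_hours/1e6:.1f}M GPU-hours, ~${cost_usd/1e6:.1f}M")
```

Re-running Step 2 with MFU at 0.45 and 0.25 reproduces the $6.8M–$12.3M sensitivity range quoted above.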

Stage | Compute | Est. cost | How derived | Who paid
K2.5 pre-training | 3.84×10^24 FLOPs | ~$8.8M | Operation counting from tech report: 8 × 32B × 15T | Moonshot AI
Cursor CPT + RL (4× scale-up) | ~1.0–1.5×10^25 FLOPs | ~$26–35M | Inferred from Sanger's "4×" claim × $8.8M base (see two readings above) | Cursor via Fireworks AI
Total Composer 2 | ~1.4–1.9×10^25 FLOPs | ~$35–44M | Base + 4× scale-up (Reading B: ~$35M; Reading A: ~$44M) | Moonshot ~20–25% / Cursor ~75–80%

Model | Reported / Est. cost | Source
DeepSeek V3 | $5.6M (reported) | arXiv 2412.19437
Composer 2 (total) | ~$35–44M (est.) | Our estimate
Llama 3.1 405B | ~$53M (est.) | Our estimate
GPT-4.5 | ~$340M (est.) | Our estimate
Grok-4 | ~$388M (est.) | Our estimate

Frontier model cost estimates are order-of-magnitude approximations based on publicly available architecture details and training compute. DeepSeek V3 is self-reported.

These are final-run costs only. They exclude research experiments, data curation, failed runs, and engineering staff. Moonshot's CEO has stated: "It is hard to quantify the training cost because a major part is research and experiments." Sanger notes Cursor has trained about 50 models — total R&D spend is likely multiples higher.

The base is substitutable in theory — but switching means re-running the ~$26–35M RL pipeline on a new foundation, with no guarantee the recipe transfers. Interchangeability is real but expensive.

With confirmed benchmarks for all three stages — K2, K2.5, and Composer 2 — we can measure exactly what each step contributed:

Each step in the pipeline adds measurable performance
Confirmed scores only. K2 → K2.5 (Moonshot's multimodal retrain) → Composer 2 (Cursor's RL). Opus 4.6 for reference.
K2 Instruct: HuggingFace model card, arXiv 2507.20534, tbench.ai leaderboard · K2.5: arXiv 2602.02276 · Composer 2: Cursor, VentureBeat
All scores confirmed from model cards, tech reports, or public leaderboards. K2 Terminal-Bench 2.0 score (27.8%) from tbench.ai (Terminus 2 agent). SWE-bench Multilingual K2 score (47.3%) from K2 tech report (agentic mode).

The margin math

At $2B+ ARR, even a partial model switch reshapes Cursor's economics — but the savings are smaller than they first appear.

Cursor reportedly surpassed $2B in ARR in early 2026. If inference costs consume 50% of revenue — a common ratio for AI-native products — that's $1B/year on model serving.

Using the fast pricing ($1.50/$7.50) — what users actually pay by default — the savings are real but more modest than the "standard" tier suggests:

How model mix affects profitability
Estimated gross margin at $2B ARR under three scenarios. Dashed line = 70% SaaS benchmark. Uses fast-tier pricing.
Cursor ARR: TechCrunch, SaaStr · Pricing: Cursor, Anthropic · SaaS benchmark: Bessemer
All figures illustrative. Actual cost structure not publicly disclosed.

The scenario that matters: At 80% Composer 2 fast traffic, gross margin improves from 50% to ~70% — a ~$400M annual improvement. Meaningful, but not the $622M figure you'd get using the standard tier price that most users don't pay.
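A back-of-envelope version of that scenario, under the stated assumptions (inference consumes 50% of a $2B ARR; fast-tier Composer 2 costs roughly half of Claude per token; all figures illustrative, since Cursor's actual cost structure is not disclosed):

```python
ARR = 2_000.0                     # $M annual revenue (reported $2B+ ARR)
BASELINE_INFERENCE = 0.5 * ARR    # assumed $1B/year on model serving
COMPOSER_COST_RATIO = 0.5         # fast tier ~2x cheaper than Claude Sonnet

def gross_margin(composer_share: float) -> float:
    """Gross margin if `composer_share` of traffic moves to Composer 2."""
    inference = BASELINE_INFERENCE * (
        (1 - composer_share) + composer_share * COMPOSER_COST_RATIO
    )
    return (ARR - inference) / ARR

for share in (0.0, 0.5, 0.8):
    print(f"{share:.0%} Composer 2 traffic -> {gross_margin(share):.0%} margin")
# Savings at 80% share: 0.8 * $1,000M * 0.5 = $400M/year
```

At an 80% traffic share the margin moves from 50% to 70%, matching the ~$400M annual improvement cited above.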

We don't know Cursor's actual cost structure. These estimates assume 50% of $2B ARR goes to inference — a common ratio for AI-native products but unverified for Cursor specifically. Fireworks' margin (estimated 30-50% gross) is also embedded in Composer 2's pricing.

The margin stack

Each layer takes margin: Moonshot trained K2.5 (~$8.8M). Fireworks serves it (estimated 30-50% gross margin on hosting). Cursor sells to developers ($1.50/$7.50 fast, marked up from Fireworks' wholesale rate). The end-user price reflects three companies' economics, not one.


Why self-hosting doesn't work (yet)

MoE models are cheap to run per token — but expensive to load into memory.

Kimi K2.5 activates only 32 billion of its 1.04 trillion parameters per token, routing each token to 8 of its 384 experts — a 48× sparsity ratio that makes each token cheap to compute. But every one of those 1.04 trillion parameters must sit in GPU memory, ready for the router to select.

Imagine a hospital with 384 specialist doctors, but each patient only sees 8. You still need all 384 on staff — their salaries (GPU memory) are fixed — even though each consultation (token) only involves a fraction of them. The per-patient cost is low; the facility cost is high.

The MoE memory wall
All 1.04T params must be in VRAM despite only 32B being active per token.
Architecture: K2.5 paper · H200 throughput: HuggingFace · B200: Simplismart · B300: Medium · Pricing: Lambda, Artificial Analysis

On H100s, you need 16–32 GPUs just to fit the model — and throughput is single-digit tokens per second. On next-gen Blackwell B300s, it's viable (1,876 tok/s at 64 users). But Hopper-era self-hosting is a losing proposition.
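The memory wall is straightforward to quantify. A sketch of the weights-only requirement, ignoring KV cache, activations, and parallelism overheads (which add substantially more in practice):

```python
import math

TOTAL_PARAMS = 1.04e12   # all K2.5 parameters must be resident in VRAM
H100_VRAM_GB = 80        # per-GPU memory on an H100

def gpus_needed(bytes_per_param: float) -> int:
    """Minimum H100s just to hold the weights at a given precision."""
    weights_gb = TOTAL_PARAMS * bytes_per_param / 1e9
    return math.ceil(weights_gb / H100_VRAM_GB)

print(gpus_needed(2))    # BF16: ~2.08 TB of weights -> 26 GPUs
print(gpus_needed(1))    # FP8:  ~1.04 TB of weights -> 13 GPUs
```

Once the real overheads are added, practical deployments land in the 16–32 GPU range cited above, depending on precision and sharding strategy.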

This is why Cursor went through Fireworks AI rather than self-hosting. Fireworks amortizes the GPU fleet across customers, making the economics work at a price point ($0.90–1.50/M tokens) that neither Cursor nor Moonshot could match alone.


Who captures the value?

Moonshot built the model. Cursor built the product. The gap in value capture is striking.

The Composer 2 story is a case study in open-source value dynamics. Moonshot invested ~$8.8M to pre-train K2.5 and open-sourced it. Cursor invested ~$26–35M in RL on top and is generating $2B+ in ARR. Compare the two companies side by side:

Metric | Moonshot AI (built Kimi K2.5) | Cursor (built Composer 2 on top)
Valuation | ~$18B (target, Series D) | $29.3B
ARR / Revenue | ~$500M (est.); 20 days post-K2.5 > all of 2025 | $2B+; doubling every 2–3 months
K2.5 investment | ~$8.8M (pre-trained the base model) | ~$26–35M (CPT + RL on top of K2.5)
Total raised | ~$2.5B | ~$2.5B
Team | ~300 (80 core tech) | ~100
Founded | 2023, Beijing | 2022, San Francisco
Status | Private | Private

Sources: Moonshot — 36Kr, KR-Asia, TechCrunch · Cursor — TechCrunch, Stripe, DevGraphiq. Moonshot revenue is approximate. Cursor headcount estimated.

At first glance, this looks like Cursor is winning. But look at who actually moves the needle on capabilities:

Benchmark | Moonshot (K2 → K2.5, $8.8M) | Cursor (K2.5 → Composer 2, $26–35M)
Terminal-Bench 2.0 | +23.0 pts | +10.9 pts
SWE-bench Multilingual | +25.7 pts | +0.7 pts
LiveCodeBench v6 | +31.3 pts | — (no data)
Avg gain per benchmark | +26.7 pts | +5.8 pts
Cost per point gained | $330K / pt | $4.5–6.0M / pt

Moonshot is 14–18× more efficient at producing benchmark gains. For $8.8M, they moved every benchmark by 23–31 points. Cursor spent $26–35M and moved them by 0.7–10.9 points. This is partly diminishing returns (it's harder to go from 73 to 74 than from 47 to 73), but it also reflects the fundamental asymmetry: base model training is the hard, underpaid work.
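The 14–18× figure falls out of the cost-per-point numbers directly:

```python
moonshot_cost_per_pt = 8.8 / 26.7      # ~$0.33M per benchmark point gained
cursor_cost_per_pt_lo = 26.0 / 5.8     # ~$4.5M per point (low-end spend)
cursor_cost_per_pt_hi = 35.0 / 5.8     # ~$6.0M per point (high-end spend)

ratio_lo = cursor_cost_per_pt_lo / moonshot_cost_per_pt   # ~13.6
ratio_hi = cursor_cost_per_pt_hi / moonshot_cost_per_pt   # ~18.3
print(f"Moonshot is {ratio_lo:.0f}-{ratio_hi:.0f}x more cost-efficient per point")
```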

The open-source value paradox: Moonshot does the heavy lifting — 14–18× more efficient at improving capabilities. But Cursor generates 4× the revenue. The market values the last mile (product, distribution, UX) over the first mile (model, research, training). This is the same dynamic that made AWS bigger than the Linux Foundation, and it's why open-source AI labs face a structural monetization challenge.

This maps directly to Yann LeCun's cake metaphor: pre-training is the bulk of the cake, supervised fine-tuning is the icing, and RL is the cherry on top. Nathan Lambert (Interconnects) adds context: "Post-training got more popular because there was more low-hanging fruit. A lot of that potential has been realized." The diminishing returns are showing up in Cursor's numbers — $4.5–6.0M per benchmark point vs Moonshot's $330K.

This doesn't mean Moonshot is losing. K2.5's open release triggered an overseas revenue explosion — 20 days of post-K2.5 revenue exceeded all of 2025. Open-sourcing is Moonshot's distribution strategy, not charity. And the value capture gap may narrow: if the base model is where the capability lives, then Moonshot controls the thing that matters. Cursor's RL is the cherry on top — and cherries are replaceable.


What this means for investors

The model layer is becoming a commodity. The question is who captures the value that used to sit there.

Who | Implication | Signal
Cursor / AI-native apps | Margin expansion, reduced vendor lock-in, model optionality | Positive
Inference providers (Fireworks, Together) | Growing demand as apps shift from API to hosted open-weight | Positive
Anthropic / OpenAI API revenue | Revenue concentration risk if top customers can switch at will | Watch
Open-weight labs (Moonshot, DeepSeek) | Ecosystem adoption, but limited direct monetization | Mixed

The Cursor switch is not an isolated event. It's the first high-profile instance of a pattern that will repeat: AI-native companies evaluating open-weight bases, applying proprietary fine-tuning, and serving through specialized inference providers — cutting out the frontier lab's API entirely.

This framework extends beyond coding. Any company with (a) >$10M/year in API inference spend, (b) a proprietary data moat, and (c) access to ML talent could evaluate the same build-vs-buy calculus. Legal AI firms fine-tuning on case law, medical AI companies training on clinical notes, financial AI platforms with proprietary trading data — all face the same question Cursor answered: is the frontier API model worth 2–5× the cost of a fine-tuned open-weight alternative?

The question for Anthropic and OpenAI is not whether their models are better. On most benchmarks, they still are. The question is whether "better" justifies 2–5× the price — and for how much longer.


What we can independently verify

Most of this story relies on claims from interested parties. Here's what third-party data actually shows.

Signal | Data | Source | Date
Developer adoption | Cursor at 18% usage (vs Copilot ~42%) | Stack Overflow 2025, JetBrains 2025 Survey | 2025
Revenue trajectory | $100M ARR → $500M → $1B → $2B+ | Stripe case study, TechCrunch | Jan '25, May '25, Nov '25, Feb '26
Code volume | ~1B lines of accepted code/day | Aman Sanger (X) | 2025
Enterprise signal | Salesforce: 20K engineers, >90% usage rate | Pragmatic Engineer | 2026
Fireworks K2.5 pricing | $0.60/M input, $3.00/M output (serverless) | Fireworks AI | Mar 2026
Composer 2 pricing | $1.50/M input, $7.50/M output (fast/default); $0.50/$2.50 (standard) | Cursor docs | Mar 2026
Web traffic | cursor.com: #14 in AI tools, #3,004 globally | SimilarWeb | Oct 2025

Key gap: There is no public method to independently measure what percentage of Cursor requests go to Composer 2 vs Claude vs GPT. Cursor does not publish per-model usage breakdowns. The best proxies are developer surveys and community sentiment — which suggest developers use Composer 2 as the "default fast implementer" and switch to Claude/GPT for complex architectural work.