Three Teams, One Insight: How Kimi, Cursor, and Chroma All Converged on the Same RL Recipe for Agentic Models

Something quietly significant happened over the past few months across three independent AI research teams. Moonshot AI (building Kimi K2.5), Cursor (building Composer 2), and Chroma (building Context-1) all landed on the same training methodology, and none of them were coordinating: start from a capable base model, run reinforcement learning rollouts inside the actual production environment, use outcome-based rewards, and scale parallel trajectories asynchronously. The convergence is striking enough to be a signal in its own right about where agentic model training is heading in 2026.
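The shared recipe can be sketched in a few lines. This is a minimal illustration, not any team's actual code: all function names and fields below are hypothetical, the environment interaction is stubbed out, and the policy-update step is omitted. The point is the shape of the loop: many trajectories launched concurrently, each scored only by its final outcome.

```python
import asyncio
import random

async def run_rollout(task_id: int) -> dict:
    """Run one agent trajectory; only the final outcome is scored."""
    await asyncio.sleep(0)  # stand-in for environment/tool-call latency
    succeeded = random.random() < 0.5  # stand-in for verifying the outcome
    return {"task": task_id, "reward": 1.0 if succeeded else 0.0}

async def collect_batch(n_rollouts: int) -> list:
    # Trajectories finish at different times; gather them asynchronously
    # instead of waiting on each rollout in sequence.
    return await asyncio.gather(*(run_rollout(i) for i in range(n_rollouts)))

# One RL step: collect a batch of outcome-scored trajectories,
# then (not shown) feed them to a policy update.
trajectories = asyncio.run(collect_batch(8))
mean_reward = sum(t["reward"] for t in trajectories) / len(trajectories)
```

In a real system the rollout would drive the production agent harness end to end, and the batch would feed a policy-gradient update rather than a mean.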

But the convergence hides three very different engineering problems. Moonshot's team had to prevent "serial collapse" — their orchestrator would default to running sub-agents sequentially rather than in parallel, defeating the point of multi-agent design entirely. Their reward function explicitly penalizes this, alongside "spurious parallelism," where agents spawn workers without meaningful decomposition. Cursor's Composer 2 tackled a different failure: models that lose coherence over long coding sessions. Their solution was to train self-summarization alongside the task itself, letting the agent compress its own context. Chroma's Context-1 focused on RAG, teaching the model to actively prune retrieved documents mid-search rather than passively accepting whatever was handed to it.
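Moonshot's two penalties are easiest to see as a shaped reward. The sketch below is an assumption-laden illustration, not Kimi's actual reward function: the penalty weights, the `(start, end)` span representation, and the `distinct_subtasks` signal are all hypothetical stand-ins for whatever telemetry the real orchestrator exposes.

```python
def orchestrator_reward(outcome_ok: bool,
                        spans: list,
                        distinct_subtasks: int) -> float:
    """Outcome reward minus penalties for degenerate sub-agent usage.

    spans: (start, end) execution times of each spawned sub-agent.
    distinct_subtasks: count of genuinely different subtasks assigned.
    Weights (0.5) are illustrative, not from any published recipe.
    """
    reward = 1.0 if outcome_ok else 0.0
    if len(spans) > 1:
        # Serial collapse: no two sub-agent spans ever overlap in time,
        # i.e. the orchestrator ran its workers one after another.
        overlaps = any(
            a_start < b_end and b_start < a_end
            for i, (a_start, a_end) in enumerate(spans)
            for (b_start, b_end) in spans[i + 1:]
        )
        if not overlaps:
            reward -= 0.5  # penalize sequential-only execution
        # Spurious parallelism: more workers spawned than real subtasks,
        # i.e. parallelism without meaningful decomposition.
        if distinct_subtasks < len(spans):
            reward -= 0.5
    return reward
```

For example, two overlapping workers on two distinct subtasks keep the full outcome reward, while two strictly sequential workers lose the serial-collapse penalty even when the task succeeds.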

Philipp Schmid at HuggingFace synthesizes all three in a concise technical comparison, and the reward hacking examples alone are worth the read — each team independently discovered that reward design is iterative and that the production harness must be simulated faithfully to get useful training signal. For anyone building or evaluating agentic models, this is the clearest picture yet of what the "standard RL recipe" for agentic systems looks like in practice.

Read more at philschmid.de →