
SearchSwarm Treats Delegation as a Trainable Model Skill, Not Just an Agent Framework Trick

Anatoliy Kolodkin

09 Jun 2026 • 4 min read

Most multi-agent products treat delegation as plumbing. There is a main agent, a few subagents, a queue, a summary format, and a prompt that says something like “delegate when useful.” That works about as well as telling a junior engineer to “break down the project” without teaching them what a good breakdown looks like. SearchSwarm’s useful idea is that delegation is not just a framework feature. It is a model skill.

The paper targets long-horizon deep research, where a task can exceed finite context windows and simple browsing loops run out of attention before they run out of web pages. The proposed system builds harness-guided delegation trajectories: examples of when to split work, what to ask subagents, how much context to send, and how to integrate returned summaries. Those traces are then used as supervised fine-tuning data for SearchSwarm-30B-A3B. The authors report 68.1 on BrowseComp and 73.3 on BrowseComp-ZH, which they claim is best-in-class among comparable-scale models.

Long context does not remove the need for management

SearchSwarm lands because it attacks a problem that large context windows do not solve. Deep research tasks are not merely long. They branch. The agent needs to follow multiple hypotheses, collect contradictory evidence, drop weak leads, preserve citations, and synthesize without losing the original question. A million-token context window can hold more debris, but it does not automatically decide which debris matters.

Humans delegate for the same reason. Not because one person has no memory, but because scoped work and bounded reports are more reliable than one person carrying every detail at once. Good delegation has structure: define the subproblem, state constraints, identify acceptable sources, say what not to spend time on, request evidence in a usable format, and integrate the answer without blindly trusting it. That is a cognitive skill. SearchSwarm’s premise is that models can learn some of it from traces rather than relying entirely on runtime prompt tricks.
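One way to make that structure concrete is to write the brief down as a typed object. This is a minimal sketch, not SearchSwarm's actual format: the `DelegationBrief` class and all of its field names are assumptions made for illustration, mapped one-to-one onto the checklist above.

```python
from dataclasses import dataclass, field


@dataclass
class DelegationBrief:
    """Hypothetical brief a main agent hands to one subagent."""

    subproblem: str
    constraints: list[str] = field(default_factory=list)
    acceptable_sources: list[str] = field(default_factory=list)
    out_of_scope: list[str] = field(default_factory=list)
    evidence_format: str = ""

    def missing_fields(self) -> list[str]:
        """Return the names of fields left empty, in declaration order."""
        values = {
            "subproblem": self.subproblem.strip(),
            "constraints": self.constraints,
            "acceptable_sources": self.acceptable_sources,
            "out_of_scope": self.out_of_scope,
            "evidence_format": self.evidence_format.strip(),
        }
        return [name for name, value in values.items() if not value]

    def render(self) -> str:
        """Render the brief as plain text for a subagent prompt."""
        lines = [f"Subproblem: {self.subproblem}"]
        if self.constraints:
            lines.append("Constraints: " + "; ".join(self.constraints))
        if self.acceptable_sources:
            lines.append("Acceptable sources: " + "; ".join(self.acceptable_sources))
        if self.out_of_scope:
            lines.append("Do not spend time on: " + "; ".join(self.out_of_scope))
        if self.evidence_format:
            lines.append("Return evidence as: " + self.evidence_format)
        return "\n".join(lines)
```

A brief that reports missing fields is an early warning sign: the main agent is about to hand off work without saying what counts as a good answer.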

For builders, this is a useful reframing. If your agent framework supports subagents but your top-level model does not know when to use them, you have parallelism theater. You can burn more tokens, run more searches, and get more summaries while still failing to answer the question. The bottleneck becomes the delegation policy: what work gets split, what evidence gets returned, and how the main agent budgets attention across competing paths.

BrowseComp is the right kind of uncomfortable

The reported BrowseComp number is attention-grabbing because BrowseComp was designed to punish shallow search. OpenAI described it as 1,266 hard-to-find browsing problems with short, verifiable answers. In OpenAI’s launch evaluation, human trainers solved 29.2% of problems, GPT-4o with browsing scored 1.9%, o1 scored 9.9%, and Deep Research scored 51.5%. Against that backdrop, a 30B-A3B model reporting 68.1 is notable if the setup holds under independent scrutiny.

That “if” is doing real work. Agent benchmarks are notoriously sensitive to browsing tools, retry budgets, timeouts, answer normalization, search providers, and hidden scaffolding. A BrowseComp score without a clear tool budget and trajectory trace is a partial fact. It may be impressive, but it is not fully operational evidence. The SearchSwarm authors say they will release the harness, model weights, and training data. They should, because delegation behavior is otherwise impossible to audit from final answers alone.

This is where the broader benchmark reproducibility problem comes back into frame. For deep research agents, the path matters as much as the answer. Did the agent spawn ten workers or two? Did workers search independently or duplicate effort? Did the main agent verify returned claims? Did it preserve source provenance or compress away the part that made the answer checkable? These are not implementation details. They are the product.
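Those questions can be turned into numbers if the trajectory is logged. The sketch below assumes a hypothetical trace shape (`WorkerRun`, with claims stored as dicts carrying `source` and `verified` keys); none of it comes from the SearchSwarm harness, which is not yet public.

```python
from dataclasses import dataclass, field


@dataclass
class WorkerRun:
    """Hypothetical record of one subagent's work within a single task."""

    worker_id: str
    queries: list[str] = field(default_factory=list)
    # Each claim: {"text": str, "source": str or None, "verified": bool}
    claims: list[dict] = field(default_factory=list)


def normalize(query: str) -> str:
    """Lowercase and collapse whitespace so near-identical queries match."""
    return " ".join(query.lower().split())


def audit(runs: list[WorkerRun]) -> dict:
    """Summarize how the work was done, independent of the final answer."""
    queries = [normalize(q) for run in runs for q in run.queries]
    claims = [c for run in runs for c in run.claims]
    total = len(claims)
    verified = sum(1 for c in claims if c.get("verified"))
    sourced = sum(1 for c in claims if c.get("source"))
    return {
        "workers": len(runs),
        # Queries issued more than once across all workers.
        "duplicate_queries": len(queries) - len(set(queries)),
        "verified_fraction": verified / total if total else 0.0,
        "sourced_fraction": sourced / total if total else 0.0,
    }
```

Two runs with the same final answer can look very different under an audit like this, which is the argument for publishing trajectories alongside scores.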

The product pattern is trace collection

The immediate value for engineering teams is not necessarily to fine-tune their own SearchSwarm clone tomorrow. It is to start treating delegation traces as first-class data. When a senior analyst or engineer handles a messy research task, capture how they split it. What context did they give a helper? What source types did they prefer? What did they explicitly rule out? What did a useful returned note contain? What made a summary unusable?

Those traces can serve three purposes at once. They are training data if you later fine-tune or preference-train a delegation policy. They are eval data if you want to test whether a new model decomposes work sensibly. And they are product requirements because they reveal the interfaces your subagents need. Most teams already log final answers. Fewer log the failed decomposition that made the final answer weak. That is the part worth keeping.
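A trace record does not need to be elaborate to serve all three purposes. Here is a rough sketch, assuming a hypothetical `DelegationTrace` schema and JSONL storage; the field names are illustrative and not taken from the paper.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class DelegationTrace:
    """Hypothetical record of how one research task was decomposed."""

    task: str
    subtasks: list[str]
    context_given: dict[str, str]   # subtask -> context handed to the helper
    ruled_out: list[str]            # sources or paths explicitly excluded
    returned_notes: dict[str, str]  # subtask -> note that came back
    outcome: str                    # "useful" or "unusable"
    failure_reason: Optional[str] = None


def to_jsonl_line(trace: DelegationTrace) -> str:
    """Serialize one trace as a single JSON line for append-only logging."""
    return json.dumps(asdict(trace), ensure_ascii=False)


def as_eval_case(trace: DelegationTrace) -> dict:
    """Reuse the same trace as an eval case for decomposition quality."""
    return {
        "input": trace.task,
        "reference_subtasks": trace.subtasks,
        # Failed decompositions are kept as negative examples, not dropped.
        "is_negative_example": trace.outcome != "useful",
    }
```

Keeping `failure_reason` on the record is the point: the same log line can feed training, evals, and interface design only if weak decompositions are stored with an explanation.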

There is also a cost-governance angle. Delegation can reduce context pressure, but it can also turn one expensive agent run into a distributed token bonfire. A main agent that spawns workers for every ambiguity is not intelligent; it is avoiding prioritization. Good delegation should be budget-aware. It should estimate whether a subtask is independent, whether the answer can be verified, whether the expected information gain justifies the cost, and whether a cheaper model can handle the branch. Framing delegation as a learned skill turns each of those judgments into behavior you can evaluate.
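As a rough illustration of what budget-aware could mean in code, the sketch below applies the four checks as explicit rules. Everything here is an assumption: the `Subtask` fields, the thresholds, and the decision labels are hypothetical, and a trained delegation policy would learn these trade-offs rather than hard-code them.

```python
from dataclasses import dataclass


@dataclass
class Subtask:
    """Hypothetical estimates a main agent makes before delegating."""

    name: str
    independent: bool     # can it be solved without the main agent's state?
    verifiable: bool      # can the returned answer be checked?
    expected_gain: float  # subjective information gain, 0.0 to 1.0
    est_tokens: int       # estimated cost of running the branch
    difficulty: float     # 0.0 (easy) to 1.0 (hard)


def decide(
    subtask: Subtask,
    remaining_tokens: int,
    min_gain_per_1k_tokens: float = 0.05,  # illustrative threshold
    cheap_difficulty_max: float = 0.5,     # illustrative threshold
) -> str:
    """Return one of: keep_in_main_agent, skip, delegate_cheap, delegate_strong."""
    # Entangled or uncheckable work stays with the main agent.
    if not subtask.independent or not subtask.verifiable:
        return "keep_in_main_agent"
    # A branch that does not fit the remaining budget is dropped.
    if subtask.est_tokens > remaining_tokens:
        return "skip"
    # Expected gain must justify the estimated cost.
    gain_per_1k = subtask.expected_gain / (max(subtask.est_tokens, 1) / 1000)
    if gain_per_1k < min_gain_per_1k_tokens:
        return "skip"
    # Easy branches go to a cheaper model.
    if subtask.difficulty <= cheap_difficulty_max:
        return "delegate_cheap"
    return "delegate_strong"
```

Even a crude rule set like this gives you something to log and compare against: every spawned worker comes with a recorded reason, and every skipped branch does too.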

The deep research market is full of products that look impressive because they produce long reports with citations. The next quality bar is whether they know how to investigate. That means delegation literacy: splitting the right questions, preserving evidence, avoiding duplicate work, and synthesizing without laundering uncertainty into confident prose. Long-context models will keep improving, but attention is still a scarce resource. The winning systems will manage it, not merely expand it.

SearchSwarm is preliminary until the promised artifacts are public and independent teams can reproduce the numbers. Still, the direction is right. Subagents are not magic workers hiding behind a button. They are an organizational structure. Models need to learn how to manage that structure, and builders need to evaluate the management behavior directly. Otherwise, “multi-agent” remains what it too often is today: one confused agent with more tabs open.

Sources: arXiv, OpenAI BrowseComp, arXiv HTML
