SARDI Turns Diffusion Language Models’ Discarded Tokens Into a Retrieval Signal

Anatoliy Kolodkin

05 Jun 2026 • 5 min read

Most retrieval-augmented generation systems wait until the model knows what it wants to say before asking the search layer for help. SARDI flips that around in a way that feels obvious only after someone writes the paper: the words a model is not confident enough to commit may still be good enough to search with.

Self-Augmenting Retrieval for Diffusion Language Models, or SARDI, is aimed at discrete diffusion language models rather than the left-to-right autoregressive models that dominate current RAG stacks. Diffusion language models generate by iteratively denoising a whole response. At early denoising steps the token guesses are messy. At later steps they become stable. The SARDI idea is to treat those messy intermediate states as a retrieval signal instead of throwing them away as temporary noise.

That makes the paper more interesting than another “diffusion LMs might be faster” result. Parallel decoding is the obvious headline for diffusion language models. SARDI argues that the denoising trajectory itself is useful infrastructure. A half-formed answer can expose bridge entities, candidate relations, or missing facts before the final answer exists. If the retrieval system can use those speculative tokens without letting them leak into the committed output, it can fetch better evidence earlier.

Search with low confidence, answer with high confidence

The mechanism is clean. SARDI separates retrieval confidence from generation confidence. At each denoising step, tokens above a lower query threshold can be used to form a search query. Only tokens above a higher commit threshold are added to the output. In other words, the model can say, “I am not ready to assert this, but it is plausible enough to search around.”
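
To make the split concrete, here is a minimal sketch of one denoising step with two thresholds. It is an illustration, not the paper's implementation: the `TokenGuess` type, the `split_by_confidence` helper, the `[MASK]` placeholder, and the 0.5 query threshold are all assumptions made for this example. Only the commit threshold of 0.9 echoes a value the paper reports.

```python
from dataclasses import dataclass

MASK = "[MASK]"  # hypothetical placeholder for a position that stays uncommitted


@dataclass
class TokenGuess:
    """One position's current best guess and the model's confidence in it."""
    text: str
    confidence: float


def split_by_confidence(guesses, query_threshold=0.5, commit_threshold=0.9):
    """Return (query_terms, committed) for a single denoising step.

    The 0.5 query threshold is illustrative and not taken from the paper.
    """
    if query_threshold > commit_threshold:
        raise ValueError("query threshold should not exceed commit threshold")
    # Anything plausible enough may inform the search query.
    query_terms = [g.text for g in guesses if g.confidence >= query_threshold]
    # Only high-confidence tokens reach the output; the rest stay masked.
    committed = [
        g.text if g.confidence >= commit_threshold else MASK for g in guesses
    ]
    return query_terms, committed


step = [
    TokenGuess("The", 0.97),
    TokenGuess("director", 0.95),
    TokenGuess("Kubrick", 0.62),
    TokenGuess("was", 0.93),
    TokenGuess("born", 0.91),
    TokenGuess("Paris", 0.31),
]
query_terms, committed = split_by_confidence(step)
print("query:", " ".join(query_terms))
print("committed:", committed)
```

In this toy step, "Kubrick" is searched on but not yet written into the answer, while "Paris" is too weak for either use.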

That split is the useful conceptual move. Current RAG systems often collapse three jobs into one prompt: decide what you need, retrieve it, and answer. Static retrieval asks only the original question, which fails on multi-hop tasks where the useful query depends on an intermediate entity. Autoregressive dynamic retrieval, including FLARE-style lookahead, can generate tentative future text and search from that, but it is still hostage to a prefix. If the prefix goes wrong, retrieval follows the wrong trail. Diffusion decoding gives the model multiple tentative future tokens without the same hard left-to-right commitment.

The experiments use BM25 with K=7 passages per iteration and also test E5-base-v2 dense retrieval. The benchmark set is sensible for the claim: 2WikiMultiHopQA, HotpotQA, MuSiQue, CofCA, and SynthWorlds-RM. The last two matter because counterfactual corpora reduce the “the model memorized the answer during pretraining” problem that makes too many RAG benchmarks less informative than they look.

The reported gains are not subtle. On 2WikiMultiHopQA, static retrieval with the diffusion language model at commit threshold 0.9 scores 43.7 exact match (EM) in 0.46 seconds. SARDI at 0.9 scores 57.8 EM in 0.39 seconds, and SARDI at 0.95 reaches 59.1 EM in 0.56 seconds. On HotpotQA, SARDI moves from static DLM retrieval at 39.9 EM to roughly 48.5–48.7 EM. On MuSiQue, it moves from 11.1 EM to about 20.5–20.6 EM. The question-type breakdown is especially revealing: on 2Wiki, bridge-composition improves from 16.6% to 45.3%, and compositional inference from 14.0% to 37.5%.
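
For readers who want the deltas rather than the raw scores, the arithmetic on the 2Wiki figures quoted above is short. The numbers come straight from the paragraph; the dictionary layout and the `gain` helper are just a convenience invented for this example.

```python
# (baseline, SARDI) pairs for 2WikiMultiHopQA, as quoted in the text.
reported = {
    "overall EM (commit 0.9)": (43.7, 57.8),
    "bridge-composition": (16.6, 45.3),
    "compositional inference": (14.0, 37.5),
}


def gain(baseline, improved):
    """Absolute gain in points and the multiplicative ratio, both rounded."""
    return round(improved - baseline, 1), round(improved / baseline, 2)


gains = {name: gain(*pair) for name, pair in reported.items()}
for name, (points, ratio) in gains.items():
    print(f"{name}: +{points} points, {ratio}x the baseline")
```

The overall gain is about 14 points, while the two multi-hop categories gain 28.7 and 23.5 points, which is where the ratios are largest.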

Those are exactly the categories where question-only retrieval tends to fail. The first query does not contain the bridge entity because the model has not discovered it yet. SARDI uses the model’s intermediate denoising state as a preview of where the answer may go, retrieves evidence, and then lets later denoising steps commit more safely.

The throughput claim is less important than the scheduling lesson

The abstract claims performance at up to 8× higher throughput, but the part operators should underline is retrieval scheduling. Per-step retrieval sounds expensive. If every denoising step hits BM25, a dense retriever, a reranker, or a large production corpus, the retrieval layer can become the bottleneck faster than the model does.

SARDI’s ablation makes that less scary. Refreshing retrieval every two denoising steps costs only 1–2 EM in the reported table, and the paper says 83–90% of retrieved documents persist between consecutive steps. That is the production-shaped result. It suggests dynamic retrieval does not have to mean “search constantly.” It can mean “search when the speculative state has changed enough to justify it.”
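
A rough sketch of what a fixed refresh schedule looks like in code, along with the persistence measurement behind that 83–90% figure. Everything here is hypothetical scaffolding: `toy_retrieve` stands in for BM25 or a dense retriever, the document ids are made up, and `run_schedule` is not the paper's code.

```python
def persistence(previous_ids, current_ids):
    """Fraction of the previously retrieved documents that are retrieved again."""
    previous = set(previous_ids)
    if not previous:
        return 0.0
    return len(previous & set(current_ids)) / len(previous)


def run_schedule(queries, retrieve, refresh_every=2):
    """Call `retrieve` only on scheduled steps and reuse cached docs otherwise."""
    cached, calls, overlaps = [], 0, []
    for step, query in enumerate(queries):
        if step % refresh_every == 0:
            fresh = retrieve(query)
            calls += 1
            if cached:
                overlaps.append(persistence(cached, fresh))
            cached = fresh
        # On skipped steps the model would keep conditioning on `cached`.
    return cached, calls, overlaps


def toy_retrieve(query):
    """Hypothetical stand-in for a real retriever; returns made-up document ids."""
    index = {
        "director": ["d1", "d2", "d3"],
        "director kubrick": ["d1", "d2", "d4"],
        "director kubrick born": ["d1", "d2", "d4"],
        "director kubrick born new york": ["d2", "d4", "d5"],
    }
    return index.get(query, [])


queries = list(toy_retrieve.__doc__ and [
    "director",
    "director kubrick",
    "director kubrick born",
    "director kubrick born new york",
])
docs, calls, overlaps = run_schedule(queries, toy_retrieve, refresh_every=2)
print(calls, "retrieval calls for", len(queries), "denoising steps")
print("overlap between refreshes:", [round(o, 2) for o in overlaps])
```

Setting `refresh_every=1` in the same harness gives the per-step baseline, so the cost of skipping steps can be measured on your own corpus instead of assumed.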

For teams building RAG systems today, this is actionable even if they are not running diffusion language models. Retrieval timing is a design variable, not a fixed prelude to generation. You can cache retrieved evidence across turns. You can trigger refreshes only when the model introduces new entities or uncertainty spikes. You can distinguish evidence used for exploration from evidence allowed into the final answer. You can log which speculative queries led to useful documents and which just burned latency.
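
One way to prototype the "refresh only on new entities" idea is a small gate that remembers what it has already searched on. This is a sketch under assumptions: `RefreshGate` is a hypothetical helper, and it expects the caller to supply entity strings from whatever extractor the pipeline already uses.

```python
class RefreshGate:
    """Signals a retrieval refresh only when previously unseen entities appear."""

    def __init__(self):
        self._seen = set()

    def new_entities(self, entities):
        """Return the unseen entities (sorted); an empty list means skip retrieval."""
        new = set(entities) - self._seen
        self._seen |= new
        return sorted(new)


gate = RefreshGate()
drafts = [
    {"Kubrick"},               # first hypothesis introduces an entity
    {"Kubrick"},               # nothing new, so cached evidence is reused
    {"Kubrick", "Manhattan"},  # a bridge entity appears, so search again
]
decisions = [gate.new_entities(entities) for entities in drafts]
for step, new in enumerate(decisions):
    action = f"refresh on {new}" if new else "reuse cached evidence"
    print(f"step {step}: {action}")
```

The same gate works in an autoregressive pipeline: feed it the entities from each draft or turn and only pay for retrieval when the list it returns is non-empty.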

That last point matters because SARDI’s mechanism implies a better observability surface. Instead of a single opaque retrieval call at the top of the prompt, you get a trace of when the model thought it needed new evidence, what partial tokens prompted the query, and whether the retrieved documents persisted. RAG failures are often blamed on “the model hallucinated” when the real issue is that the system never searched for the right bridge fact. A dynamic retrieval trace gives engineers something more concrete to debug.
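
A trace like that does not need much machinery. The sketch below logs one JSON line per retrieval call; the `RetrievalEvent` fields and the `record` helper are assumptions about what would be worth capturing, not a format defined by the paper.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RetrievalEvent:
    step: int              # denoising step (or turn) that triggered the search
    trigger_tokens: list   # speculative tokens that formed the query
    doc_ids: list          # documents returned by the retriever
    persisted: float       # share of the previous event's docs still present


def record(trace, step, trigger_tokens, doc_ids):
    """Append an event, computing persistence against the previous event."""
    previous = set(trace[-1].doc_ids) if trace else set()
    persisted = len(previous & set(doc_ids)) / len(previous) if previous else 0.0
    event = RetrievalEvent(step, list(trigger_tokens), list(doc_ids), round(persisted, 2))
    trace.append(event)
    return event


trace = []
record(trace, 0, ["director"], ["d1", "d2", "d3"])
record(trace, 2, ["director", "Kubrick"], ["d1", "d2", "d4"])
for event in trace:
    print(json.dumps(asdict(event)))
```

With one line per event, a failed multi-hop answer can be checked against the log to see whether the bridge entity ever appeared in `trigger_tokens` at all.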

Do not rebuild your stack around this tomorrow

There are caveats. The code repository at github.com/pauljngr/SARDI exists but currently says “Coming soon,” with no releases and no meaningful adoption signal. Public practitioner reaction is essentially nonexistent. This is a research result, not a deployed ecosystem. The paper also notes an important limitation: the tested diffusion models do not reliably produce reasoning traces from prompting alone, so the training-free SARDI framework still depends on models that have been made reasoning-capable.

That means the immediate lesson is not “replace your autoregressive RAG model.” It is to borrow the separation of concerns. Speculative text can be useful for search before it is safe for output. Retrieval confidence and answer confidence should not be the same threshold. Evidence refresh should be scheduled, cached, and measured. And multi-hop systems need ways to retrieve after the model has formed partial hypotheses, not only before generation begins.

The strongest outside comparison is FLARE-style autoregressive retrieval, where a model generates predicted future text to guide search. SARDI’s advantage is conceptual cleanliness: diffusion models already maintain a field of tentative future tokens. The system does not need to force speculation through a brittle prefix. It can mine the denoising process for search terms while still holding the output to a higher bar.

My take: SARDI is the first diffusion-language-model RAG paper I would put in front of a practitioner without apologizing for it. Not because everyone should adopt diffusion LMs this quarter, but because it names a real RAG bug. The tokens a model is not confident enough to say may be exactly the tokens your retriever needs to see.

Sources: arXiv, SARDI GitHub repository, FLARE baseline, DREAM diffusion language model context
