Entropy-Cut Sampling Makes Reasoning Models Look Less Like RL Magic and More Like Search
The lazy story about reasoning models is that intelligence lives in the weights and inference is just the receipt printer. Reasoning with Sampling pushes against that. Its Entropy-Cut Metropolis-Hastings method makes reasoning look less like a mystical property unlocked only by RL posttraining and more like a search problem where the system needs to revisit the right branch point instead of burning tokens everywhere.
This matters because the cost crisis in agentic AI is not only about model size. It is about inference policy. Reasoning traces get long, samples multiply, retries stack up, verifiers add calls, tools add latency, and suddenly a task that looked cheap in a chat window becomes a small distributed system with a credit card attached. If you are going to spend extra compute on reasoning, you should spend it where the reasoning actually forked.
The paper builds on a simple premise: base models already assign some probability mass to good reasoning traces. Sampling can elicit those traces without necessarily changing the weights. The new contribution is where the sampler cuts. Instead of cutting uniformly at arbitrary token positions, Entropy-Cut uses next-token entropy from the base model as a proxy for consequential decision points. High entropy suggests the model is uncertain about what comes next; in a reasoning trace, that may correspond to choosing a proof strategy, an algorithm, a decomposition, or a scientific premise.
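To make the entropy proxy concrete, here is a minimal sketch of next-token Shannon entropy computed from a probability distribution. The logits are made-up illustration values, not output from any model in the paper, and the helper names are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_entropy(probs):
    """Shannon entropy, in nats, of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical logits at two positions in a reasoning trace.
confident = softmax([9.0, 1.0, 0.5, 0.2])   # one continuation dominates
uncertain = softmax([2.0, 1.9, 2.1, 1.8])   # several continuations compete

print(f"confident position: {token_entropy(confident):.3f} nats")
print(f"uncertain position: {token_entropy(uncertain):.3f} nats")
```

The second position scores much higher because probability is spread across several candidates. That is the kind of spike the method reads as a possible decision point.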
Stop resampling the punctuation
Uniform-cut methods can waste effort rewriting local suffix details. If the model picked the wrong proof route two paragraphs ago, resampling the last few algebra tokens is mostly theater. Entropy-Cut tries to return to the uncertainty spike where the strategy changed, then resample from there. The theoretical claim, in a stylized model, is that mixing time scales with the number of decisions in a trace rather than the number of tokens. That is exactly the abstraction production systems want: reason about decisions, not token count.
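The difference between the two cutting policies can be sketched in a few lines. This is an illustration under assumptions, not the paper's algorithm: it samples a cut index with probability proportional to per-token entropy, and it leaves out the Metropolis-Hastings accept/reject step entirely. A real sampler would also have to account for the non-uniform proposal in its acceptance ratio. The function names and the toy entropy profile are hypothetical.

```python
import random

def uniform_cut(num_tokens, rng):
    """Baseline: every token position is equally likely to be the cut."""
    return rng.randrange(num_tokens)

def entropy_weighted_cut(entropies, rng):
    """Pick a cut index with probability proportional to its entropy.

    Falls back to a uniform choice if every entropy is zero.
    """
    weights = [max(h, 0.0) for h in entropies]
    if sum(weights) == 0.0:
        return rng.randrange(len(entropies))
    return rng.choices(range(len(entropies)), weights=weights, k=1)[0]

# Toy trace: 50 tokens of near-certain continuation, two uncertainty spikes.
entropies = [0.01] * 50
entropies[10] = 3.0   # hypothetical "which strategy?" moment
entropies[30] = 3.0   # hypothetical "which decomposition?" moment

rng = random.Random(0)
draws = [entropy_weighted_cut(entropies, rng) for _ in range(2000)]
share = sum(d in (10, 30) for d in draws) / len(draws)
print(f"share of cuts landing on a spike: {share:.2f}")
```

With this profile the two spikes carry most of the sampling weight, so nearly all cuts land on them. A uniform cut would pick them only 2 times in 50.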
The benchmark spread is broad enough to be interesting: MATH500, HumanEval, GPQA Diamond, and AIME26. Models include Qwen2.5-7B, Qwen2.5-Math-7B, Qwen3-8B-Base, Phi-3.5-mini-instruct, and Phi-4-mini-instruct. The reported sampling setup uses maximum length T = 3072 and block size B = 192.
The gains are large on several open models. Qwen2.5-7B moves from standard scores of 35.9/33.0/29.4/2.0 on MATH500, HumanEval, GPQA, and AIME26 to Entropy-Cut MH scores of 71.9/68.9/30.2/9.4. Qwen2.5-Math-7B with Entropy-Cut reaches 79.0 on MATH500, 59.9 on HumanEval, 34.1 on GPQA Diamond, and 13.1 on AIME26. Qwen3-8B-Base reaches 80.2, 79.3, 40.0, and 10.3, beating the listed standard, low-temperature, SMC, TMC, and Uniform-Cut MH rows in most columns. Phi-4-mini-instruct posts 68.8/68.4/33.3/8.7, slightly above Uniform-Cut MH across all four tasks.
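It helps to turn the Qwen2.5-7B numbers quoted above into per-benchmark deltas, because the gains are far from uniform. The snippet only does arithmetic on the scores already listed.

```python
BENCHMARKS = ("MATH500", "HumanEval", "GPQA Diamond", "AIME26")

# Qwen2.5-7B scores as quoted in the text above.
standard = (35.9, 33.0, 29.4, 2.0)
entropy_cut_mh = (71.9, 68.9, 30.2, 9.4)

def deltas(before, after):
    """Absolute score change per benchmark, rounded to one decimal."""
    return [round(a - b, 1) for b, a in zip(before, after)]

for name, d in zip(BENCHMARKS, deltas(standard, entropy_cut_mh)):
    print(f"{name}: {d:+.1f}")
```

MATH500 and HumanEval each move by about 36 points, AIME26 by 7.4, and GPQA Diamond by less than one. Whatever the sampler is recovering, it is not equally available on every task.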
The practical lesson is not “sampling is free intelligence.” It is not free. Sampling spends tokens, increases latency, complicates stopping rules, and creates selection problems. The lesson is that inference policy is part of model capability. Before assuming you need a bigger hosted reasoning model, a new fine-tune, or an RL run, benchmark smarter decoding policies against your tasks. Sometimes the base model has the route in distribution; your system just keeps restarting from the wrong place.
For coding agents, this is especially relevant. Many failures are not local syntax errors. They are early strategy mistakes: choosing the wrong abstraction, editing the wrong file, writing a patch before reading the failing test, debugging symptoms instead of the call path, or deciding not to use a tool that would have resolved uncertainty. If your agent retries by asking the model to “try again” from the end of a bad trajectory, you may be paying to rephrase the same mistake. A decision-point retry policy is a better match for how engineering work actually fails.
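For an agent loop, a decision-point retry could look something like the sketch below. It is an assumption-laden illustration, not something the paper describes: `Step`, the per-step `entropy` score, and the `min_entropy` threshold are all hypothetical, and how to aggregate token entropies into a step-level score is left open.

```python
from dataclasses import dataclass

@dataclass
class Step:
    action: str
    entropy: float  # hypothetical step-level uncertainty score

def retry_prefix(steps, min_entropy=1.0):
    """Return the steps to keep; the retry resamples everything after them.

    Cuts at the highest-entropy step above the threshold. If no step
    qualifies, falls back to redoing only the final step.
    """
    candidates = [i for i, s in enumerate(steps) if s.entropy >= min_entropy]
    if not candidates:
        return steps[:-1]
    cut = max(candidates, key=lambda i: steps[i].entropy)
    return steps[:cut]

failed_run = [
    Step("read the issue", 0.2),
    Step("choose which file to edit", 2.3),   # the strategic fork
    Step("write the patch", 0.4),
    Step("run the tests", 0.1),
]

kept = retry_prefix(failed_run)
print([s.action for s in kept])
```

Here the retry keeps only the step before the file choice and regenerates from the fork, instead of asking for another patch to the same wrong file.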
This is where observability comes in. Teams should instrument reasoning traces, not just final answers. Track entropy spikes, tool-decision points, branch choices, verifier failures, retry locations, and whether a retry actually changes strategy. If a sampler is producing five variants of the same flawed plan, it is not search; it is expensive paraphrasing. If retries concentrated at high-entropy decision points produce different solution paths and higher pass rates, you have a cost-control lever.
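One way to check whether retries are search or paraphrase is to log each retry and compare the groups. The sketch below assumes a hypothetical `Retry` record with a `strategy` label (from a plan hash, a classifier, or a human tag) and an arbitrary spike threshold. The records are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Retry:
    cut_entropy: float  # entropy at the position the retry restarted from
    strategy: str       # hypothetical label for the plan the retry followed
    passed: bool        # did the verifier accept the result?

def summarize(retries, spike=1.0):
    """Compare retries cut at entropy spikes against all other retries."""
    groups = {"spike": [], "other": []}
    for r in retries:
        groups["spike" if r.cut_entropy >= spike else "other"].append(r)
    report = {}
    for name, group in groups.items():
        n = len(group)
        report[name] = {
            "retries": n,
            "distinct_strategies": len({r.strategy for r in group}),
            "pass_rate": sum(r.passed for r in group) / n if n else None,
        }
    return report

log = [
    Retry(0.1, "patch parser.py", False),
    Retry(0.2, "patch parser.py", False),
    Retry(0.1, "patch parser.py", False),
    Retry(2.4, "patch lexer.py", True),
    Retry(2.1, "add missing import", False),
]

for name, stats in summarize(log).items():
    print(name, stats)
```

In this made-up log, the low-entropy retries produce one strategy three times over, while the spike retries produce two different plans and the only pass. Production data may not split this cleanly, which is exactly why it is worth measuring.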
There are caveats. Entropy is a proxy, not an oracle. High entropy can mean a meaningful strategic fork, but it can also mean ambiguity, formatting uncertainty, prompt underspecification, or benchmark noise. The paper reports benchmark accuracy, not production latency, dollar cost, or integration complexity. No project repository turned up in the paper's metadata, and an exact-match search on Hacker News returned zero hits during research for this piece, so there is essentially no adoption signal yet. Treat this as a method worth reproducing, not a drop-in product.
The broader implication is uncomfortable for leaderboard culture. If a base model plus a smarter sampler can approach or beat some posttrained behavior, then “reasoning” is not solely a property of the static model artifact. It is a property of the model plus inference algorithm plus budget plus selection policy. That makes evaluation messier, but it is closer to reality. Production AI systems are already stacks. Pretending the weights alone deserve all the credit is convenient and wrong.
The forward-looking take: reasoning systems should manage compute around decisions. Cut where the model was uncertain, resample where the plan forked, and measure whether the new branch actually helps. Spraying tokens at random suffixes and calling it intelligence was never going to age well.