
A Negative Result for Activation-Based Prompt Selection Is Exactly the Kind of Benchmark Builders Need

Anatoliy Kolodkin

04 Jun 2026 • 4 min read

Negative results rarely get the attention they deserve because they do not come with a shiny chart saying “new method beats everything.” That is exactly why they are useful. A fresh arXiv paper on activation-based prompt selection delivers the kind of result builders should want before they accidentally turn a clever hypothesis into product infrastructure: the shortcut does not work, at least not well enough to trust.

The paper, Activation-Based Active Learning for In-Context Learning, tests a tempting idea. Transformer MLP activations encode rich internal features. In-context learning performance depends heavily on which examples you put in the prompt. Therefore, maybe activation statistics can identify which candidate examples will make good demonstrations without needing expensive downstream evaluation. It sounds elegant. It also sounds like the kind of thing that could become a vendor slide in about six weeks.

The authors test the idea using Llama-3.2-3B and Qwen2.5-3B base models across classification and generative tasks. The dataset list includes BoolQ, ARC-Challenge, OpenBookQA, and GSM8K. For each dataset, they sample roughly 1,000 candidate in-context examples from the training split, with ARC-Challenge using its full 1,119-sample training split because it is already close to that size. They focus on 1-shot prompting to keep the analysis fine-grained and computationally tractable.

The result is clean enough to be valuable: MLP activation-based active learning metrics do not meaningfully correlate with prompt-example quality. Across the tested tasks and models, the absolute Spearman correlation never exceeds 0.33, and many correlations sit near zero, including cases below 0.1. Translation: if you use these activation metrics to pick examples, you are often selecting with confidence rather than evidence.
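For readers who want to run the same kind of check on their own selector, here is a minimal sketch of a Spearman rank correlation in plain Python. It is not the paper's code. The names `selector_scores` and `one_shot_accuracy` and the numbers in them are hypothetical placeholders for "what the metric said about each candidate example" and "how the model actually did with that example in the prompt."

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rho, computed as the Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # a constant input has no defined rank correlation
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: one selector score and one measured 1-shot accuracy
# per candidate demonstration. These numbers are made up for illustration.
selector_scores = [0.91, 0.15, 0.48, 0.77, 0.33, 0.62]
one_shot_accuracy = [0.58, 0.61, 0.55, 0.60, 0.63, 0.57]

rho = spearman(selector_scores, one_shot_accuracy)
print(f"Spearman rho between selector score and measured quality: {rho:.2f}")
```

If the absolute value of `rho` lands near zero on your own data, the selector is ranking candidates in an order that has little to do with how they perform, which is the situation the paper describes.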

The internal signal is real. The selector is not.

There are two ways to read this paper. The lazy read is “activations are useless.” That is too broad and probably wrong. The better read is that crude activation summaries are not enough to select useful prompt examples. The tested methods look at “massive activations” and the first four statistical moments of MLP activations (mean, variance, skewness, and kurtosis). Those are attractive because they are easy to compute, easy to explain, and plausibly connected to internal model behavior. But easy-to-compute does not mean aligned with the task outcome you care about.
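To make "crude activation summaries" concrete, here is a rough sketch of what such a summary can look like once you have a flat list of MLP activation values for one candidate example. The moment calculations are standard. The `max_to_median_abs` ratio is only an illustrative stand-in for a "massive activation" indicator; the paper's exact definitions may differ, and the function name and inputs are assumptions.

```python
from statistics import fmean, median


def activation_summary(acts):
    """Summarize a flat list of activation values with simple statistics.

    Returns the first four moments (mean, population variance, skewness,
    kurtosis) plus a hypothetical "massive activation" indicator: the largest
    absolute value divided by the median absolute value.
    """
    n = len(acts)
    if n < 2:
        raise ValueError("need at least 2 activation values")

    mu = fmean(acts)
    centered = [a - mu for a in acts]
    var = sum(c * c for c in centered) / n

    if var == 0:
        skew = kurt = 0.0  # degenerate case: every activation is identical
    else:
        skew = (sum(c ** 3 for c in centered) / n) / var ** 1.5
        kurt = (sum(c ** 4 for c in centered) / n) / var ** 2

    abs_vals = [abs(a) for a in acts]
    med = median(abs_vals)
    ratio = max(abs_vals) / med if med > 0 else float("inf")

    return {
        "mean": mu,
        "variance": var,
        "skewness": skew,
        "kurtosis": kurt,
        "max_to_median_abs": ratio,
    }


# Made-up activations: mostly small values with one very large outlier.
example = [0.1, -0.1, 0.1, -0.2, 0.2, 50.0]
print(activation_summary(example))
```

Each candidate example collapses to a handful of numbers. That compression is what makes these metrics cheap, and it is also why they can miss whatever actually makes a demonstration helpful.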

This is where the paper’s superposition hypothesis matters. Neural networks can represent many overlapping features in the same dimensions. A large activation might correspond to syntax, entity type, formatting, dataset artifact, rare token structure, or some latent feature that has nothing to do with whether the example will help the next answer. If that is true, then a simple statistic over dense activations is trying to use a blended internal signal as if it were a clean label. Builders have seen this movie before. It is feature engineering with a lab coat.

The study’s compute setup also makes the result more relevant to ordinary teams than a giant frontier-model experiment would. Most experiments ran on an NVIDIA L4 with batch size 1; GSM8K used an NVIDIA L40 with batch size 64 because generative math evaluation is slower. The models are 3B-class Llama and Qwen systems, exactly the sort of local or inexpensive models teams might reach for when they want private prompt pipelines, local agents, or cheap evaluation loops.

That is why the result matters. Local and small-model deployments often rely more heavily on prompt engineering than fine-tuning. Teams build prompt banks, few-shot selectors, retrieval systems, and routing rules because retraining is expensive or operationally annoying. A method that claims to choose better prompt examples from model internals would be extremely appealing. It could reduce validation cost, automate prompt-bank maintenance, and make few-shot pipelines feel less artisanal.

This paper says: measure before you believe it.

Prompt optimization still owes you downstream evidence

The practical lesson is not complicated. If a prompt optimizer claims to select demonstrations using “model internals,” ask the same questions you would ask of any other optimizer: does it beat random selection, nearest-neighbor retrieval, diversity sampling, uncertainty sampling, and a simple validation-set search? Does it survive model swaps? Does it work on your task distribution, or only on a benchmark where the answer format is unusually clean? Does it improve the metric that actually matters, or only a proxy that sounds mechanistic?
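The first of those questions, whether the selector beats random selection, is cheap to check if you already have a measured downstream score for each candidate demonstration. Below is a minimal sketch under that assumption. All function names and numbers are hypothetical, and a real comparison would also include the retrieval, diversity, and uncertainty baselines listed above.

```python
import random
from statistics import fmean


def top_k_quality(selector_scores, downstream_scores, k):
    """Mean measured downstream score of the k candidates the selector ranks highest."""
    ranked = sorted(
        range(len(selector_scores)),
        key=lambda i: selector_scores[i],
        reverse=True,
    )
    return fmean([downstream_scores[i] for i in ranked[:k]])


def random_baseline(downstream_scores, k, trials=1000, seed=0):
    """Mean downstream score of k randomly drawn candidates, repeated `trials` times."""
    rng = random.Random(seed)  # seeded so the comparison is reproducible
    return [fmean(rng.sample(downstream_scores, k)) for _ in range(trials)]


def compare_to_random(selector_scores, downstream_scores, k, trials=1000, seed=0):
    """Report how the selector's picks compare with random picks of the same size."""
    chosen = top_k_quality(selector_scores, downstream_scores, k)
    baseline = random_baseline(downstream_scores, k, trials, seed)
    beaten = sum(chosen > b for b in baseline) / trials
    return {
        "selector_mean": chosen,
        "random_mean": fmean(baseline),
        "fraction_of_random_draws_beaten": beaten,
    }


# Hypothetical inputs: one selector score and one measured task score
# per candidate demonstration.
selector_scores = [0.9, 0.2, 0.7, 0.4, 0.1, 0.8, 0.3, 0.6]
downstream_scores = [0.55, 0.62, 0.58, 0.60, 0.57, 0.54, 0.63, 0.59]

print(compare_to_random(selector_scores, downstream_scores, k=3))
```

A selector that beats only about half of the random draws is doing no better than chance on this data, however mechanistic its description sounds.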

That last question is the trap. Mechanistic language can make weak evidence sound stronger than it is. “We use activations” sounds more scientific than “we picked examples that looked diverse,” but the production criterion is still downstream behavior. For a coding assistant, did the selected examples improve accepted patches? For a support bot, did they reduce escalations and hallucinated policy statements? For a math tutor, did they improve correctness on held-out problem types? For an agent, did they reduce retries, tool calls, and human review time?

Spearman correlations near zero are not a death sentence for all activation-based approaches. They are a warning about the simple ones. Sparse autoencoders, causal interventions, task-specific probes, or activation methods tied to verified outcomes may still become useful. But those methods need to earn their place against boring baselines. In production, boring baselines are not embarrassing. They are how you avoid shipping interpretability-flavored randomness.

The Qwen and Llama choice is also a subtle reminder that prompt infrastructure is model-specific. A selector that works on one model family may fail on another because the internal representation, tokenizer, pretraining mix, and instruction behavior differ. That matters for teams building model-routing systems or local-agent stacks. If your prompt selection layer only works for one model checkpoint, it is not a general optimization layer. It is a compatibility constraint.

There is an SEO-adjacent angle here for the Qwen/local-agent crowd, but the bigger story is methodological. The AI ecosystem is flooded with plausible shortcuts: use the model’s confidence, use hidden states, use activation magnitude, use agreement between agents, use self-critique, use judge scores. Some of those signals are useful. None of them gets to skip validation. This paper is worth covering because it keeps a clean-sounding shortcut from earning undeserved trust.

For practitioners, the action item is refreshingly unglamorous. Keep a held-out set. Track downstream metrics. Compare against random and retrieval baselines. Re-run the evaluation when you swap models. Treat internal activations as hypotheses, not guarantees. If your prompt optimizer cannot prove it improves the task, delete the clever part and ship the boring one.

That may not make a great launch headline. It makes better software.

Sources: arXiv, In-context learning, prompt/example retrieval prior art, superposition background
