PropMe Separates Training-Data Extractability From Normal-Use Leakage

Memorization debates in AI are usually too blunt to be useful. One side points to adversarial extraction and says models leak training data. The other points to ordinary product prompts and says leakage is rare. Both can be right, and that is exactly why one-number memorization audits are not good enough.

PropMe, a new arXiv paper from Gianluca Barmina, Peter Schneider-Kamp, and Lukas Galke Poech at the University of Southern Denmark, gives the distinction a cleaner vocabulary. The paper separates memorization capability from memorization propensity. Capability asks what a model can be forced to reproduce under adversarial or prefix-style prompting. Propensity asks what it tends to reproduce under ordinary prompt distributions. Those are different risk questions. Security teams need the first. Product teams, regulators, and users also need the second.

The accompanying SimpleTrace pipeline is the part builders should pay attention to. It uses infini-gram-style corpus tracing to attribute generated text back to training documents, then computes verbatim, near-verbatim, and propensity-transformed memorization metrics. The GitHub repository is not a placeholder: it documents indexing, unigram precomputation, generation, tracing, validation, propensity metrics, and experiment presets. That matters because the strongest memorization audit is not a classifier’s vibe. It is attribution against a corpus you can inspect.

Worst-case extraction is not the same as normal-use leakage

PropMe compares generic and specific non-adversarial prompts against prefix-style attacks. The evaluated settings include Comma and DFM Decoder on Common Pile and Dynaword, covering English and Danish data. SimpleTrace indexes the training corpus, traces generated spans back to matching source documents, and reports metrics such as average longest span, full-match ratio, and near-verbatim recall.
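To make the span metric concrete, here is a minimal sketch of tracing a generation back to a corpus and measuring its longest matching span. It is not SimpleTrace's implementation: it swaps in a naive longest-common-substring scan over whitespace tokens where the real pipeline uses infini-gram-style indexing, and the names `longest_traced_span` and `average_longest_span` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def longest_traced_span(
    generated: List[str], corpus: Dict[str, List[str]]
) -> Tuple[int, Optional[str]]:
    """Return (length, doc_id) of the longest contiguous token run in
    `generated` that also appears verbatim in some corpus document."""
    best_len, best_doc = 0, None
    for doc_id, doc in corpus.items():
        # Longest-common-substring DP over tokens, keeping one rolling row.
        prev = [0] * (len(doc) + 1)
        for g_tok in generated:
            cur = [0] * (len(doc) + 1)
            for j, d_tok in enumerate(doc, start=1):
                if g_tok == d_tok:
                    cur[j] = prev[j - 1] + 1
                    if cur[j] > best_len:
                        best_len, best_doc = cur[j], doc_id
            prev = cur
    return best_len, best_doc


def average_longest_span(
    generations: List[List[str]], corpus: Dict[str, List[str]]
) -> float:
    """Mean of the longest traced span over a batch of generations."""
    if not generations:
        return 0.0
    spans = [longest_traced_span(g, corpus)[0] for g in generations]
    return sum(spans) / len(spans)


# Toy corpus and generations, invented for illustration only.
corpus = {
    "doc_a": "the quick brown fox jumps".split(),
    "doc_b": "lazy dog sleeps all day".split(),
}
generations = [
    "a quick brown fox ran".split(),
    "nothing here overlaps".split(),
]
span, source = longest_traced_span(generations[0], corpus)
avg = average_longest_span(generations, corpus)
```

The quadratic scan is only workable on toy data. The point is the output shape: every span comes back with the document it matched, which is what separates attribution from a similarity score.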

The numbers show why the capability/propensity split matters. For Comma on Common Pile, prefix attacks yield an average longest span of 50.35 tokens, against 27.95 for generic prompts and 29.47 for specific prompts. Near-verbatim recall is 0.0321 under prefix prompting, compared with 0.0058 for specific prompts and 0.0013 for generic prompts. For DFM Decoder on Dynaword, near-verbatim recall reaches 0.0363 under prefix prompting versus roughly 0.0010 for generic prompts and 0.0007 for specific prompts: a gap of about 36× over generic prompts and about 52× over specific ones.
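The size of the gap follows directly from the reported near-verbatim recall figures. The short sketch below does the division. The numbers are the ones quoted above, while the dictionary layout and the `gaps` name are just illustrative choices.

```python
# Near-verbatim recall as quoted above; the DFM Decoder ordinary-use
# values are approximate ("roughly") in the source.
nvr = {
    ("Comma", "Common Pile"): {
        "prefix": 0.0321, "generic": 0.0013, "specific": 0.0058,
    },
    ("DFM Decoder", "Dynaword"): {
        "prefix": 0.0363, "generic": 0.0010, "specific": 0.0007,
    },
}

# Ratio of prefix-attack recall to recall under each ordinary prompt type.
gaps = {
    setting: {
        kind: values["prefix"] / values[kind]
        for kind in ("generic", "specific")
    }
    for setting, values in nvr.items()
}

for (model, corpus_name), ratios in gaps.items():
    print(
        f"{model} on {corpus_name}: "
        f"{ratios['generic']:.1f}x vs generic, "
        f"{ratios['specific']:.1f}x vs specific"
    )
```

Because the ordinary-use rates sit so close to zero, small changes in them move the ratio a lot, so the multiplier is best read as an order-of-magnitude signal rather than a precise figure.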

That gap is the story. A model may be extractable under attacker-shaped conditions without routinely leaking training text in normal use. But that does not let vendors declare victory. Prefix extraction is still a real capability. If a model can be induced to reproduce training data, an adversary will not politely use the same prompts as your product onboarding flow. Low propensity is not a replacement for red-team testing. It is a second column on the same risk dashboard.

The reverse is also true. Worst-case extraction numbers should not be marketed as if they describe ordinary product behavior. If a benchmark says a model can leak a passage when handed a long prefix from that passage, that is important, but it is not the same as saying users will encounter that passage during normal use. Risk management needs both measurements: the red-team ceiling and the everyday tendency.

PropMe’s propensity scores remain low overall in the reported settings. For DFM Decoder on Dynaword, PM_NVR (the propensity-transformed near-verbatim recall) is 0.0263 for generic prompts and 0.0182 for specific prompts; PM_FMR (the propensity-transformed full-match ratio) reaches at most 0.125 under specific prompts. The paper also reports a continual-pretraining finding: DFM Decoder, trained further from Comma with a mixture emphasizing Dynaword, shows reduced Common Pile memorization. On Common Pile, Comma's average longest span under prefix prompting is 50.35 tokens against 40.83 for DFM Decoder, and Comma is the only one of the two with non-zero Common Pile full-generation memorization in the paper's table.

If you trained on it, index it

The practical lesson is almost embarrassingly direct: if you trained or fine-tuned on a known corpus, build an attribution index for it. Generate under multiple prompt distributions. Trace outputs back to source documents. Track span length, full matches, near-verbatim recall, and how those metrics move across model versions. If your safety report cannot say which training document a suspicious output matched, it is not really a memorization audit. It is a suspicion with charts.
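One way to keep that tracking honest is to store one record per traced output and summarize by model version and prompt distribution. The sketch below assumes a hypothetical `TraceRecord` schema. It is not SimpleTrace's output format, and the sample records are invented.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class TraceRecord:
    """Hypothetical per-output audit record (not a SimpleTrace schema)."""
    model_version: str
    prompt_distribution: str       # e.g. "generic", "specific", "prefix"
    longest_span: int              # tokens in the longest traced span
    source_doc_id: Optional[str]   # None when no source was attributed


def summarize(
    records: List[TraceRecord],
) -> Dict[Tuple[str, str], Dict[str, float]]:
    """Group by (model version, prompt distribution) and report the
    sample count, mean longest span, and share of attributed outputs."""
    groups: Dict[Tuple[str, str], List[TraceRecord]] = defaultdict(list)
    for rec in records:
        groups[(rec.model_version, rec.prompt_distribution)].append(rec)
    return {
        key: {
            "n": len(recs),
            "avg_longest_span": mean(r.longest_span for r in recs),
            "attributed_share": sum(
                r.source_doc_id is not None for r in recs
            ) / len(recs),
        }
        for key, recs in groups.items()
    }


# Invented records for illustration only.
records = [
    TraceRecord("v1", "prefix", 40, "doc_17"),
    TraceRecord("v1", "prefix", 60, "doc_03"),
    TraceRecord("v1", "generic", 10, None),
    TraceRecord("v1", "generic", 20, "doc_17"),
]
summary = summarize(records)
```

Keeping `source_doc_id` on every record is the design choice that matters: any row in the summary can be traced back to the specific documents behind it.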

That applies beyond foundation-model labs. Enterprises fine-tuning models on internal docs, support tickets, repositories, incident reports, or customer data should care. Coding-agent vendors should care even more. Repository-specific models, adapters, caches, and agent memories may improve usefulness, but they also create more places where private code can be retained and later surfaced. If a model suggests a test assertion copied from a private customer repo, the relevant question is not only “did it memorize?” It is “from where, under what prompt conditions, how often, and can we reproduce the trace?”

SimpleTrace’s validation results are encouraging: the paper reports scores of 0.99 for document retrieval and exact match on sampled Common Pile queries, and 1.00 on sampled Dynaword queries. The tool supports JSONL inputs, generation JSON, traced-span JSONL, summaries, span-length distributions, and a mixed tracing mode for code, markup, equations, and structured text. That is the right shape for audits that have to work on the messy material models actually train on.

The obvious failure mode is prompt distribution theater. Propensity depends on the prompts you choose. A vendor can select tame prompts, report low ordinary-use leakage, and quietly ignore sharper but still realistic user behavior. The useful version of PropMe requires transparent prompt suites, domain-specific scenarios, and adversarial checks alongside ordinary-use sampling. A customer-support model, a code assistant, and a legal drafting model should not share the same “normal prompt” distribution just because it is convenient for a benchmark.
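A toy calculation shows how much the choice of prompt mix can move a headline number. The sketch below uses a plain weighted average of per-scenario leak rates. That is an illustration only, not the paper's PM_* transform, and every scenario name, rate, and weight in it is hypothetical.

```python
from typing import Dict


def mixture_rate(rates: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average leak rate over prompt scenarios.

    Illustrative only: a plain weighted mean, not PropMe's metric.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(rates[name] * w for name, w in weights.items()) / total


# Hypothetical per-scenario leak rates for a single model.
rates = {"faq": 0.001, "quote_request": 0.02, "code_completion": 0.05}

# Two hypothetical "normal use" mixes over the same scenarios.
tame_mix = {"faq": 0.90, "quote_request": 0.05, "code_completion": 0.05}
realistic_mix = {"faq": 0.40, "quote_request": 0.20, "code_completion": 0.40}

tame = mixture_rate(rates, tame_mix)
realistic = mixture_rate(rates, realistic_mix)
```

The model and its per-scenario behavior are identical in both runs. Only the weights differ, which is why a reported propensity figure means little without the prompt suite published next to it.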

The industry has spent too much time arguing whether memorization is catastrophic or overblown. PropMe suggests a better question: which kind of memorization are we measuring? Capability tells you what an attacker might force. Propensity tells you what the product tends to do. Attribution tells you whether the suspicious text actually came from the corpus. Put all three together and you get something closer to engineering. One scary number or one reassuring number is not enough.

Sources: arXiv, PropMe GitHub repository, infini-gram, OLMoTrace context