RL Translation Paper Says Models Need to Learn How to Use the Grammar Book, Not Memorize the Language


The most interesting part of this low-resource translation paper is not that reinforcement learning beats supervised fine-tuning in one table. It is the shape of the failure it exposes: supervised fine-tuning can make a model look better on the languages it has already seen while making it worse at the thing builders actually wanted, which is using new evidence at inference time.

That is a much broader lesson than translation. It is the same bug that shows up when a support bot ignores the policy document you retrieved, when a coding agent steamrolls a repository convention because it has seen a thousand similar projects, or when a RAG system confidently answers from its training prior instead of the supplied context. The model is not failing because the context window is too small. It is failing because training taught it that the safest path is pattern completion, not evidence use.

The paper, “Reinforcement Learning Elicits Contextual Learning of Unseen Language Translation”, attacks that problem in an unusually clean setting. The authors train Qwen3-4B-Base and Llama-3.2-3B-Instruct for extremely low-resource translation, conditioning the models on rich linguistic context: a language introduction, a task instruction, two dictionary entries per source token, three or five parallel examples, two grammar passages, and an instruction asking for step-by-step meta-linguistic reasoning. The reward is not elaborate. It is chrF, a surface-level character n-gram translation metric, rescaled to the [0,1] range and optimized with GRPO (Group Relative Policy Optimization).

That simplicity is the point. If a lightweight outcome reward can push a model toward better use of dictionaries, examples, and grammar passages, then outcome-based RL is not only a math-and-code trick. It may be a way to train the meta-skill of exploiting context.
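To make the reward concrete, here is a minimal Python sketch of a chrF-style score in [0, 1] and the group-relative normalization that GRPO applies to rewards sampled for the same prompt. It is an illustration under stated assumptions, not the authors' code: `chrf_reward` is a simplified character n-gram F-score (whitespace stripped, n-gram orders 1 to 6, beta = 2, no word n-grams), so its values will not exactly match a reference chrF implementation, and both function names are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import List


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring whitespace (a simplification)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf_reward(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF in [0, 1]: F-beta over averaged n-gram precision and recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p, r = mean(precisions), mean(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: standardize rewards within one prompt's sample group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In this sketch, a sampled translation is pushed up or down only relative to the other samples drawn for the same source sentence and context, which is why a cheap scalar like chrF is enough to drive the update.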

The useful result is not “RL beats SFT”

The dataset spans 18 languages, 26 translation directions, 10 language families, 23,587 training pairs, and 2,699 test pairs. The training set covers Romansh varieties plus seven other low-resource languages. Evaluation separates the comforting case (languages similar to those seen during training) from the uncomfortable one: unrelated, unseen language families.

On seen Romansh-to-German translation with full context, supervised fine-tuning wins. For Qwen, SFT scores 0.60 chrF while RL scores 0.52. That is the expected result: if the test distribution resembles the training distribution, cross-entropy against reference outputs is a strong way to internalize useful regularities.

But on five unrelated unseen languages, the result flips. Qwen RL averages 0.27 chrF, compared with 0.09 for SFT and 0.18 for the base model. Llama shows the same pattern: RL reaches 0.24 on unseen languages, while SFT lands at 0.09 and the base model at 0.14. In both cases SFT does not merely trail RL; it falls below the untuned base model. SFT made the model better at what it had already learned and worse at using new linguistic evidence when the distribution changed.

That distinction matters. The practical question is not whether RL has a higher aggregate number than SFT. The question is what behavior training elicits. SFT says, “produce the reference answer for this kind of input.” RL says, “find a strategy that gets rewarded under the information available now.” In a task where the model must use a grammar book and dictionary it has never seen before, those are very different instructions.

Context is the product surface

The ablations are where the paper becomes useful for people building systems rather than reading leaderboards. Qwen RL drops from 0.52 to 0.30 on seen languages when retrieval context is removed, a loss of 0.22. Qwen SFT also drops, from 0.60 to 0.46, but loses only 0.14, which suggests the RL model depends more heavily on the supplied material. The key finding is that RL’s unseen-language advantage appears only when context is available. RL did not magically memorize all possible languages. It learned to lean on the supplied linguistic material.

The kind of context also matters. Removing the bilingual dictionary drops seen Romansh from 0.5324 to 0.4483 and English-to-Kalamang from 0.3464 to 0.2626. Removing parallel examples costs roughly 0.07 chrF on English-to-Kalamang. Removing grammar barely moves seen Romansh, from 0.5324 to 0.5249.
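The size of these ablation effects is easier to compare as a share of the full-context score. The short Python sketch below does that arithmetic using only the exact figures quoted above (the parallel-example ablation is omitted because only an approximate figure is given); the helper name `ablation_drop` is hypothetical.

```python
# Full-context vs. ablated chrF, as quoted in the paragraph above.
REPORTED = [
    ("seen Romansh", "dictionary", 0.5324, 0.4483),
    ("seen Romansh", "grammar", 0.5324, 0.5249),
    ("English-to-Kalamang", "dictionary", 0.3464, 0.2626),
]


def ablation_drop(full: float, ablated: float):
    """Return (absolute drop, drop as a fraction of the full-context score)."""
    absolute = full - ablated
    return absolute, absolute / full


for direction, removed, full, ablated in REPORTED:
    absolute, relative = ablation_drop(full, ablated)
    print(f"{direction:<20} without {removed:<10} "
          f"drop={absolute:.4f} ({relative:.1%} of full-context score)")
```

By this arithmetic, losing the dictionary costs about 16% of the full-context score on seen Romansh and about 24% on English-to-Kalamang, while losing the grammar passages costs about 1.4% on seen Romansh.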

That is an awkward result for anyone whose “domain adaptation” strategy is pasting a long manual into the prompt and hoping attention does the rest. Grammar prose may be valuable to humans, but in this setup the model benefits more from structured, task-adjacent artifacts: dictionaries and examples. The analogy for software teams is direct. API contracts, validated snippets, schemas, dependency graphs, and worked examples are usually more machine-usable than a wiki page written for onboarding humans. If you want models to use context, package the context like an interface, not like a PDF.
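One way to read "package the context like an interface" is to keep each context type as a typed field and render it into labeled sections, so that any one type can be dropped for an ablation. The Python sketch below is an assumption-laden illustration: the class `TranslationContext`, the function `render_prompt`, the section labels, and the placeholder tokens are all hypothetical and are not the paper's actual prompt format.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Tuple


@dataclass
class TranslationContext:
    language_intro: str
    instruction: str
    dictionary: List[Tuple[str, str]] = field(default_factory=list)         # (source token, gloss)
    parallel_examples: List[Tuple[str, str]] = field(default_factory=list)  # (source, target)
    grammar_passages: List[str] = field(default_factory=list)


def render_prompt(ctx: TranslationContext, source_sentence: str,
                  drop: FrozenSet[str] = frozenset()) -> str:
    """Render labeled sections; lowercase section names in `drop` are omitted."""
    sections = [
        ("Language", [ctx.language_intro]),
        ("Task", [ctx.instruction]),
        ("Dictionary", [f"{src} = {gloss}" for src, gloss in ctx.dictionary]),
        ("Examples", [f"{src} => {tgt}" for src, tgt in ctx.parallel_examples]),
        ("Grammar", list(ctx.grammar_passages)),
    ]
    parts = []
    for title, lines in sections:
        if title.lower() in drop or not lines:
            continue  # ablated or empty section
        parts.append(f"{title}:\n" + "\n".join(lines))
    parts.append(f"Source:\n{source_sentence}")
    return "\n\n".join(parts)


# Placeholder content only; no real language data is implied.
ctx = TranslationContext(
    language_intro="<one-paragraph language introduction>",
    instruction="Translate the source sentence, reasoning step by step.",
    dictionary=[("tok_a", "gloss_a"), ("tok_b", "gloss_b")],
    parallel_examples=[("tok_a tok_b", "gloss_a gloss_b")],
    grammar_passages=["<grammar passage>"],
)
full_prompt = render_prompt(ctx, "tok_b tok_a")
no_dictionary = render_prompt(ctx, "tok_b tok_a", drop=frozenset({"dictionary"}))
```

Keeping the context structured until the last moment means the same object can feed the production prompt and the "remove one context type" evaluation without hand-editing strings.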

There is also an ethical and operational angle here. Extremely low-resource languages will not all get the parallel corpora needed for conventional fine-tuning economics. Some communities may have small dictionaries, grammar notes, field linguistics material, or a handful of curated examples. A method that trains models to use that material at inference time is more plausible than pretending every language can be turned into a high-resource benchmark.

But that same point raises the deployment bar. Production translation for endangered, Indigenous, or under-documented languages cannot be evaluated only with chrF. It needs native-speaker review, dialect sensitivity, consent around data use, and an explicit plan for hallucinated morphology. A model that produces fluent-looking wrong translations can do real damage, especially when language work intersects with education, legal access, health care, or cultural preservation.

What engineers should do with this

The immediate action is not to replace every SFT pipeline with RL. It is to ask whether your adaptation task is actually a context-use task. If the model will see the same distribution at inference time that it saw during training, SFT may be the right tool. If the model must use fresh evidence (a customer’s private docs, a changing codebase, a newly retrieved policy, an unfamiliar schema, or a low-resource language grammar), then measuring performance only on familiar held-out examples is probably lying to you.

Teams should add evaluations that deliberately separate memorized familiarity from contextual generalization. Hold out entire domains, repositories, languages, API families, or document styles. Test with and without retrieval context. Ablate the context types. If the model barely changes when the evidence is removed, congratulations: you built a very expensive autocomplete system with citations.
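An evaluation along these lines can be sketched in a few lines of Python. The harness below scores every example twice, with and without its context, and reports the gap per split. Everything here is hypothetical scaffolding: `generate` stands in for whatever model call you use, `score` for your metric, and the toy "memorizer" model exists only to show what a context-insensitive system looks like.

```python
from statistics import mean
from typing import Callable, Dict, Iterable, List, Tuple

# (source, context, reference, split), where split is e.g. "seen" or "unseen"
Example = Tuple[str, str, str, str]


def context_sensitivity(
    examples: Iterable[Example],
    generate: Callable[[str, str], str],
    score: Callable[[str, str], float],
) -> Dict[str, Dict[str, float]]:
    """Mean score per split with and without context, plus the gap between them."""
    buckets: Dict[Tuple[str, str], List[float]] = {}
    for source, context, reference, split in examples:
        for condition, ctx in (("with_context", context), ("no_context", "")):
            output = generate(source, ctx)
            buckets.setdefault((split, condition), []).append(score(output, reference))
    report: Dict[str, Dict[str, float]] = {}
    for split in sorted({s for s, _ in buckets}):
        with_ctx = mean(buckets[(split, "with_context")])
        without = mean(buckets[(split, "no_context")])
        report[split] = {"with_context": with_ctx, "no_context": without,
                         "delta": with_ctx - without}
    return report


# Toy stand-ins: a "model" that answers from a memorized table and ignores context.
memorized = {"src_seen": "ref_seen"}


def memorizer(source: str, context: str) -> str:
    return memorized.get(source, "")


def exact(output: str, reference: str) -> float:
    return float(output == reference)


examples = [("src_seen", "ctx", "ref_seen", "seen"),
            ("src_new", "ctx", "ref_new", "unseen")]
report = context_sensitivity(examples, memorizer, exact)
```

For the toy memorizer, the delta is zero on both splits: a perfect score on the seen split that does not move when context is removed, and a zero score on the unseen split that context cannot rescue. That combination is the signature to look for in a real system.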

The paper also argues for training objectives that reward grounded outcomes, not just reference imitation. In translation, chrF gives a cheap scalar reward. In software, that might be tests passing, typechecks succeeding, linters staying clean, or a patch applying correctly against the actual repository. In support, it might be policy-compliant resolution under adversarially varied documents. The broader pattern is to reward the system for using the environment, not for sounding like prior examples.
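For the software case, an outcome reward can be as simple as "did the checks pass against the real repository". The Python sketch below is one hedged way to express that, not a recipe from the paper: `command_reward` and `weighted_reward` are hypothetical helpers, the weights are arbitrary, and in practice the commands would be your own test runner, type checker, or linter rather than the stand-in commands shown here.

```python
import subprocess
import sys
from typing import Optional, Sequence, Tuple


def command_reward(cmd: Sequence[str], cwd: Optional[str] = None,
                   timeout_s: float = 60.0) -> float:
    """Return 1.0 if the command exits with status 0, else 0.0."""
    try:
        result = subprocess.run(list(cmd), cwd=cwd, capture_output=True,
                                timeout=timeout_s)
    except (subprocess.TimeoutExpired, OSError):
        return 0.0  # a hang or a missing executable counts as failure
    return 1.0 if result.returncode == 0 else 0.0


def weighted_reward(checks: Sequence[Tuple[float, Sequence[str]]],
                    cwd: Optional[str] = None) -> float:
    """Weighted fraction of passing checks, in [0, 1]."""
    total = sum(weight for weight, _ in checks)
    if total <= 0:
        return 0.0
    passed = sum(weight * command_reward(cmd, cwd) for weight, cmd in checks)
    return passed / total


# Stand-in commands: one that succeeds and one that fails.
demo_checks = [
    (0.75, [sys.executable, "-c", "pass"]),
    (0.25, [sys.executable, "-c", "raise SystemExit(1)"]),
]
demo_reward = weighted_reward(demo_checks)
```

The design choice mirrors the chrF reward in the paper's setting: the scalar comes from the environment the output has to work in, not from similarity to a previously written answer.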

The limitation is real. This is a controlled research setup with small backbones, specific prompts, and a task where automatic reference scoring exists. It does not prove that RL will fix every RAG or agent system that ignores context. It does, however, give a crisp diagnosis: SFT can overfit the answer distribution and under-train the evidence-use behavior. That is a failure mode practitioners can test for today.

The best read on this paper is that context windows are not enough. Retrieval is not enough. The model has to be trained, evaluated, and rewarded for treating context as authoritative working material. Otherwise, it will keep doing what language models do by default: complete the pattern it already knows and call that reasoning.

Sources: arXiv, code repository, MTOB benchmark context, MT-R1-Zero context