An 86M-Parameter Arithmetic Model Is a Reminder That Training Data Shape Still Matters

There is a persistent bad habit in AI commentary: when a model fails at reasoning, someone immediately reaches for scale. Bigger model. More parameters. Larger context. More inference-time thinking. Sometimes that is the right answer. Often it is just the most expensive answer. A new arXiv paper on arithmetic pedagogy makes the opposite argument in a narrow but useful way: for some procedural skills, the shape of the training data matters as much as the size of the model.

Arithmetic Pedagogy for Language Models trains an 86M-parameter GPT-2-style decoder from scratch on arithmetic traces generated from the Indonesian GASING pedagogy. The model uses a tiny TOBA tokenizer with 284 tokens covering Indonesian syllabic units, numeric symbols, arithmetic notation, and formatting/control tokens. It trains with ordinary autoregressive next-token prediction over serialized chain-of-thought traces. No reinforcement learning. No reward model. No frontier-model budget hiding behind a neat figure.

The reported result is intentionally narrow: over 80% accuracy on 10,000 held-out arithmetic problems excluded from training. The training set contains 1,001,000 unique arithmetic problems, which the authors say is under 5% of the possible sample space. The architecture is plain enough to be almost provocative in 2026: 12 layers, hidden size 768, 12 attention heads of dimension 64, MLP hidden size 3072, and a maximum sequence length of 1280.
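The 86M figure can be sanity-checked from the listed dimensions. The sketch below assumes a standard GPT-2 layout: biases on every linear layer, two LayerNorms per block plus a final one, learned position embeddings, and an output head tied to the token embedding table. The paper may differ on any of these, so treat the result as a plausibility check rather than the authors' exact count.

```python
def gpt2_param_count(n_layer, d_model, d_ff, vocab_size, n_ctx):
    """Count parameters for a GPT-2-style decoder under the assumed standard layout."""
    attn = 3 * (d_model * d_model + d_model)   # Q, K, V projections with biases
    attn += d_model * d_model + d_model        # attention output projection
    mlp = d_model * d_ff + d_ff                # MLP up-projection
    mlp += d_ff * d_model + d_model            # MLP down-projection
    norms = 2 * (2 * d_model)                  # two LayerNorms per block (gain + bias)
    per_block = attn + mlp + norms
    # Token table plus learned position table; the output head is assumed tied.
    embeddings = vocab_size * d_model + n_ctx * d_model
    final_norm = 2 * d_model
    return n_layer * per_block + embeddings + final_norm


total = gpt2_param_count(n_layer=12, d_model=768, d_ff=3072, vocab_size=284, n_ctx=1280)
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the count lands at roughly 86.3M, consistent with the reported size. The head count does not enter the total, since 12 heads of dimension 64 simply partition the 768-wide projections. Almost all of the budget sits in the transformer blocks; the 284-token vocabulary contributes only about 0.2M parameters.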

This is not a claim that an 86M model is suddenly a general reasoning engine. Good. The industry has enough exaggerated small-model victory laps. The useful claim is more specific: when the skill is narrow, procedural, and verifiable, curriculum and representation can do real work that people too often assign to parameter count.

Training traces are not just prettier prompts

The GASING angle matters because the paper is not merely prompting a model to “think step by step.” It is training the model on structured procedural traces aligned with left-to-right generation. That is a different intervention. Inference-time chain-of-thought asks a pretrained model to express reasoning. Training-time procedural traces teach the model which intermediate states belong in the computation and in what order they should appear.

That distinction is easy to miss and important for builders. Transformers do not execute algorithms because we named the dataset “reasoning.” They learn token transitions under a training objective. If the supervision sequence makes the computation causally legible (input, intermediate state, operation, next state, answer), the model gets a cleaner learning problem. If the dataset only contains inputs and final answers, the model may still learn shortcuts, memorized patterns, or brittle correlations.
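To make the contrast concrete, here is a minimal sketch of the two supervision styles for addition. The trace format is hypothetical and is not the GASING serialization used in the paper; it only illustrates emitting each intermediate state in the order a left-to-right decoder would generate it. Both function names are made up for this example, and inputs are assumed to be non-negative integers.

```python
def answer_only(a: int, b: int) -> str:
    """Supervision with no intermediate states: just the input and the result."""
    return f"{a} + {b} = {a + b}"


def left_to_right_trace(a: int, b: int) -> str:
    """Serialize addition from the most significant place to the least,
    writing every partial sum and the running total as explicit lines."""
    width = max(len(str(a)), len(str(b)))
    digits_a = str(a).zfill(width)
    digits_b = str(b).zfill(width)
    lines = [f"problem: {a} + {b}"]
    running = 0
    for i, (x, y) in enumerate(zip(digits_a, digits_b)):
        place = 10 ** (width - 1 - i)
        partial = (int(x) + int(y)) * place
        updated = running + partial
        lines.append(
            f"place {place}: {int(x) * place} + {int(y) * place} = {partial}; "
            f"running {running} + {partial} = {updated}"
        )
        running = updated
    lines.append(f"answer: {running}")
    return "\n".join(lines)


print(answer_only(47, 38))
print(left_to_right_trace(47, 38))
```

In the trace version, every token the model must predict is conditioned on an explicit earlier state, so the next-token objective rewards following the procedure rather than guessing the final number.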

The authors also include mechanistic analysis: attention-masking interventions on the chain-of-thought information graph, residual-stream probing, and logit-lens inspection. They describe three learning phases: surface pattern acquisition, procedural pathway internalization, and later associative “mental arithmetic” retrieval. The last phrase is easy to overread, so do not. It does not mean the model becomes a tiny mathematician. It means the model appears to move from explicitly following serialized procedures toward more direct retrieval-like behavior for some arithmetic patterns after enough exposure.
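For readers unfamiliar with logit-lens inspection, the idea is to take the residual-stream vector at an intermediate layer, apply the final normalization, and project it through the output embedding to see which token the model would predict if it stopped there. The sketch below is a pure-Python toy with made-up vectors and an identity unembedding matrix; it omits the learned LayerNorm gain and bias and says nothing about the paper's actual measurements.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (learned gain and bias omitted)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def logit_lens(hidden_by_layer, unembed, vocab):
    """For each layer's residual vector, return the token the output head would pick."""
    readout = []
    for hidden in hidden_by_layer:
        normed = layer_norm(hidden)
        logits = [sum(w * h for w, h in zip(row, normed)) for row in unembed]
        best = max(range(len(logits)), key=logits.__getitem__)
        readout.append(vocab[best])
    return readout


# Toy values, purely illustrative: three "layers" of a 3-dimensional residual stream.
TOY_VOCAB = ["7", "8", "9"]
TOY_UNEMBED = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
TOY_HIDDEN = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.0, 0.1, 2.0]]

print(logit_lens(TOY_HIDDEN, TOY_UNEMBED, TOY_VOCAB))  # ['7', '8', '9']
```

In a real model, tracking at which layer the correct digit first appears in this readout is one way to tell an explicitly computed answer from one that is retrieved early.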

That progression is exactly what you would want from a narrow domain model. First imitate the format. Then internalize the procedure. Then compress repeated operations when the structure is familiar. Humans do something similar when they stop counting on fingers and start recalling products, though the implementation is obviously not the same. The point is not anthropomorphism. The point is curriculum.

Small models are not dead. Sloppy data is.

For engineering teams, the lesson transfers better than the specific arithmetic setup. Many enterprise AI tasks are narrow, repetitive, and partially procedural: invoice checks, insurance rules, compliance triage, form normalization, pricing workflows, log classification, test-case generation, support escalation, and internal code-review policies. Teams often throw generic instruction tuning at these problems and then wonder why a small model behaves like a confused intern with a JSON schema.

The better pattern is to encode the procedure. Show the intermediate states. Align them with the order the model must generate or verify. Hold out combinations that test generalization, not just examples that look different on the surface. Then probe whether the model learned the pathway or memorized the formatting. If the target behavior is verifiable, use that verification aggressively. Do not ask the model whether it understood the policy. Test whether it executes the policy on edge cases it never saw.
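One way to sketch the held-out and verification steps: assign each problem to a split by hashing a canonical form of it, so that surface variants such as 3 + 5 and 5 + 3 cannot straddle the train/test boundary, then score predictions by exact match against a ground-truth computation. The function names, the 1% fraction, and the canonicalization rule are assumptions for illustration, not the paper's procedure.

```python
import hashlib

COMMUTATIVE = {"+", "*"}
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}


def canonical_key(a, b, op):
    """Collapse surface variants of the same problem (operand order for + and *)."""
    if op in COMMUTATIVE and a > b:
        a, b = b, a
    return f"{a} {op} {b}"


def is_held_out(a, b, op, fraction=0.01):
    """Deterministically place a problem in the held-out split based on a hash of its key."""
    digest = hashlib.sha256(canonical_key(a, b, op).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < fraction


def exact_match_accuracy(predict, problems):
    """Score a predictor against computed ground truth, not against its self-report."""
    if not problems:
        raise ValueError("no problems to score")
    correct = sum(predict(a, b, op) == OPS[op](a, b) for a, b, op in problems)
    return correct / len(problems)


problems = [(a, b, "+") for a in range(100) for b in range(100)]
train = [p for p in problems if not is_held_out(*p)]
held_out = [p for p in problems if is_held_out(*p)]
print(len(train), len(held_out))
```

Because the split depends only on the problem itself, regenerating the dataset later cannot silently leak held-out items into training.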

The tokenizer detail is not a footnote either. TOBA’s 284-token vocabulary is tailored to Indonesian/Austronesian syllabic structure plus numeric and arithmetic notation. Tokenization defines the primitives the model sees. In English-centric AI work, tokenizer choice is often treated as plumbing unless it breaks spectacularly. For small models, symbolic-heavy domains, and non-English workflows, it can decide whether the model spends its limited capacity learning the task or fighting the representation.
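A small closed vocabulary can be illustrated with a greedy longest-match tokenizer. The vocabulary below is a hypothetical handful of digits, operators, and Indonesian syllables; it is not the TOBA vocabulary, and greedy matching is an assumption about how such a tokenizer might segment text, not a description of the paper's method.

```python
def greedy_tokenize(text, vocab):
    """Split text into the longest vocabulary entries available at each position."""
    max_len = max(len(token) for token in vocab)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            # A closed vocabulary should fail loudly rather than silently drop input.
            raise ValueError(f"no token covers {text[i]!r} at position {i}")
    return tokens


# Hypothetical toy vocabulary: digits, a few operators, space, and six syllables.
TOY_VOCAB = set("0123456789+-=") | {" ", "sa", "tu", "du", "a", "ti", "ga"}

print(greedy_tokenize("12+7=19", TOY_VOCAB))
print(greedy_tokenize("tiga", TOY_VOCAB))
```

With one token per digit, the model sees place value directly instead of having to learn that an arbitrary multi-digit subword decomposes into digits.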

That should make teams more skeptical of generic fine-tuning recipes. If your domain has structured identifiers, chemical notation, legal citations, accounting codes, product SKUs, or non-English morphology, the default tokenizer may be a tax on every example. Larger models can sometimes brute-force their way through bad representation. Smaller models usually cannot. If you want small-model economics, you need small-model discipline.

The limitations are real. Arithmetic is clean, synthetic, and easy to grade. Most production domains are not. Real workflows have noisy labels, ambiguous requirements, changing policies, and examples that are wrong because a human was tired at 4:53 p.m. An 86M arithmetic model does not prove that every enterprise should train a tiny specialist from scratch. It does prove that scale is not the only lever, and in narrow domains it may not even be the first lever to pull.

There is also a deployment lesson for local-agent systems. A lot of agent stacks rely on a general local model plus prompts and tools. That can work, but it often leaves narrow repeated subtasks under-optimized. A small specialist trained on well-shaped traces may be a better component than a larger general model prompted into pretending it knows the procedure. Use the big model for ambiguity and synthesis. Use the small model for cheap, repeated, verifiable operations. Route between them like an engineer, not like a model leaderboard curator.
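The routing idea can be sketched as a small dispatcher: try the specialist only on inputs that match its narrow domain, accept its answer only if a cheap verifier agrees, and otherwise fall back to the general model. Every name here is hypothetical, and the two model functions are stubs standing in for real inference calls. Addition is used because it is trivially checkable; in a real workflow the verifier would be a schema check, a rule engine, or a test suite.

```python
import re
from typing import Callable, Tuple

ADDITION = re.compile(r"\s*(\d+)\s*\+\s*(\d+)\s*")


def route(
    query: str,
    is_narrow: Callable[[str], bool],
    specialist: Callable[[str], str],
    verify: Callable[[str, str], bool],
    generalist: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (which model answered, answer), preferring a verified specialist answer."""
    if is_narrow(query):
        candidate = specialist(query)
        if verify(query, candidate):
            return "specialist", candidate
    return "generalist", generalist(query)


def is_addition(query: str) -> bool:
    return ADDITION.fullmatch(query) is not None


def check_sum(query: str, answer: str) -> bool:
    a, b = map(int, ADDITION.fullmatch(query).groups())
    return answer.strip() == str(a + b)


def toy_specialist(query: str) -> str:
    """Stub: pretends to be a small model that fails outside its trained range."""
    a, b = map(int, ADDITION.fullmatch(query).groups())
    return str(a + b) if a < 1000 and b < 1000 else "?"


def toy_generalist(query: str) -> str:
    """Stub for a larger general-purpose model."""
    return f"[general model handles: {query}]"


for q in ["47 + 38", "5000 + 1", "summarize this ticket"]:
    print(route(q, is_addition, toy_specialist, check_sum, toy_generalist))
```

The verifier is what makes the cheap path safe: a wrong specialist answer costs one extra call to the general model rather than a silent error.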

The industry’s default reflex is to celebrate scale because scale is easy to narrate. This paper is a reminder that training data has architecture too. The order of steps, the representation of tokens, the curriculum, the held-out split, and the verification loop all shape what a model can learn. For narrow reasoning skills, that may be the difference between “we need a bigger model” and “we need to teach the smaller one properly.”

That is the durable take: small models are not a nostalgia act. They are an engineering choice. But they only work when the data is designed with the same care people usually reserve for the model card.

Sources: arXiv, chain-of-thought prompting background, emergent abilities debate, small-model capability elicitation