The Goblin Ban in OpenAI's Codex Is a Window Into How Production AI Coding Agents Actually Work

Anatoliy Kolodkin

02 May 2026 • 4 min read

OpenAI published an unusually candid explainer on Thursday titled "Where the Goblins Came From." The post started as a joke on social media and turned into the most concrete public accounting the company has given of how reward signals in reinforcement learning can produce behavior that spreads beyond the contexts that originally trained it. The short version: the reward signal behind a "Nerdy" personality option in ChatGPT was particularly generous toward creature metaphors. Those metaphors then transferred. Then amplified. Then had to be explicitly banned in Codex's base instructions, with the ban stated twice, because GPT-5.5 was already in training before anyone understood what was happening.

The post is worth reading in full. But the practical significance is not the goblins — it's what the goblin saga reveals about how instruction-level prompting actually works in production AI systems, and what that means for anyone building with or deploying coding agents.

The reward signal that learned to generalize

The mechanics are worth understanding precisely because they are not obvious. OpenAI's explanation is unusually detailed: the "Nerdy" personality was trained with a reward signal that scored outputs more favorably when they used playful, creature-based metaphors. That reward was applied only in the Nerdy condition, which accounted for roughly 2.5% of all ChatGPT responses. But reinforcement learning does not guarantee that learned behaviors stay neatly scoped to the condition that produced them.

The evidence for this is stark. After GPT-5.1 launched, "goblin" mentions in ChatGPT responses rose 175%. "Gremlin" rose 52%. But Nerdy — the personality that triggered the reward signal — accounted for only 2.5% of responses yet 66.7% of all goblin mentions. The behavior was highly concentrated in the part of the system optimized for it. Then it wasn't.
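To make the concentration concrete, here is a small arithmetic sketch using only the two figures quoted above (2.5% of responses, 66.7% of goblin mentions). The function names are illustrative, and the second calculation assumes that a "share of mentions" can be treated as a share of a simple mention count, which the article does not spell out.

```python
# Figures quoted in the article; everything else here is illustrative.
NERDY_SHARE_OF_RESPONSES = 0.025  # Nerdy: roughly 2.5% of all ChatGPT responses
NERDY_SHARE_OF_MENTIONS = 0.667   # Nerdy: 66.7% of all "goblin" mentions


def overrepresentation(share_of_mentions: float, share_of_responses: float) -> float:
    """How many times more mentions a segment produces than its traffic share predicts."""
    return share_of_mentions / share_of_responses


def relative_rate(share_of_mentions: float, share_of_responses: float) -> float:
    """Per-response mention rate inside the segment divided by the rate outside it.

    Assumes mentions are plain counts, so shares of mentions and shares of
    responses can be divided directly.
    """
    inside = share_of_mentions / share_of_responses
    outside = (1.0 - share_of_mentions) / (1.0 - share_of_responses)
    return inside / outside


nerdy_lift = overrepresentation(NERDY_SHARE_OF_MENTIONS, NERDY_SHARE_OF_RESPONSES)
nerdy_vs_rest = relative_rate(NERDY_SHARE_OF_MENTIONS, NERDY_SHARE_OF_RESPONSES)

print(f"Nerdy produced {nerdy_lift:.1f}x its traffic share of goblin mentions")
print(f"Per response, Nerdy used the word roughly {nerdy_vs_rest:.0f}x as often as everything else")
```

Under those assumptions, Nerdy produced roughly 27 times its traffic share of goblin mentions, and a Nerdy response was on the order of 78 times as likely to contain one as a non-Nerdy response.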

OpenAI's audit found that the Nerdy reward signal scored outputs containing "goblin" or "gremlin" higher than outputs without those words in 76.2% of datasets tested. That created a feedback loop: playful style rewarded, creature tics appearing in rewarded examples, those examples appearing more often in rollouts, model-generated rollouts fed into supervised fine-tuning, the tic becoming more comfortable in the model generally. This is not a malfunction. This is RLHF doing exactly what it is designed to do — finding and amplifying patterns that produce favorable scores — with an unintended side effect.
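The kind of check described in that audit can be sketched in a few lines. This is not OpenAI's audit code, which the explainer does not publish. It is a minimal illustration that assumes a hypothetical `reward_fn` mapping an output string to a score, and datasets given as plain lists of output strings. `toy_reward` is a deliberately biased stand-in, not a real reward model.

```python
from statistics import mean
from typing import Callable, Dict, List, Optional

CREATURE_WORDS = ("goblin", "gremlin")  # the two words named in the article


def mentions_creature(text: str) -> bool:
    lowered = text.lower()
    return any(word in lowered for word in CREATURE_WORDS)


def dataset_favors_creatures(
    outputs: List[str], reward_fn: Callable[[str], float]
) -> Optional[bool]:
    """True if creature outputs score higher on average; None if no comparison is possible."""
    with_creature = [reward_fn(o) for o in outputs if mentions_creature(o)]
    without = [reward_fn(o) for o in outputs if not mentions_creature(o)]
    if not with_creature or not without:
        return None
    return mean(with_creature) > mean(without)


def fraction_favoring_creatures(
    datasets: Dict[str, List[str]], reward_fn: Callable[[str], float]
) -> float:
    """Share of comparable datasets in which the reward prefers creature outputs."""
    verdicts = [dataset_favors_creatures(o, reward_fn) for o in datasets.values()]
    decided = [v for v in verdicts if v is not None]
    if not decided:
        raise ValueError("no dataset contained both kinds of output")
    return sum(decided) / len(decided)


def toy_reward(text: str) -> float:
    """Hypothetical biased reward: likes exclamation marks and creature words."""
    return float(text.count("!")) + (1.0 if mentions_creature(text) else 0.0)


toy_datasets = {
    "debugging": ["A goblin ate your semicolon!", "You are missing a semicolon."],
    "math": ["Sign error gremlin spotted.", "Flip the sign!!! It works!!!"],
    "cooking": ["Simmer gently.", "Stir often."],  # no creature output, so skipped
    "travel": ["Pack light, goblin-style.", "Pack light."],
}

fraction = fraction_favoring_creatures(toy_datasets, toy_reward)
print(f"Reward favors creature words in {fraction:.1%} of comparable datasets")
```

A per-dataset comparison like this only shows correlation between a word and the score. Attributing the preference to the word itself would need controlled pairs that differ in nothing else.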

Once GPT-5.5 training began, it was too late to stop the tic at the source: the model started training before OpenAI found the root cause. The "Nerdy" personality was retired in March, the problematic reward signal was removed, and the training data was scrubbed of creature words. But GPT-5.5 was already in progress. When the model started appearing in Codex, OpenAI's own employees noticed the creature references immediately. The fix was to add a developer-prompt instruction, buried in the base instructions file, telling the model not to mention those creatures unless obviously relevant.

Why this matters for anyone building with coding agents

The easy reading of this story is that OpenAI's model had a funny bug and fixed it. That reading is wrong in an important way. The goblin ban is not a content filter in the conventional sense. It is not refusing to generate harmful content. It is an explicit behavioral override added to suppress a specific output pattern that emerged when the model was placed inside extended agentic sessions with additional system-level context layered on top.

The key sentence in OpenAI's own explainer: "Codex is, after all, quite nerdy." The model was being used inside a coding harness — with long-term memory, persona definitions, task-state management — and the "nerdy" creature metaphor tendency re-emerged even though the personality option had been retired. This is the part that should concern practitioners.

When you build with Claude Code, OpenAI Codex, or any other agentic coding tool, you are almost certainly layering additional instructions on top of the base model's context: AGENTS.md files, custom instructions, memory, tool definitions, persona prompts. The goblin saga is the clearest public evidence that those additional context layers can interact with the model's trained distribution in ways that produce surprising emergent behavior. It also shows that the fix OpenAI applied was not a better model. It was an explicit negative instruction in the base instructions.

This has direct implications for anyone evaluating or deploying coding agents in production. The model that writes clean, focused code in a 10-turn chat may behave differently in a 200-turn session with tool calls, memory, and system-level instructions layered on top. The creature ban is a specific example of a general phenomenon: models in agentic loops can develop output patterns that require explicit suppression, not just guidance. If you are building internal prompt libraries, persona systems, or custom instruction sets for your team's coding agent, the lesson from this saga is that you may be inadvertently creating conditions that amplify behaviors the base model vendor did not intend. You may also not discover those behaviors until they have become entrenched in your training or fine-tuning data.
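One way to catch this kind of drift early is to track how often a known tic appears per assistant turn, bucketed by session length. The sketch below assumes transcripts are available as lists of assistant-turn strings. The bucket edges, the function names, and the regular expression are illustrative choices, not part of any vendor tooling.

```python
import re
from collections import defaultdict
from typing import Dict, List, Sequence

# Matches the two creature words from the article, singular or plural.
TIC_PATTERN = re.compile(r"\b(?:goblin|gremlin)s?\b", re.IGNORECASE)


def bucket_label(turns: int, edges: Sequence[int]) -> str:
    """Name the session-length bucket that a session with `turns` turns falls into."""
    for edge in edges:
        if turns <= edge:
            return f"<= {edge} turns"
    return f"> {edges[-1]} turns"


def tic_rate_by_session_length(
    sessions: List[List[str]], edges: Sequence[int] = (10, 50, 200)
) -> Dict[str, float]:
    """Average number of tic matches per assistant turn, grouped by session length."""
    hits: Dict[str, int] = defaultdict(int)
    turns_seen: Dict[str, int] = defaultdict(int)
    for assistant_turns in sessions:
        label = bucket_label(len(assistant_turns), edges)
        turns_seen[label] += len(assistant_turns)
        hits[label] += sum(len(TIC_PATTERN.findall(t)) for t in assistant_turns)
    # Skip buckets that only contain empty sessions to avoid dividing by zero.
    return {label: hits[label] / n for label, n in turns_seen.items() if n}


short_session = ["Fixed the import.", "Tests pass."]
long_session = ["Refactoring step."] * 10 + [
    "Found the gremlin in the parser.",
    "Two goblins, actually: goblin one is the cache.",
]

rates = tic_rate_by_session_length([short_session, long_session])
for label, rate in rates.items():
    print(f"{label}: {rate:.2f} tic mentions per turn")
```

A rate that climbs with session length, or after a change to your own instruction layers, is the signal worth investigating before those transcripts are reused as training or fine-tuning data.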

The instruction layer is now a first-class engineering surface

What OpenAI's explainer also reveals is that the company now treats instruction-level configuration as a normal, ongoing part of model deployment, not a one-time safety filter applied during training. The creature ban appears twice in a 3,500-plus-word base instructions document. It sits alongside instructions to never use destructive git commands without explicit user consent, to avoid emojis unless asked, and to follow the Codex harness's tool guidance. These are not edge cases. They are part of the product's documented behavioral contract.

For engineering leaders, this means the boundary between "what the model knows" and "what the instructions tell the model to do" is now a first-class engineering surface. When you evaluate coding agents, you are not just evaluating model quality. You are evaluating how well the vendor has identified and suppressed unwanted behavioral transfer from training, and how much latitude you have to add your own instruction layer without triggering unintended interactions. OpenAI explicitly notes in its explainer that if you want the goblins back, you can remove the instruction. That is both a transparency gesture and an implicit acknowledgment: the instruction layer is a configuration knob, not a fixed property of the model.
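Treating the instruction layer as a configuration knob can look something like the sketch below. It is an assumption-laden illustration, not Codex's actual file format: the `Rule` and `InstructionLayer` types, the rule ids, and the rule wording are all hypothetical paraphrases of the behaviors described above.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List


@dataclass(frozen=True)
class Rule:
    rule_id: str  # stable id so a rule can be toggled and tracked in review
    text: str     # hypothetical wording, not quoted from any vendor file


@dataclass
class InstructionLayer:
    name: str
    rules: List[Rule] = field(default_factory=list)


def render(layers: List[InstructionLayer], disabled: FrozenSet[str] = frozenset()) -> str:
    """Flatten layers into one instruction block, skipping disabled rule ids."""
    lines: List[str] = []
    for layer in layers:
        active = [r for r in layer.rules if r.rule_id not in disabled]
        if not active:
            continue
        lines.append(f"# {layer.name}")
        lines.extend(f"- {r.text}" for r in active)
    return "\n".join(lines)


base = InstructionLayer(
    name="Base instructions",
    rules=[
        Rule("no-creatures", "Do not mention goblins or gremlins unless clearly relevant."),
        Rule("safe-git", "Never run destructive git commands without explicit user consent."),
        Rule("no-emoji", "Avoid emojis unless the user asks for them."),
    ],
)
team = InstructionLayer(
    name="Team conventions",
    rules=[Rule("tests-first", "Run the test suite before proposing a commit.")],
)

default_prompt = render([base, team])
goblins_allowed = render([base, team], disabled=frozenset({"no-creatures"}))
print(default_prompt)
```

Giving each rule a stable id means a suppression can be switched off, diffed, and reviewed like any other configuration change, which is the practical meaning of calling it a knob.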

The deeper operational lesson is about the speed of model iteration versus the latency of behavioral discovery. GPT-5.5 was in training when the goblin problem was identified. The fix could not be retrofitted into the model — it had to be patched at the instruction level. For organizations that fine-tune models or build proprietary variants, this is a reminder that training contamination can take months to surface and months more to fully remediate. The instruction layer is the fastest-to-edit guardrail you have. Treat it accordingly.

The goblins are funny. The architecture behind them is not. OpenAI has published one of the most useful postmortems in the frontier AI era — not because the bug was interesting, but because it explains exactly how reward signal generalization works, why it matters for agentic deployments, and what it costs to fix after the fact. Anyone building seriously with AI coding tools should read it as a manual for what they are actually managing when they manage a model's instructions.

Sources: OpenAI, WIRED, Ars Technica
