Codex /goal Is the Background-Agent Feature That Needs a Definition of Done Before It Needs More Autonomy

Codex /goal sounds like another autonomy feature. It is better understood as a forcing function for engineering discipline. The feature lets Codex keep working across turns toward a durable objective, but the useful part is not that the agent can run for hours. The useful part is that OpenAI’s own docs keep returning to the same unglamorous requirement: define “done” before the agent starts.

The new Codex use-case page describes /goal as an experimental CLI feature for long-running work with a clear success condition and validation loop. Users can enable it from /experimental or by adding goals = true under [features] in config.toml. Then they set an objective with /goal <objective>, inspect the current goal with /goal, and control the run with /goal pause, /goal resume, or /goal clear. OpenAI says Codex can work independently for multiple hours and stop when it is fairly confident it has reached the stopping condition.

That last sentence is both the promise and the risk. “Multiple hours” is a productivity win when the work is scoped. It is a budget and review problem when the work is vague. A background agent with no crisp stopping condition is not autonomous engineering. It is a recursive TODO list with shell access.

Goals are executable tickets, not wishes

OpenAI’s docs say a good goal is bigger than one prompt but smaller than an open-ended backlog. That is the correct boundary. “Modernize the frontend” is not a goal. “Migrate these five routes to the new router, preserve existing contract tests, keep the legacy route behind a flag, and stop when npm test -- router plus the Playwright smoke suite pass” is a goal. The difference is not prompt polish. It is operational safety.

This is where teams should borrow from good engineering ticket design. A usable goal names the objective, non-goals, files or docs to read first, allowed scope, validation commands, rollback expectations, and stop conditions. It should tell Codex what not to change with the same clarity as what to change. The docs explicitly recommend pointing Codex at the files, issue, logs, or plan it must read first; defining commands or artifacts that prove progress; asking for checkpointed progress logs; and pausing or clearing the goal when the run is done, blocked, or changing direction.
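One way to make that ticket shape hard to skip is to encode it as a small data structure that refuses to render until every field is filled in. This is a sketch, not anything Codex ships: GoalSpec, its field names, and the single-line output format are all hypothetical, and whether /goal accepts text in this layout is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class GoalSpec:
    """Hypothetical ticket shape for a /goal run; field names are illustrative."""
    objective: str
    non_goals: list[str] = field(default_factory=list)
    read_first: list[str] = field(default_factory=list)
    allowed_scope: list[str] = field(default_factory=list)
    validation_commands: list[str] = field(default_factory=list)
    rollback: str = ""
    stop_conditions: list[str] = field(default_factory=list)

    def missing_fields(self) -> list[str]:
        """Names of fields left empty; an empty list means the ticket is complete."""
        return [name for name, value in vars(self).items() if not value]

    def render(self) -> str:
        """Flatten the spec into one line of text to paste after /goal."""
        missing = self.missing_fields()
        if missing:
            raise ValueError(f"goal is underspecified: {', '.join(missing)}")
        parts = [
            f"Objective: {self.objective}",
            f"Non-goals: {'; '.join(self.non_goals)}",
            f"Read first: {'; '.join(self.read_first)}",
            f"Allowed scope: {'; '.join(self.allowed_scope)}",
            f"Validate with: {'; '.join(self.validation_commands)}",
            f"Rollback: {self.rollback}",
            f"Stop when: {'; '.join(self.stop_conditions)}",
        ]
        return " | ".join(parts)


# A wish, not a goal: everything except the objective is missing.
vague = GoalSpec(objective="Modernize the frontend")
print(vague.missing_fields())
```

The point of raising on an incomplete spec is that the refusal happens before the run starts, when fixing the ticket costs minutes instead of hours.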

That guidance matters because long-running agents fail differently from chat agents. A bad one-turn answer is visible immediately. A bad three-hour goal can produce a broad diff that looks coherent but encodes a misunderstanding from the first ten minutes. The longer the loop runs, the more important it becomes to constrain the loop with tests, checkpoints, and stop conditions that do not depend on the agent’s vibes.

The status update is part of the artifact

OpenAI says useful /goal status updates should name the current checkpoint, what was verified, what remains, and whether Codex is blocked. That is not cosmetic. It is how humans keep a background run reviewable without watching every keystroke. If a status update says “working on improvements,” the goal is already drifting. If it says “checkpoint 2/5: migrated route parser, verified contract tests pass, next is auth middleware parity, blocked on missing fixture for expired token case,” a reviewer can make a decision.
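The four elements of a useful update can be captured in a small record that either renders a reviewable line or flags itself as drifting. This is an illustrative sketch; StatusUpdate and its fields are hypothetical names, not a format Codex emits.

```python
from dataclasses import dataclass


@dataclass
class StatusUpdate:
    """Hypothetical record of one status update from a background run."""
    checkpoint: int
    total_checkpoints: int
    current: str
    verified: str
    remaining: str
    blocked_on: str = ""

    def is_reviewable(self) -> bool:
        """True when the update names a checkpoint, what was verified, and what remains."""
        has_text = all(s.strip() for s in (self.current, self.verified, self.remaining))
        return has_text and 1 <= self.checkpoint <= self.total_checkpoints

    def render(self) -> str:
        blocked = f"blocked on {self.blocked_on}" if self.blocked_on else "not blocked"
        return (
            f"checkpoint {self.checkpoint}/{self.total_checkpoints}: {self.current}, "
            f"verified {self.verified}, next is {self.remaining}, {blocked}"
        )


update = StatusUpdate(
    checkpoint=2,
    total_checkpoints=5,
    current="migrated route parser",
    verified="contract tests pass",
    remaining="auth middleware parity",
    blocked_on="missing fixture for expired token case",
)
print(update.render())
```

An update like "working on improvements" has nothing to put in the verified or remaining fields, so is_reviewable returns False and the reviewer knows to intervene.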

Teams should standardize that format. Every background-agent run should leave behind a short progress log with checkpoints, commands run, test outcomes, files touched, assumptions made, and reasons for stopping. That log should live next to the diff or be pasted into the pull request description. Otherwise reviewers inherit a pile of changes and no trail of how the agent got there.

This is also where cost becomes visible. Codex pricing uses five-hour usage windows, and local messages, cloud tasks, model choice, image generation, subagents, and code review all shape consumption. Multi-hour loops are exactly where hidden costs show up: repeated test runs, broad context reads, failed attempts, and status chatter. The right metric is not just “did the task complete?” It is “how much rework did the final diff require per hour of agent runtime?” A cheap-looking background run that produces a diff humans spend a day unwinding is not cheap.
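The rework metric is simple arithmetic once a team agrees on how to count rework. A rough sketch, assuming rework is measured in human hours spent fixing or unwinding the final diff; the function name and the eight-hour working day are assumptions for illustration.

```python
def rework_per_agent_hour(rework_hours: float, agent_runtime_hours: float) -> float:
    """Human rework hours required per hour the agent ran."""
    if agent_runtime_hours <= 0:
        raise ValueError("agent_runtime_hours must be positive")
    if rework_hours < 0:
        raise ValueError("rework_hours cannot be negative")
    return rework_hours / agent_runtime_hours


# A three-hour run whose diff took a working day (assumed 8 hours) to unwind.
ratio = rework_per_agent_hour(rework_hours=8.0, agent_runtime_hours=3.0)
print(f"{ratio:.2f} human rework hours per agent hour")
```

A ratio above 1 means the run consumed more human time than agent time, which is the "not cheap" case however small the usage bill looks.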

Autonomy needs a smaller permission set, not a bigger one

There is a tempting but wrong instinct: because a goal runs longer, give it more access so it does not get stuck. That is backwards. Longer-running work needs tighter boundaries because it has more time to encounter hostile input, stale assumptions, weird dependency scripts, and ambiguous failures. Use read-only or conservative approval modes when the task is exploratory. Grant broader permissions only inside disposable environments or when the validation commands cannot run without them. Start from a clean git working tree so that every change in the final diff is attributable to the run. Require approvals for network access, destructive commands, protected paths, secret reads, and any action that crosses from code editing into operational state.
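The clean-tree rule is the easiest of these to automate. A minimal preflight sketch, assuming git is on the PATH and using the standard git status --porcelain output; the helper names are hypothetical.

```python
import subprocess


def dirty_paths(porcelain_output: str) -> list[str]:
    """Extract entries from `git status --porcelain` (v1 format) output."""
    paths = []
    for line in porcelain_output.splitlines():
        if line.strip():
            # Each entry is a two-character status, a space, then the path.
            # Renames appear as "old -> new" and are kept as-is.
            paths.append(line[3:])
    return paths


def worktree_is_clean(repo_dir: str = ".") -> bool:
    """True when git reports no modified, staged, or untracked files."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return not dirty_paths(result.stdout)


# Parsing a captured sample rather than calling git, so the example runs anywhere.
sample = " M src/router.ts\n?? notes.txt\n"
print(dirty_paths(sample))
```

Running worktree_is_clean() before setting a goal, and refusing to start when it returns False, keeps pre-existing edits out of the diff the reviewer will later have to attribute.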

The best /goal candidates are chores with verifiable loops: migrations, codemods, test hardening, prompt optimization against an eval suite, dependency updates with compatibility checks, prototype polish, and deployment retry loops where the stopping condition is mechanical. The worst candidates are product strategy, sweeping refactors without tests, security-sensitive rewrites, and anything whose success depends on taste but has no review checkpoints.

There is also a market signal here. Coding tools are splitting into surfaces. IDE agents own the tight edit loop. Cloud tasks own async delegation. Remote connections move work onto the host where the build actually runs. Browser agents operate inside logged-in web apps. /goal tries to own durable local autonomy: keep iterating until a defined objective is satisfied. That is a powerful primitive, but it is not magic. It is only as good as the contract it is given.

Practitioners should build a small goal template now, before everyone invents their own bad version. Include: objective, non-goals, repo state, files to read first, allowed files, validation commands, checkpoint format, pause conditions, expected final artifacts, and reviewer notes. Ask Codex to restate the goal before starting. If the restatement is wrong, do not let the run begin. This is not bureaucracy. It is how you prevent a background agent from spending three hours faithfully executing the wrong interpretation.

The editorial take is simple: /goal is not the feature that eliminates planning. It is the feature that punishes weak planning faster. Background agents are useful when goals become executable specs with tests, not motivational posters for a robot with a terminal.

Sources: OpenAI Developers, Codex CLI features, Codex pricing, Agent approvals and security, Codex IDE features