A Fresh OpenClaw Cron Bug Shows Why Agent Schedulers Need Lane Isolation, Not Just Health Checks

The most useful OpenClaw bug reports are not the spectacular ones. They are the ones where everything looks healthy until a user notices the agent stopped behaving like an agent. GitHub issue #81234, opened just after 01:00 UTC on May 13, is one of those. After upgrading a beta host to 2026.5.12-beta.3, Discord direct messages stopped receiving assistant replies even though deep status checks still reported the gateway and Discord as healthy. At the same time, multiple cron jobs timed out after turn-accepted.

That pairing is the whole story. Agent platforms are no longer single-threaded chatbots waiting for one prompt at a time. They answer humans in live channels, run scheduled jobs, call model harnesses, spawn subagents, manage task registries, and keep session state across restarts. Once all of that shares a runtime, a green health check can be technically true and operationally useless.

The reporter’s environment was OpenClaw 2026.5.12-beta.3 on macOS arm64 with Node 25.9.0, Gateway LaunchAgent, and the Discord, Telegram, and WhatsApp channels enabled. The Discord symptom was not a broken token or a dead gateway. The reporter found stale cron job state that still carried a Discord direct-message session key shaped like agent:<agent>:discord:direct:<dm-id>. Clearing those stale sessionKey fields from the affected cron jobs and restarting the gateway restored Discord DM responsiveness.

That is a small mitigation with a large architectural smell. A scheduled job should not be able to contaminate or occupy a human-facing Discord DM lane just because some stale session metadata survived an upgrade or migration. If a cron job is configured as isolated, it should get an isolated lane. Not “usually isolated unless an old field points somewhere else.” Isolation that depends on perfect historical metadata is not isolation; it is optimism serialized to JSON.
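A minimal sketch of that invariant, assuming a simplified job record: the CronJob fields, the "isolated" value, and the agent:<agent>:cron:<job-id> lane shape are all hypothetical and are not taken from OpenClaw's code. The point is only that an isolated job derives its lane from its own identity and never reads the stored key.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CronJob:
    # Hypothetical job record; field names are illustrative, not OpenClaw's schema.
    job_id: str
    agent: str
    session_target: str                # assumed values: "isolated" or anything else
    session_key: Optional[str] = None  # stored key that may be stale after an upgrade


def resolve_lane(job: CronJob) -> str:
    """Return the session lane a cron run should use."""
    if job.session_target == "isolated":
        # Ignore stored metadata entirely for isolated jobs, so a stale
        # key cannot redirect the run into a human-facing lane.
        return f"agent:{job.agent}:cron:{job.job_id}"
    if job.session_key is None:
        raise ValueError(f"job {job.job_id} is not isolated but has no session key")
    return job.session_key
```

Under these assumptions, a job that still stores agent:main:discord:direct:123 but is declared isolated resolves to its own cron lane, so the stale field becomes harmless instead of load-bearing.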

“Healthy” is not the same thing as able to reply

The Discord failure exposes a monitoring gap that every agent runtime will have to close. Traditional service health answers questions like: is the process up, can the gateway respond, is the adapter connected, can the bot authenticate with Discord? Those checks matter. They do not answer the question the user cares about: can this inbound message acquire the right session, run a turn, and produce a visible reply?

For agent platforms, liveness needs to move up a layer. A channel adapter can be alive while the reply lane is wedged. A model provider can be reachable while a harness timeout retires the active client. A cron scheduler can accept a turn while the actual execution stalls behind a lock, fallback path, or context-maintenance delay. The system is not healthy merely because its components have pulses. It is healthy when the end-to-end path can still complete the work it claims to do.
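One way to picture a check at that layer is a synthetic probe that pushes a turn through the same path a real message would take and only reports healthy if a reply comes back in time. The sketch below is an illustration, not an OpenClaw API: send_and_wait is a hypothetical coroutine standing in for "submit an inbound message on this lane and await the visible reply".

```python
import asyncio
from typing import Awaitable, Callable


async def probe_reply_path(
    send_and_wait: Callable[[str], Awaitable[str]],  # hypothetical lane round trip
    timeout_s: float = 30.0,
) -> dict:
    """Run a synthetic turn and report whether the reply path completed."""
    try:
        reply = await asyncio.wait_for(send_and_wait("health-probe"), timeout=timeout_s)
    except asyncio.TimeoutError:
        # The adapter may be connected, but the lane never produced a reply.
        return {"healthy": False, "reason": "no reply within timeout"}
    except Exception as exc:
        return {"healthy": False, "reason": f"turn failed: {exc}"}
    return {"healthy": bool(reply), "reason": "ok" if reply else "empty reply"}
```

A wedged lane fails this probe even when every component-level check is green, which is the distinction the Discord symptom exposed.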

This is why issue #81234 matters beyond its beta label. The specific stale-key bug may be narrow. The class is not. Any runtime that lets scheduled jobs and live chat share session infrastructure needs hard invariants around lane ownership. Cron jobs should not reuse direct-message lanes unless explicitly configured to do so. Session keys should be normalized and migrated on startup. Doctor checks should flag cron jobs whose stored sessionKey conflicts with their declared target. Gateway startup should warn when background work points at live channel sessions in a way that can block human-facing replies.
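The doctor-style check described above could look roughly like the following. This is a sketch under stated assumptions: sessionKey and sessionTarget are the field names mentioned in the report, but the dictionary layout, the id field, the "isolated" value, and the marker list are hypothetical. Only the discord:direct key shape comes from the issue; other live lanes would need their own markers.

```python
from typing import Iterable

# Hypothetical markers for human-facing lanes, modeled on the key shape in the report.
LIVE_LANE_MARKERS = (":discord:direct:",)


def find_lane_conflicts(jobs: Iterable[dict]) -> list[str]:
    """Return one warning per isolated job whose stored key points at a live lane."""
    warnings = []
    for job in jobs:
        if job.get("sessionTarget") != "isolated":
            continue  # only isolated jobs are required to stay off live lanes
        key = job.get("sessionKey") or ""
        if any(marker in key for marker in LIVE_LANE_MARKERS):
            job_id = job.get("id", "<unknown>")
            warnings.append(f"{job_id}: isolated job stores live-lane key {key!r}")
    return warnings
```

Running a check like this at gateway startup would turn the stale-key condition into a warning before a human notices the silence.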

turn-accepted is where observability goes to die

The second failure mode in the report is cron timeouts after turn-accepted. Logs included messages such as “codex app-server client retired after timed-out turn” and “Profile openai-codex:<redacted> timed out. Trying next account...”, along with fallback decisions carrying sourceProvider openai, sourceModel gpt-5.5, and timedOut true. Before mitigation, jobs used OpenAI/Codex-family models including openai/gpt-5.5, openai/gpt-5.4-mini, and gpt-mini. After mitigation, all 35 visible cron jobs were pinned to anthropic/claude-opus-4-7, and a previously failing high-frequency sync job succeeded in 36,916 milliseconds (roughly 37 seconds).

That workaround is useful. It is not a root-cause verdict. Moving scheduled jobs away from one model family may avoid a timeout path, but the underlying failure could live in provider latency, Codex app-server lifecycle, model alias resolution, fallback exhaustion, session locking, or context-engine work. The report is careful on this point, and the platform should be too. “It worked after rerouting to Anthropic” is evidence for an operator workaround, not proof that the provider alone was guilty.

The phrase turn-accepted is the diagnostic problem in miniature. It says the runtime accepted responsibility for the work. It does not say where the work went next, what it waited on, which model account was tried, whether fallback was visible, whether a harness was retired, whether a session lock blocked progress, or whether the turn had any chance of producing a message. For humans on call, that is the worst kind of status: confidently incomplete.

OpenClaw’s nearby release work is relevant here. The same v2026.5.12-beta.3 release adds narrower cron inspection through PR #75117, visible fallback-failure surfacing through PR #80917, and Codex-native subagent task mirroring through PR #79512. That is the right neighborhood of fixes: inspect one job directly, surface an error when fallback would otherwise fail silently, and mirror background workers into a task registry. The bug report is effectively a test case for whether those observability primitives go far enough.

The operator playbook is boring, which is good

If you run OpenClaw with both cron jobs and live channels, the immediate checklist is not glamorous. Audit stored cron jobs for stale sessionKey values, especially keys that point at discord:direct or other live chat lanes while the job is supposed to be isolated. Compare declared sessionTarget against the actual stored key. After beta upgrades, inspect running tasks and last-run errors separately; old failures can survive mitigation and should not be mistaken for active stuck work. If OpenAI/Codex-family scheduled jobs begin timing out after turn acceptance, temporarily pin critical cron jobs to a known-good provider and preserve logs before restarting.

For maintainers, the fix should be more structural. Normalize cron session state at load time. Treat stale live-lane keys on isolated jobs as repairable corruption. Make lane acquisition visible in diagnostics. Split turn-accepted into phases operators can act on: provider request pending, harness startup, model timeout, fallback attempt, session lock wait, context engine delay, response delivery. If the runtime can tell the difference internally, the operator should not have to reverse-engineer it from log confetti.
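To make the phase split concrete, here is a small sketch of what a per-turn trace could record. The phase names mirror the list above; the TurnPhase and TurnTrace classes are hypothetical illustrations, not part of OpenClaw.

```python
import time
from enum import Enum


class TurnPhase(Enum):
    PROVIDER_REQUEST_PENDING = "provider-request-pending"
    HARNESS_STARTUP = "harness-startup"
    MODEL_TIMEOUT = "model-timeout"
    FALLBACK_ATTEMPT = "fallback-attempt"
    SESSION_LOCK_WAIT = "session-lock-wait"
    CONTEXT_ENGINE_DELAY = "context-engine-delay"
    RESPONSE_DELIVERY = "response-delivery"


class TurnTrace:
    """Records when a turn enters each phase so a stall can be attributed."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so the trace can be tested deterministically
        self._events = []    # (phase, timestamp) pairs in entry order

    def enter(self, phase: TurnPhase) -> None:
        self._events.append((phase, self._clock()))

    def stalled_in(self):
        """Return (phase, seconds spent so far) for the latest phase, or None."""
        if not self._events:
            return None
        phase, started = self._events[-1]
        return phase, self._clock() - started
```

With something like this attached to each accepted turn, a timeout report could say "stalled in session-lock-wait" with a duration rather than only that the turn was accepted and later timed out.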

The deeper lesson is that schedulers are not side features in agent platforms. They are production traffic. A cron job can read memory, call tools, send messages, consume model capacity, block a session, and collide with live user expectations. That means schedulers need lane isolation, priority semantics, timeout taxonomy, and health checks that include end-to-end completion. “The gateway is up” is table stakes. “This human-facing channel is not blocked by background work” is the bar.

Issue #81234 is beta-specific, but the lesson is general. Once agents can run on schedules and answer humans in the same runtime, session lanes become production infrastructure. If the platform treats them as incidental strings, outages will look healthy right until somebody asks why the bot went silent.

Sources: OpenClaw issue #81234, OpenClaw v2026.5.12-beta.3 release, PR #75117, PR #80917, PR #79512