openclaw

OpenClaw’s Heartbeat Fleet Problem Is a Scheduler Design Smell, Not Just an Ops Complaint

Anatoliy Kolodkin

23 May 2026 • 4 min read

“Every 30 minutes” sounds harmless until fifteen agents hear it as “all at once.”

That is the scheduler smell documented in OpenClaw issue #85885. An operator running a 15-agent OpenClaw deployment reports that agents configured with default heartbeat: {} fire on the same cadence, producing synchronized gateway load spikes. The attempted fix was the obvious one: replace native heartbeats with staggered cron jobs. But OpenClaw currently blocks cron add --session main for non-default agents, while --session isolated loses the agent’s persistent HEARTBEAT.md routine context. The operator is left with reactive pressure scripts instead of preventive scheduling.

This is not just an ops complaint. It is what happens when an agent platform grows from one assistant into a fleet without giving the scheduler a fleet model.

A heartbeat is not just a timer

In a single-agent setup, a heartbeat is a useful periodic nudge. Check the inbox. Look at the calendar. Review background work. Stay quiet if nothing matters. The default 30-minute cadence is easy to understand and cheap enough.

In a 15-agent deployment, the same default becomes a thundering herd. Every agent wakes, loads context, maybe reads routine files, maybe calls tools, maybe spawns subagents, maybe touches the gateway at the same moment. The operator in #85885 already has a custom pressure_check.sh in a sub-agent spawn wrapper that aborts spawns when event-loop p99 exceeds 1500ms or sustained two-tick CPU/core ratio exceeds 0.85. That is not overengineering. That is an operator compensating for missing load-shaping primitives.

The attempted cron workaround shows where the product boundary is wrong. The command was essentially: add a cron per agent, target the main session, and offset the schedule. OpenClaw rejects that for non-default agents with GatewayClientRequestError: sessionTarget "main" is only valid for the default agent. Use sessionTarget "isolated".... The isolated fallback technically fires, but it creates fresh turns without the agent workspace’s HEARTBEAT.md routine context. That changes the semantics. It is not the same heartbeat at a better time; it is a different execution lane that lacks the instructions that make the heartbeat useful.

That context detail is the heart of the issue. A heartbeat is not merely “run the model every N minutes.” It is a scheduled turn in a particular agent identity, workspace, memory, and routine. Strip those away and you get capacity burn without the behavior the operator configured. Good scheduling APIs preserve execution context while changing timing. Bad ones force users to choose between correct context and safe load distribution.

Agent fleets need distributed-systems controls

The proposed fixes in #85885 are revealing. Option A: native heartbeat.offsetMinutes. Option B: allow --session main for non-default agents in cron API. Option C: allow heartbeat.every: "off" | "disabled". The reporter prefers B over A over C, which makes sense if the real goal is to reuse cron as the flexible scheduler while preserving the agent’s main-session semantics.

But the larger pattern is broader than this one API. Multi-agent systems need the same controls any distributed workload needs: jitter, offsets, concurrency caps, backpressure, explicit disable switches, and observability around scheduled work. “Delete the config key” is not an operator-grade disable mechanism. Neither is synchronized default cadence. Neither is a cron path that only preserves main-session context for the default agent.

The timing with related issue #85871 is also useful context. Filed 28 minutes earlier, it reports a separate heartbeat regression: scheduler does not fire on 2026.5.20 or any tested 5.x version, while 2026.4.23 works. A manual CLI trigger returns OK but does not route to the main agent session. Nearby PR #85881 adds test coverage for target-none scheduler dispatch before closing. Taken together, heartbeat is under active stress. That is what happens when “background nicety” becomes “fleet runtime surface.”

For practitioners running OpenClaw fleets, the immediate work is pragmatic. Audit heartbeat cadence across agents. If they all use defaults, assume collision until measured otherwise. Watch event-loop delay, CPU/core ratio, subagent spawn failures, gateway response time, and token usage around half-hour marks. If your agents have heavy heartbeat routines — inbox checks, web fetches, git status scans, memory review, channel polling — treat those as scheduled jobs with load impact, not background vibes.

If you try to stagger with cron, verify not just that the cron fires but that it lands in the right agent context. Does it read the intended HEARTBEAT.md? Does it have the same workspace? Does it use the same main session or a fresh isolated lane? Does the transcript show a meaningful heartbeat or a hollow synthetic turn? The difference matters because context is part of the workload contract.

The deeper product lesson is that agent scheduling cannot remain an afterthought. Once a platform supports multiple agents with persistent routines, the scheduler becomes part of governance. It decides when agents spend tokens, when they touch tools, when they contend for gateway capacity, and whether background work happens in the right identity. That is operationally equivalent to a fleet scheduler, even if the UI calls it a heartbeat.

OpenClaw’s issue is refreshingly concrete because it avoids vague “make it scalable” language. The operator gives numbers, commands, failure modes, and preferred fixes. That is the right kind of report. The project should treat it as more than an enhancement request. Scheduler semantics are now part of the agent operating surface.

The editorial take: timers are fine for one assistant. Fleets need scheduling policy. If OpenClaw wants users to run specialist agents side by side, it needs heartbeat controls that preserve context, spread load, and make disabling explicit. “Every 30 minutes” was a reasonable default. At fleet scale, it is a distributed-systems decision pretending to be a convenience setting.

Sources: OpenClaw issue #85885, OpenClaw issue #85871, OpenClaw PR #85881, OpenClaw issue #84675

A heartbeat is not just a timer

Agent fleets need distributed-systems controls

Sign up for more like this.