OpenClaw’s ENETDOWN Crash Is a Reminder That SSRF Guards Are Still Network Code
Security code does not get a pass on being production code. OpenClaw issue #86688 is a clean reminder: an SSRF guard that crashes the gateway when the network drops has moved the risk, not removed it.
The report says OpenClaw’s gateway can exit on an uncaught ENETDOWN error while the local network is unavailable and the SSRF guard is doing outbound DNS/connect work for Telegram’s API. On macOS, that can become more than a quick restart. Launchd’s respawn protection can silently park the LaunchAgent after repeated crashes, and one deployment reportedly sat broken for 2.4 days before the operator noticed. That is the worst kind of always-on-agent failure: not dramatic, just absent.
The report reproduces the bug on OpenClaw 2026.5.7, 2026.5.20, and 2026.5.22, running on macOS Apple Silicon with Node 25.9.0 and a pnpm install. The stack trace includes Error: connect ENETDOWN <telegram-api-ip>:443, Node net internals, undici/lib/core/connect.js, and OpenClaw’s bundled SSRF guard chunk. The stability bundle records the reason as uncaught_exception, with the error object carrying code: "ENETDOWN". That last detail points to the fix: Node’s own guidance is to identify system errors by error.code, because messages can vary.
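To make the error.code point concrete, here is a minimal sketch of reading the code off an error without touching its message. It assumes the common Node pattern in which a failed fetch() surfaces as a wrapper error whose cause holds the underlying system error, so it walks the cause chain. The helper name findErrorCode and the example IP address (from a documentation range) are hypothetical and not taken from OpenClaw.

```typescript
// Only the two fields this sketch reads; real errors carry more.
interface ErrorLike {
  code?: unknown;
  cause?: unknown;
}

// Hypothetical helper: return the first string `code` found on the error
// or anywhere along its `cause` chain. The depth limit guards against
// cyclic cause chains.
function findErrorCode(err: unknown, maxDepth = 5): string | undefined {
  let current: unknown = err;
  for (
    let depth = 0;
    depth < maxDepth && typeof current === "object" && current !== null;
    depth++
  ) {
    const candidate = current as ErrorLike;
    if (typeof candidate.code === "string") {
      return candidate.code;
    }
    current = candidate.cause;
  }
  return undefined;
}

// Usage: simulate a wrapped connect failure (address is illustrative only).
const simulatedInner = Object.assign(
  new Error("connect ENETDOWN 203.0.113.10:443"),
  { code: "ENETDOWN" },
);
const simulatedOuter = Object.assign(new TypeError("fetch failed"), {
  cause: simulatedInner,
});
console.log(findErrorCode(simulatedOuter)); // prints ENETDOWN
```

The classification never inspects the message text, so it keeps working if the wording or the address in the message changes.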
An SSRF guard is still an outbound network client
SSRF protection is usually discussed in policy language: block internal destinations, prevent DNS rebinding, route configured URLs through a safe fetch path, stop plugins from turning the gateway into a network pivot. That framing is correct but incomplete. The implementation still has to resolve names, open sockets, handle sleep/wake, survive interface changes, and classify boring operating-system failures. If it cannot do that, the security layer becomes an availability problem.
This matters because OpenClaw’s attack surface is network-heavy by design. Telegram, Slack, webhooks, browser tools, plugin channels, image callbacks, and custom integrations all route through code that has to decide whether outbound access is allowed. Centralizing SSRF checks is the right architectural move; raw fetch() scattered across extensions is how agent platforms become side-channel machines. But centralization also concentrates reliability obligations. If the shared guard mishandles ENETDOWN, every protected outbound path inherits that fragility.
The related fix PR, #86696, takes the practical route: add ENETDOWN to transient and benign network-error classifications. Specifically, the report says it adds the code to TRANSIENT_NETWORK_CODES, BENIGN_UNCAUGHT_EXCEPTION_NETWORK_CODES, TIMEOUT_ERROR_CODES, and message regexes, with tests for isTransientNetworkError() and isBenignUncaughtExceptionError(). That is not glamorous. It is exactly the kind of taxonomy maintenance that keeps long-running systems alive.
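As a rough illustration of what this kind of taxonomy looks like, the sketch below keeps a set of transient codes and uses it to decide what an uncaught-exception boundary should do. It is not OpenClaw's code: the set contents, EXAMPLE_TRANSIENT_CODES, isTransientExample, and decideUncaughtAction are all assumptions made for the example, and which codes belong in the set is a policy decision.

```typescript
// Illustrative list only; the codes are real errno/getaddrinfo codes,
// but treating each of them as transient is an assumption of this sketch.
const EXAMPLE_TRANSIENT_CODES: ReadonlySet<string> = new Set([
  "ECONNRESET",
  "ETIMEDOUT",
  "ENETUNREACH",
  "EHOSTUNREACH",
  "EAI_AGAIN",
  "ENETDOWN", // the code whose absence from such a list the report describes
]);

// Read a string `code` property from an unknown thrown value, if present.
function readCode(err: unknown): string | undefined {
  if (typeof err === "object" && err !== null) {
    const code = (err as { code?: unknown }).code;
    return typeof code === "string" ? code : undefined;
  }
  return undefined;
}

// Hypothetical classifier: true when the error's code is in the transient set.
function isTransientExample(err: unknown): boolean {
  const code = readCode(err);
  return code !== undefined && EXAMPLE_TRANSIENT_CODES.has(code);
}

type UncaughtAction = "log-and-continue" | "exit";

// Hypothetical policy for an uncaught-exception boundary: known transient
// network errors are logged, anything else still terminates the process.
function decideUncaughtAction(err: unknown): UncaughtAction {
  return isTransientExample(err) ? "log-and-continue" : "exit";
}
```

A function like decideUncaughtAction would typically be consulted from a process-level uncaughtException handler. Node's documentation cautions against resuming normal operation after an uncaught exception in general, so an allowlist like this is best treated as a safety net behind proper error handling at the call site.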
Launchd parking turns a crash into message loss
The macOS service-manager detail raises the severity. A gateway crash that immediately restarts is bad but visible. A gateway that crash-loops until launchd decides to stop trying is worse because the process does not necessarily fail in a way the user notices. Messages are missed, cron jobs do not run, callbacks do not land, and the operator may only discover the outage when someone complains that the agent went quiet.
The issue labels reflect that operational reality: P1, impact:message-loss, and impact:crash-loop, with ClawSweeper marking the fix shape clear and queueable. The report’s timeline is also instructive. One outage lasted from 2026-05-21 to 2026-05-23. A later incident was caught by a five-minute watchdog, which recovered the gateway within about ten minutes of an openclaw update. That implies this is not a synthetic “pulled the cable in a lab” bug. It is the kind of failure that appears under real sleep/wake, flaky network, and service-supervisor behavior.
For personal agents and channel bots, message freshness is a health signal. Process liveness is not enough. Port binding is not enough. Even supervisor status is not enough if the supervisor can silently park after repeated failures. An agent that misses Telegram messages for two days is down, even if the root cause looks like one missing network error code.
What operators and platform builders should do
Operators should upgrade once the fix lands, but they should not stop there. Add external liveness checks if the agent matters. Monitor not just process presence but successful channel polling, outbound delivery, cron execution recency, and message freshness. Alert when a service has not handled a message or scheduled job within its expected window. If you rely on macOS launchd, inspect restart behavior and failure thresholds rather than assuming KeepAlive=true means “forever.”
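The freshness idea can be sketched as a small check that an external watchdog runs on a schedule. Everything here is hypothetical: the FreshnessSignal shape, the signal names, and the thresholds are assumptions, and how the last-success timestamps get recorded (log file, status endpoint, database row) depends on the deployment.

```typescript
// One health signal: when something last succeeded and how old that
// success is allowed to get before the signal counts as stale.
interface FreshnessSignal {
  name: string;
  lastSuccessMs: number; // epoch milliseconds of the last success
  maxAgeMs: number; // expected window for this signal
}

// Return the names of all signals whose last success is older than allowed.
function findStaleSignals(
  signals: readonly FreshnessSignal[],
  nowMs: number,
): string[] {
  return signals
    .filter((signal) => nowMs - signal.lastSuccessMs > signal.maxAgeMs)
    .map((signal) => signal.name);
}

// Usage with made-up numbers: polling succeeded 10 s ago (fresh),
// the scheduled job last ran 15 minutes ago against a 5 minute window (stale).
const exampleNowMs = 1000000;
const exampleStale = findStaleSignals(
  [
    { name: "channel-poll", lastSuccessMs: exampleNowMs - 10000, maxAgeMs: 60000 },
    { name: "cron-job", lastSuccessMs: exampleNowMs - 900000, maxAgeMs: 300000 },
  ],
  exampleNowMs,
);
console.log(exampleStale); // prints [ 'cron-job' ]
```

The point of the design is that the check runs outside the gateway process and outside its supervisor, so it still fires when the process is parked and nothing is left to report its own failure.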
Platform builders should treat network error taxonomy as part of the security feature. Test interface-down, DNS failure, DNS success followed by TCP failure, provider unreachable, laptop sleep/wake, IPv6 weirdness, captive portals, and long-poll interruption. Classify by error.code, not by string-matching error messages. Then decide which errors are retryable, which are reportable, which should trip circuit breakers, and which should be allowed to bubble to a crash boundary. ENETDOWN belongs in the recoverable/transient family for a channel integration path; it should not be an uncaught exception that kills the gateway.
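One way to make those decisions explicit is a small policy table keyed on error.code, paired with a capped exponential backoff for the retryable family. This is a sketch under stated assumptions, not a recommended taxonomy: the groupings, the names classifyDisposition and backoffDelayMs, and the default delays are all hypothetical, and circuit breaking is left out for brevity.

```typescript
type Disposition = "retry" | "report" | "crash";

// Assumed grouping: conditions that usually clear on their own.
const RETRYABLE_CODES = new Set([
  "ENETDOWN",
  "ENETUNREACH",
  "EHOSTUNREACH",
  "ECONNRESET",
  "ETIMEDOUT",
  "EAI_AGAIN",
]);

// Assumed grouping: conditions that likely need operator attention.
const REPORTABLE_CODES = new Set(["ENOTFOUND", "ECONNREFUSED"]);

// Map an error code to a disposition. Unknown or missing codes fall
// through to "crash" so unclassified failures stay visible.
function classifyDisposition(code: string | undefined): Disposition {
  if (code !== undefined && RETRYABLE_CODES.has(code)) return "retry";
  if (code !== undefined && REPORTABLE_CODES.has(code)) return "report";
  return "crash";
}

// Capped exponential backoff: baseMs, 2*baseMs, 4*baseMs, ... up to capMs.
// `attempt` is zero-based.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 60000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}
```

Defaulting unknown codes to the crash boundary keeps the allowlist honest: a new failure mode shows up as a visible crash that prompts a classification decision, not as a silently swallowed error.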
The broader lesson is useful beyond OpenClaw. Agent security work often focuses on preventing the spectacular failure: credential exfiltration, SSRF into metadata services, malicious plugins, tool abuse. But the dull failure can be just as damaging. If the protection layer takes the agent offline whenever the network has a bad moment, users will route around it, disable it, or stop trusting the platform. Good security boundaries degrade safely. They do not become single points of crash.
So the take is not “SSRF guards are bad.” The take is that SSRF guards are production network clients now. They own DNS, connect errors, retries, classification, and graceful failure. OpenClaw’s fix is small, but the lesson is large: if security code sits on the hot path, reliability is part of the security model.
Sources: OpenClaw issue #86688, OpenClaw PR #86696, Node.js error documentation, OpenClaw v2026.5.22 release