
WebSocket Keepalive: Diagnosing Zombie Connections and Zero-Cost Pings

"There seems to be an issue with the chat shoutbox... network error when sending message... maps connected dropped in half." That's the kind of bug report that makes you drop everything. A live feature that users depend on, silently failing for half of them.

What followed was a multi-hour troubleshooting session that uncovered two distinct bugs, revealed an interesting quirk of WebSocket connections, and led to a cost optimization that saved us from a billing surprise. Here's the complete story.

The Symptoms

Our EVE Frontier Map has a shared WebSocket connection powering both the live event ticker and the universe chat shoutbox. Users were reporting that sending a chat message showed "Network error"—yet they could still see incoming messages just fine. Even stranger, our "maps connected" counter had dropped from around 60 to around 30.

The WebSocket appeared connected from the browser's perspective (readyState === OPEN), but something was fundamentally broken. We had a classic zombie connection: the socket looked alive but was effectively dead for sending.

Bug #1: POST to /null

Before diving into WebSocket internals, we checked the browser console and found something unexpected:

POST https://ef-map.com/null 404 (Not Found)

This was our usage telemetry system failing silently. The usage.ts module attempts to detect the correct API endpoint at initialization, but in certain edge cases the detection would fail, leaving the endpoint variable as null. The code then tried to POST usage events using TypeScript's non-null assertion:

fetch(USAGE_ENDPOINT!, { ... })

When USAGE_ENDPOINT was null, fetch coerced it to the string "null", which the browser resolved as a relative URL: https://ef-map.com/null. Harmless but noisy, and a sign that endpoint detection wasn't reliable.

The fix was straightforward:

if (!USAGE_ENDPOINT || disabledDueToMissingEndpoint) {
  QUEUE.length = 0;  // Silently discard
  return;
}
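
For context, here's roughly where that guard sits in the flush path. This is a simplified sketch, not the actual usage.ts: apart from USAGE_ENDPOINT, QUEUE, and disabledDueToMissingEndpoint, the names are illustrative.

// Simplified flush path (sketch). Only USAGE_ENDPOINT, QUEUE and
// disabledDueToMissingEndpoint come from the real module.
type UsageEvent = { name: string; ts: number };

const QUEUE: UsageEvent[] = [];
let USAGE_ENDPOINT: string | null = null;
let disabledDueToMissingEndpoint = false;

async function flushUsage(): Promise<void> {
  if (!USAGE_ENDPOINT || disabledDueToMissingEndpoint) {
    QUEUE.length = 0;  // Silently discard; never POST to a literal "null" URL
    return;
  }
  const batch = QUEUE.splice(0, QUEUE.length);
  try {
    await fetch(USAGE_ENDPOINT, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(batch),
    });
  } catch {
    // Telemetry is best-effort: drop the batch rather than retry forever
  }
}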

The Real Problem: Zombie WebSockets

The POST bug was a red herring. The actual issue was that WebSocket connections can enter a zombie state—the browser thinks they're open, the server thinks they're closed, and messages disappear into the void.

This typically happens when something between the browser and the server (a NAT gateway, proxy, or load balancer with an idle timeout) silently drops the underlying TCP connection without a proper close handshake.

The browser's WebSocket API doesn't expose the underlying TCP connection state. readyState stays at OPEN until you try to send something and the operating system detects the broken pipe, which can take many minutes.

The Solution: Client-Side Keepalive

The standard solution is a keepalive ping mechanism. Send a periodic message, expect a response, and if no response comes, force a reconnect. We added this to our UniverseWebSocketContext:

const pingInterval = 120000;  // 2 minutes
const staleTimeout = 180000;  // 3 minutes

// Ping timer
const pingTimer = setInterval(() => {
  if (ws.readyState === WebSocket.OPEN) {
    ws.send(JSON.stringify({ type: 'ping' }));
  }
}, pingInterval);

// Staleness check
const staleCheck = setInterval(() => {
  if (Date.now() - lastMessageTime > staleTimeout) {
    console.log('WebSocket stale, forcing reconnect');
    ws.close();
    reconnect();
  }
}, 30000);
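
The snippet leans on lastMessageTime and reconnect(), which live elsewhere in the context. A minimal sketch of that wiring, with illustrative names and a placeholder URL, assuming the real version sits inside UniverseWebSocketContext:

// Minimal wiring for the keepalive snippet above (illustrative sketch).
const WS_URL = 'wss://example.com/universe';  // placeholder URL
let ws: WebSocket;
let lastMessageTime = Date.now();

function connect(): void {
  ws = new WebSocket(WS_URL);
  ws.onopen = () => { lastMessageTime = Date.now(); };     // fresh socket counts as activity
  ws.onmessage = () => { lastMessageTime = Date.now(); };  // any inbound frame proves liveness
}

function reconnect(): void {
  // Invoked after ws.close() by the staleness check; a short delay avoids
  // hot-looping if the network is flapping.
  setTimeout(connect, 2000);
}

In the real context the message handler also dispatches events to the ticker and shoutbox; the only part that matters for keepalive is refreshing lastMessageTime on every inbound frame.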

Now if the WebSocket goes zombie, we detect it within 3 minutes and automatically reconnect. But this raised a new question: what's the cost impact of all these ping messages?

Cloudflare Durable Objects Billing

Our WebSocket is powered by Cloudflare Durable Objects with WebSocket Hibernation. The billing model has some interesting characteristics:

Durable Objects WebSocket Billing

  • Incoming messages: 20 messages = 1 billed request
  • Duration charges: $0.000024/GB-s while "active"
  • Hibernation: Zero duration charges while sleeping
  • Free tier: 100,000 requests/day, 13,000 GB-s/day

With 60 concurrent users and 2-minute pings, that's 30 pings/minute × 60 minutes × 24 hours = 43,200 ping messages per day, or about 2,160 billed requests just for keepalives.

But worse: each ping wakes the Durable Object from hibernation, triggering duration charges. At scale, this adds up.
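
To put the same arithmetic in one place (the 20:1 ratio is Cloudflare's documented message-to-request conversion; the user count and interval are ours):

// Back-of-the-envelope keepalive volume for naive application-level pings.
const concurrentUsers = 60;
const pingIntervalMinutes = 2;
const messagesPerBilledRequest = 20;  // Durable Objects billing ratio

const pingsPerMinute = concurrentUsers / pingIntervalMinutes;        // 30
const pingsPerDay = pingsPerMinute * 60 * 24;                        // 43,200
const billedRequestsPerDay = pingsPerDay / messagesPerBilledRequest; // 2,160

console.log({ pingsPerDay, billedRequestsPerDay });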

RFC 6455 Ping/Pong: The Dream That Wasn't

The WebSocket protocol (RFC 6455) has built-in ping/pong frames at the protocol level, distinct from application messages. If we could use those, we might avoid waking the Durable Object entirely.

Unfortunately, the browser WebSocket API doesn't expose a ws.ping() method. RFC 6455 lets either endpoint send protocol-level pings, but in practice only servers can initiate them: browsers answer incoming pings with pongs automatically, yet give JavaScript no way to send one. No luck there.

The Revelation: setWebSocketAutoResponse()

While researching Cloudflare's Hibernation API, we discovered a feature designed exactly for this scenario: setWebSocketAutoResponse().

This allows you to configure the Durable Object to automatically respond to specific messages without waking up. The response happens at Cloudflare's edge, with zero duration charges and zero compute overhead:

// In Durable Object constructor
// Note: both strings are fixed when the pair is registered, so Date.now()
// below is evaluated once at construction, not per reply.
this.ctx.setWebSocketAutoResponse(
  new WebSocketRequestResponsePair(
    JSON.stringify({ type: 'ping' }),
    JSON.stringify({ type: 'pong', timestamp: Date.now() })
  )
);

Now when a client sends {"type":"ping"}, Cloudflare's edge immediately responds with {"type":"pong",...}—the Durable Object stays hibernating, no CPU cycles consumed, no duration charges. Zero cost keepalives.
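
One practical detail worth flagging: as we understand the API, the auto-response matches the incoming message string exactly, so the client has to send byte-for-byte the same JSON that was registered. A small sketch of how we'd keep the two sides in agreement and treat the pong as ordinary traffic (the shared-constant arrangement is illustrative):

// Shared constant so the client's ping matches the registered string exactly
// (illustrative; in practice this would live in a module both sides import).
const PING_MESSAGE = JSON.stringify({ type: 'ping' });

// Client side: send the exact registered string...
ws.send(PING_MESSAGE);

// ...and treat the edge's pong like any other inbound traffic.
ws.addEventListener('message', (event) => {
  const msg = JSON.parse(event.data as string);
  if (msg.type === 'pong') {
    lastMessageTime = Date.now();  // answered at the edge; the DO stayed asleep
  }
});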

Optimization: Don't Ping When Events Are Flowing

With the cost problem solved, we could optimize further. Our live event ticker generates around 10 events per minute during active periods. Each incoming event proves the connection is alive just as well as a ping response would.

Why send a ping if we received an event 30 seconds ago? We added a simple check:

pingTimer = setInterval(() => {
  const timeSinceLastMessage = Date.now() - lastMessageTime;
  
  // Skip ping if we recently received any message
  if (timeSinceLastMessage < pingInterval * 0.9) {
    return; // Connection proven alive by recent traffic
  }
  
  if (ws.readyState === WebSocket.OPEN) {
    ws.send(JSON.stringify({ type: 'ping' }));
  }
}, pingInterval);

Now pings only fire during genuine quiet periods—when the user is connected but no universe events are happening. In practice, with 15,000+ events per day flowing through the system, most users never send a single explicit ping.

Final Configuration

  • Ping interval: 2 minutes. Long enough for events to flow; short enough to catch zombies.
  • Stale timeout: 3 minutes. One missed ping plus buffer before a forced reconnect.
  • Conditional ping: 90% of the interval. Skip the ping if any message arrived in the last 108 seconds.
  • Auto-response: enabled. Zero-cost pong at the edge, no Durable Object wake.
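
Pulled together as client-side constants (a sketch mirroring the snippets above):

// Consolidated client-side keepalive configuration (sketch).
const PING_INTERVAL_MS = 2 * 60 * 1000;   // candidate ping every 2 minutes
const STALE_TIMEOUT_MS = 3 * 60 * 1000;   // force reconnect after 3 silent minutes
const STALE_CHECK_MS   = 30 * 1000;       // how often staleness is evaluated
const SKIP_PING_FACTOR = 0.9;             // skip ping if traffic seen in last 108 s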

Clearing the Shoutbox

As part of the fix, we also added an admin endpoint to clear the shoutbox history. The zombie connections had left orphaned messages that were confusing users on reconnect. A quick DELETE to /api/universe-events/shoutbox/clear let us start fresh.
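
The call itself is just an authenticated DELETE, assuming the same origin as the map app; the auth header below is a placeholder, since the real admin mechanism isn't part of this post.

// Clearing the shoutbox history (sketch; the Authorization header is a placeholder).
const ADMIN_TOKEN = process.env.ADMIN_TOKEN ?? '';

await fetch('https://ef-map.com/api/universe-events/shoutbox/clear', {
  method: 'DELETE',
  headers: { Authorization: `Bearer ${ADMIN_TOKEN}` },
});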

Lessons Learned

  1. WebSocket readyState lies—a socket can show OPEN while being completely dead for sending. Always implement keepalive detection.
  2. Check the console first—the POST to /null was a separate bug that could have been mistaken for the main issue.
  3. Understand your billing model—Cloudflare's 20:1 message ratio and hibernation semantics meant keepalives could be expensive if implemented naively.
  4. Read the docs thoroughly—setWebSocketAutoResponse() is designed exactly for keepalives but easy to miss in the Hibernation API documentation.
  5. Don't ping unnecessarily—if live traffic proves the connection is alive, save the message.

The Result

After deploying these changes, zombie connections are detected and recovered within three minutes, and the keepalive traffic that catches them costs nothing.

What started as a "network error" bug report turned into a deep dive on WebSocket reliability, Cloudflare billing, and the subtle art of keeping connections alive without waking sleeping servers.

Tags: websocket, durable objects, keepalive, cloudflare, zombie connections, auto-response, cost optimization, troubleshooting, eve frontier