Real-Time Is Harder Than It Looks
13 March 2026

Real-time looks easy from the outside. You open a WebSocket, you push events, clients update. The demo works in ten minutes. Then you put it in production and spend the next year understanding what the demo hid.
I've built real-time into multiple products. Clover and Argan are messaging and conferencing platforms — full WebSocket-based messaging, audio and video calls, presence indicators, typing indicators, delivery receipts. Then I rewrote the call engine, ending up with Elderberry, built on mediasoup. Here's what I actually learned.
WebSockets aren't hard — connection state is
The WebSocket API is simple. You connect, you send messages, you receive messages, you handle disconnects. In a controlled environment, it works exactly as expected.
The complexity isn't in the API. It's in the state machine you have to build around it.
What happens when a client disconnects mid-session? What happens when they reconnect — do they rejoin a room? With what state? What if the server restarts while clients are connected? What if two browser tabs are open and the user sends a message from one — does the other tab update?
All of these are cases you don't encounter in local development. All of them happen in production, constantly, from the first day you have real users. The naive implementation handles none of them. The production implementation is mostly connection state management with a thin layer of actual messaging on top.
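Here's a minimal sketch of what that state machine looks like on the client side — a wrapper that buffers outbound messages while disconnected, remembers the last event it processed, and asks the server to replay the gap on reconnect. The class and message names are illustrative, not from any real library:

```typescript
type ConnState = "connecting" | "open" | "closed";

class ResilientChannel {
  private state: ConnState = "closed";
  private queue: string[] = [];   // messages the app sent while offline
  private lastSeenId = 0;         // last server event we processed

  // Called when the underlying socket (re)opens.
  onOpen(send: (msg: string) => void) {
    this.state = "open";
    // First, ask the server to replay anything we missed while away.
    send(JSON.stringify({ type: "resume", since: this.lastSeenId }));
    // Then flush everything buffered during the outage.
    for (const msg of this.queue.splice(0)) send(msg);
  }

  send(msg: string, rawSend?: (m: string) => void) {
    if (this.state === "open" && rawSend) rawSend(msg);
    else this.queue.push(msg);    // buffer, don't drop
  }

  onServerEvent(id: number) {
    this.lastSeenId = Math.max(this.lastSeenId, id);
  }

  onClose() {
    this.state = "closed";        // real code schedules a backoff retry here
  }
}
```

Note how little of this is "messaging": it's bookkeeping so that a reconnect is a resume, not a fresh session. The multi-tab case falls out of the same design — every tab is just another channel replaying from its own `lastSeenId`.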
Presence is the hardest part
Real-time presence — "who is currently online" — seems like a simple feature. It's one of the most reliably tricky things to implement correctly.
The obvious approach: user connects → mark as online. User disconnects → mark as offline. It breaks immediately, for one specific reason: a user isn't a connection. A user is a person who might have three tabs open, a phone, and a desktop client all connected at once. If one of those disconnects, they're not offline.
The version that actually works tracks connections per user, not per socket. In Clover and Argan, each socket joins a Socket.IO room keyed to the user's ID on connect. The server maintains a counter of active connections per user. A disconnect event decrements the counter — and only marks the user offline if the counter reaches zero. As long as at least one device is still connected, presence stays green.
The implementation is straightforward once you see the model. Getting there requires realizing that the primitive you care about is the user, not the socket. By the time you have stable presence, you've built a small distributed systems problem. Which is fine — it's worth knowing that going in.
WebRTC: peer-to-peer is a lie
WebRTC's pitch is peer-to-peer audio/video — direct connections between browsers, no media server required. In simple cases, this is true. In production, it's mostly not.
The first problem is NAT traversal. Most users are behind routers that block unsolicited inbound connections. Peer-to-peer requires STUN servers to discover public addresses and TURN servers to relay media when direct connection fails. A significant fraction of call attempts will require TURN relay — it's common to see 20–30% of real-world calls relaying through TURN. Running a TURN server is a non-trivial infrastructure burden.
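In practice this means every production WebRTC client carries a configuration like the one below — the standard `iceServers` shape passed to `RTCPeerConnection`, with STUN for public-address discovery and TURN as the relay of last resort. The hostnames and credentials here are placeholders:

```typescript
const rtcConfig = {
  iceServers: [
    // STUN: lets the browser discover its public address.
    { urls: "stun:stun.example.com:3478" },
    // TURN: relays media when a direct path can't be established.
    {
      urls: "turn:turn.example.com:3478",
      username: "placeholder-user",
      credential: "placeholder-secret",
    },
  ],
};
// In the browser: new RTCPeerConnection(rtcConfig)
```

The STUN entry is nearly free to operate; the TURN entry is the expensive one, because relayed calls consume your server's bandwidth for the full duration of the call.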
The second problem is topology. In a call with more than two participants, pure peer-to-peer means each participant sends and receives N-1 streams. Four people in a call means each person is sending three video streams and receiving three. This doesn't scale. At six people it's noticeable. At twelve it's unusable.
The solution is a Selective Forwarding Unit — a media server that receives one stream from each participant and forwards the relevant streams to everyone else. This is what mediasoup provides.
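The back-of-envelope arithmetic makes the difference concrete. Per participant in an n-person call (ignoring simulcast layers, which add uplink streams in real deployments):

```typescript
// Full mesh: a stream to and from every other peer.
function meshStreams(n: number) {
  return { up: n - 1, down: n - 1 };
}

// SFU: one uplink to the server, which forwards everyone else's streams down.
function sfuStreams(n: number) {
  return { up: 1, down: n - 1 };
}
```

At four participants, mesh already asks each client for three simultaneous encodes and uploads; at twelve it asks for eleven. With an SFU the uplink stays constant at one, and the scaling cost moves to the server, where you can actually provision for it.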
Why I rewrote the media engine
The original Clover and Argan used WebRTC with a TURN server. It worked. But the products are self-hosted, sold to people who want to run their own instance. And self-hosting a TURN server is genuinely annoying — it requires a public IP, specific UDP ports open, and separate configuration from the main app.
The support burden was dominated by call quality issues that traced back to TURN misconfiguration. Users didn't want to manage the TURN server. They wanted calls to work.
Switching to mediasoup changed the architecture: instead of clients connecting peer-to-peer through TURN, they connect to the mediasoup server directly, which handles all media routing. No TURN server needed. Simpler deployment, better scalability, more control over quality.
The rewrite was significant — the entire call flow changed. But it solved the real problem, which wasn't technical at the WebRTC level. It was operational: users couldn't or wouldn't configure the infrastructure the original architecture required.
What real-time actually requires
After all of this, the pattern I see is that real-time complexity is mostly not in the protocol. WebSockets work. WebRTC works. The complexity is in everything adjacent: connection lifecycle, state recovery, operational simplicity for whoever is running the infrastructure.
The things I'd tell someone starting a real-time system now:
- Build the reconnection and state recovery logic early. It's not a polish feature.
- Presence based on connection events alone will fail you. Heartbeats aren't optional.
- If you need audio/video with more than two participants, you need an SFU. Price that in from the start.
- The deployment story is part of the product. An architecture that requires your users to manage infrastructure they don't understand is a support problem waiting to happen.
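On the heartbeat point: disconnect events alone miss connections that die silently (sleeping laptops, dropped mobile networks), so each connection also refreshes a timestamp and a periodic sweep expires stale ones. A minimal sketch, with an illustrative 30-second timeout:

```typescript
class HeartbeatPresence {
  private lastBeat = new Map<string, number>();

  constructor(private timeoutMs = 30_000) {}

  // Called whenever any heartbeat (or any message) arrives from the user.
  beat(userId: string, now: number) {
    this.lastBeat.set(userId, now);
  }

  // Run periodically; returns users whose heartbeats have gone stale.
  sweep(now: number): string[] {
    const expired: string[] = [];
    for (const [user, t] of this.lastBeat) {
      if (now - t > this.timeoutMs) {
        this.lastBeat.delete(user);
        expired.push(user);
      }
    }
    return expired;
  }
}
```

Taking `now` as a parameter rather than calling `Date.now()` internally keeps the expiry logic testable without real timers — the kind of small design choice that matters once presence bugs start arriving from production.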
Real-time is hard. Not because the building blocks are complicated, but because the edge cases are invisible in development and constant in production. Build for them from the start.