
The power of status updates

# October 19, 2022

When I was leaving Paris by train out of Gare du Nord, we showed up twenty minutes before our scheduled departure. Typically that's more than enough time to grab a bite, find your cabin, and settle in before you roll out of the station. We checked the schedule board and saw our train wasn't yet assigned a platform. No problem. We got some food and came back around. Still no official gate.

As the clock ticked closer to departure, people started frantically running around. "Do you know which gate is ours?" "Is that our train?" With every minute that went by, a larger crowd assembled around the status board. When one person caught wind of a rumor, there was a manic movement of people to a new platform to try to board that train. Everyone was turning to their neighbors and asking if they had any idea what was going on.

We had pieced together which track our train was on and that it was delayed by an unknown period. We told one group what we knew. Then another. Before we knew it we were performing crowd control for a sea of people, telling them they needn't worry and that we would all get on a train.

When people don't have any update, they start to question reality. Are they missing something? Is it their fault? Will the problem ever get addressed? It puts the onus of information gathering on them, and that onus leads to stress.

People are understanding of setbacks when they happen but only when they know what's happening. That was the key issue in the mayhem at du Nord. There was no official communication or even acknowledgement of the delay. This vacuum pushed an annoying situation over the edge. A delayed train is one thing. The concern that you might miss it entirely is far worse.

Software is the same way. That's the power of outage dashboards. When people can confirm there's a known issue and there's an acknowledgment that people are working on it, it's comforting. Dashboards give the impression that you're as up to date on the issue as internal employees. An unacknowledged outage results in the same blame game that a physical situation would.

There's a reason why dashboards have become increasingly common over the last decade (and why Atlassian bought StatusPage). Hearing from people with more context can immediately dissolve fears. The outage playbooks that we ran at Globality had a sub-15-minute reporting window so we could update affected clients. Even if an update only featured an acknowledgement that the issue was still under investigation - or technical details that many wouldn't understand - it still went a long way. If you're stuck between undersharing and oversharing, overshare. It shows that all hands (or at least some hands) are on deck.

I remember one outage at Abstract with the login system. Some SSO accounts became delinked from their primary record and couldn't log in. Cole, the CTO, started emailing the client who reported the issue to give a play-by-play of the updates. All in all it took 30 minutes to fix. The client walked away incredibly thankful at the end of the outage, and a bigger advocate for the platform than when they started the day. The clear and frequent communication separated their experience from the status quo.

Interestingly, it's still not a de facto standard for most legacy software players to provide outage tracking. That's probably a cultural barrier - many organizations don't want to publicly acknowledge when there's a problem. It can harm their initial pitch of reliability. But in the long term the lack of clear reporting hurts more than it helps. It decreases trust in the application and in the organization. That might actually provide the opening for a challenger to contend for the throne.

For software and for trains, trust is everything. If only du Nord got the memo.
