Antifragile Workflows: What Pressure Reveals

Watch floor alerts at 2 a.m. have a lot in common with production deadline incidents. The details differ, but the overall shape is often similar. Something breaks, and you have to piece together a solution in a hurry off memory and a bit of luck.

Pressure doesn’t create new cracks. It just exposes the ones that were always there: the undocumented dependency, the unowned handoff, the missing rollback, the single person who knows how something works.

A brittle workflow can coast for months on calm conditions and good intentions. Then one assumption breaks, and the whole thing collapses.

I like to ask 3 questions to reveal the truth fast:

Cut the time in half. What fails first?
Take the most knowledgeable person offline. What stops?
Kill the favorite tool. What becomes impossible?

If the answer to any of those is “everything,” you’ve found your single points of failure.

The fixes are almost always boring, but sometimes boring is good. It sure beats panicking at the last minute.

First start with clarity. For any recurring workflow, you should be able to state the objective, the owner, and the done criteria without a debate. If you can’t say those three things plainly, the debate will happen later, under pressure, and everyone will call all the commotion “team coordination.”

Then tighten boundaries. Under stress, more options rarely help. You want fewer ways to do the wrong thing. Safe defaults. Reversible actions. Short checklists for the high-risk steps that people will actually run when they’re tired.

Make rollback real. A lot of workflows rely on “we can fix it later,” which turns into debt the moment things get hot. A clean rollback path is a resilience multiplier. It lets you move fast without betting the farm on every step.

Finally, do the thing most teams skip: turn surprises into documented lessons with fixes that followed-up on and implemented. The best post-incident review in the world is worthless if it never changes what the next person does. Remove a dependency. Clarify a handoff. Replace an unsafe default. One small patch per incident compounds over months.

The bar I use now: could a competent, tired person execute this with incomplete context? If not, it isn’t robust yet. And if they could, the next question is whether the workflow actually improves after being stressed. If it does, you’re building something that absorbs real-life hits instead of turning every incident into a new fire drill.