Ramon Chavez
/ Engineering

Antifragile Workflows: What Pressure Reveals

#Engineering #Workflow

I’ve seen the same failure pattern in many different settings.

It shows up on a watch floor at 2 a.m. It shows up in security operations when the queue spikes. It shows up in software when something breaks right before a deadline. The details change, but the shape stays pretty consistent.

Pressure doesn’t magically create new cracks. It mostly exposes what was already fragile: the undocumented dependency, the unowned handoff, the missing rollback, the “one person who knows how this works.”

The annoying part is that a brittle workflow can look fine for a long time. It runs on calm conditions and good intentions. Then one assumption breaks and the workflow becomes whatever is left when memory and luck are removed.

Here are the three stress questions I use when I want to see the truth quickly:

  1. If we cut the time in half, what fails first?
  2. If the most knowledgeable person is offline, what stops?
  3. If the favorite tool is down, what becomes impossible?
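
Questions 2 and 3 can be asked mechanically once the workflow is written down. Here is a minimal sketch, assuming a hypothetical `Step` record that lists who can run each step and which tools can each do the job on their own. The names and sample data are illustrative, not taken from a real system.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One workflow step (hypothetical structure, for illustration only)."""
    name: str
    people: set = field(default_factory=set)  # everyone who can run this step
    tools: set = field(default_factory=set)   # alternatives; any one is enough


def blocked_steps(steps, offline_person=None, tool_down=None):
    """Return the names of steps that cannot run once a person or tool is removed."""
    blocked = []
    for step in steps:
        people_left = step.people - {offline_person}
        tools_left = step.tools - {tool_down}
        # A step with no tools listed is assumed to need no tool at all.
        if not people_left or (step.tools and not tools_left):
            blocked.append(step.name)
    return blocked


steps = [
    Step("triage", {"ana", "ben"}, {"pager", "chat"}),
    Step("deploy", {"ana"}, {"ci"}),
    Step("notify", {"ben", "cy"}, {"chat"}),
]

print(blocked_steps(steps, offline_person="ana"))  # ['deploy']
print(blocked_steps(steps, tool_down="chat"))      # ['notify']
```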

If any of those answers is basically “everything,” you have single points of failure. The good news is the fixes are usually boring.

Start with clarity. For any recurring workflow, you should be able to state the objective, the owner, and the done criteria without a debate. If you can’t, the debate will happen later under pressure, and you will call it coordination.

Then tighten boundaries. Under stress, more optionality rarely helps. You want fewer ways to do the wrong thing. You want defaults that are safe and reversible. You want short checklists, ones people actually run, for the high-risk steps.
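
The clarity and boundary steps can both live in the workflow definition itself. Here is a rough sketch, assuming a hypothetical `Workflow` record. The field names and the `dry_run` flag are my own illustration, not a standard. It reports which clarity fields are still blank, and it defaults to doing nothing irreversible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workflow:
    """Hypothetical workflow record; field names are illustrative."""
    name: str
    objective: str = ""
    owner: str = ""
    done_criteria: str = ""
    dry_run: bool = True  # safe default: nothing changes unless someone opts in


def missing_clarity(wf):
    """List the clarity fields that are blank, so the debate happens now."""
    required = ("objective", "owner", "done_criteria")
    return [name for name in required if not getattr(wf, name).strip()]


wf = Workflow(name="rotate-keys", objective="Rotate API keys every 90 days")

print(missing_clarity(wf))  # ['owner', 'done_criteria']
print(wf.dry_run)           # True
```

An empty list from `missing_clarity` is the cheap version of “no debate needed.”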

Then make rollback real. A lot of workflows rely on “we can fix it later.” Under pressure, that turns into debt. A clean rollback path is a resilience multiplier because it lets you move fast without betting the farm.
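
One way to make that concrete is to refuse to run a step unless it comes with its own undo. This is a minimal sketch, assuming each step can be expressed as an `(apply, undo)` pair of callables. `run_with_rollback` is a hypothetical helper, not a library function.

```python
def run_with_rollback(steps):
    """Run (apply, undo) pairs in order; on failure, undo finished steps in reverse."""
    undo_stack = []
    try:
        for apply, undo in steps:
            apply()
            undo_stack.append(undo)  # only steps that succeeded get undone
    except Exception:
        # Caveat: if an undo itself raises, the remaining undos are skipped.
        for undo in reversed(undo_stack):
            undo()
        raise


state = []


def fail():
    raise RuntimeError("step three broke")


steps = [
    (lambda: state.append("a"), lambda: state.remove("a")),
    (lambda: state.append("b"), lambda: state.remove("b")),
    (fail, lambda: None),
]

try:
    run_with_rollback(steps)
except RuntimeError:
    pass

print(state)  # []
```

The failure still surfaces to the caller, but the system is back where it started rather than half-changed.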

Finally, do the part most teams skip: convert surprises into edits. The best post-incident review in the world doesn’t matter if it never changes what the next person does. The point is to make the next run easier: one dependency removed, one handoff clarified, one unsafe default replaced.

That’s how a workflow gets better over time. It isn’t dramatic. It’s a steady loop of small patches that compound.

The bar I use now is simple: could a competent, tired person execute this with incomplete context? If the answer is no, it isn’t robust yet. And if the answer is yes, the next question is whether the workflow improves after being stressed. If it does, you’re building something that can take real-life hits without turning every incident into a new fire drill.