I once worked for a startup called "Totality." Our business was outsourced web operations for companies that either didn’t want to invest in their own 24x7 operations center or lacked the skills to build and staff one. We handled all the production management, change management, and incident management, essentially the entire ITIL (IT Infrastructure Library) suite of processes.
During my time at Totality, we observed that nearly 50% of all our software outages happened within 24 hours of a software release. Since we were on the hook for uptime, but not new features, our response was obvious: Stop touching things!
Minimizing change to keep systems stable was a common practice. Ask any admin from those days. They’ll all tell you that the prevailing sentiment was to leave the systems alone, because the more you change things, the more you mess them up. Your servers start in a state of grace and can only be corrupted from there. For several decades, this approach to managing risk has prevailed. Control and manage change. Any changes must be scrutinized, so we made sure humans put their eyeballs on everything, regardless of how much that slowed everything down.
We were wrong. The true premise of change management should be this: making changes in software systems is painful, so we should do it more often to get really good at it. If we make changes frequently, then we have to develop safeguards. A change to one part of a system looks like downtime to another part, so we must make our systems defend against continuous partial failure. Instead of minimizing risk by resisting change, we should change so often that everyone is absolutely forced to make their stuff resilient.
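To make "defend against continuous partial failure" a little more concrete, here is a minimal sketch of one such safeguard: a caller that retries a dependency a few times and then falls back to a default instead of failing itself. The names `call_with_fallback` and `MidDeployService` are hypothetical and exist only for this illustration. Real systems would likely add timeouts, backoff, and circuit breakers on top.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_fallback(
    primary: Callable[[], T],
    fallback: T,
    attempts: int = 3,
    delay_s: float = 0.0,
) -> T:
    """Call `primary` up to `attempts` times; return `fallback` if every try fails."""
    for attempt in range(attempts):
        try:
            return primary()
        except Exception:
            # Pause between tries, but not after the last one.
            if attempt < attempts - 1:
                time.sleep(delay_s)
    return fallback


class MidDeployService:
    """Hypothetical dependency that fails its first `down_calls` calls,
    the way a service looks to its callers while it is being redeployed."""

    def __init__(self, down_calls: int) -> None:
        self.remaining_failures = down_calls

    def recommendations(self) -> List[str]:
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("service restarting")
        return ["item-a", "item-b"]


if __name__ == "__main__":
    # A short restart is absorbed by the retries.
    brief = MidDeployService(down_calls=2)
    print(call_with_fallback(brief.recommendations, fallback=[]))

    # A longer outage degrades to an empty list instead of an error page.
    long_outage = MidDeployService(down_calls=10)
    print(call_with_fallback(long_outage.recommendations, fallback=[]))
```

The design choice worth noticing is that the caller decides what "degraded but working" means. An empty recommendation list is a worse page, but it is still a page.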
When John Allspaw, now CTO of Etsy but formerly head of operations at Flickr, delivered his presentation, “10+ Deploys per Day: Dev and Ops Cooperation at Flickr,” at the 2009 Velocity conference, the idea of doing even one change every day was kind of mind-blowing. To say you were doing 10 sounded like complete insanity. But John showed that getting to that frequency of change forced you to do things differently, and that you could use that rate of change as an advantage.
Think about your first time on an ice rink. You probably started by clumsily hanging on to the side railing. But you’ll have to take a risk and let go to skate to the center of the ice. And if you’re playing hockey, you can’t score a goal if you’re clutching the side. You have to let go of the railing and do things differently.
To move towards antifragility, you have to let go of the railing, let go of the traditional approach to running your systems. Here are four paths to get you there:
1. Make Changes Frequently So You Get Really Good At It — By making changes frequently, you get things done in a shorter amount of time, and reduce your backlog, or inventory. You’ll get the reduction in cycle time that lean efforts aim for, with continuous delivery as the ultimate end point. (See more about the high costs of inventory in my previous post “No Silver Bullet: Resilience Isn’t Enough.”)
2. Do Game Day Exercises — The Game Day exercise is sort of a new take on disaster recovery or disaster readiness plans. The idea of a disaster readiness plan went something like this: your company stages a disaster, for example a hurricane, assumes that your primary data center is offline, and that some number of randomly selected people aren't coming into the office. With this scenario in place, you test to see if you can maintain continuous operations.
The premise is this: if you never test your disaster recovery plan, then you never know if it will work. Test it once, and you know it works then. But things will drift away, so you want to test periodically. The more often you go through the exercise, the higher your confidence and the more streamlined the execution will be. If you crank up the frequency and do the drills continuously, you will eventually be able to survive continuous small disasters. You will just build everything to survive shocks and impacts.
3. Inject Chaos — This concept is best embodied by the Chaos Monkey and the rest of the Simian Army, and represents the continuous version of the Game Day exercise. The idea is to deliberately introduce failure to make sure your system is resilient to failure, and to deliberately introduce random latency to make sure you’re tolerant of latency. I’ll talk more about injecting variation as a means to antifragility in a coming post.
4. Architect For Change — In the old way, we built safety into the process. Humans met and reviewed every change. These people were supposed to have a kind of software clairvoyance, to read a change request form and know whether it was correct and safe. Of course, it didn't always work because most people aren't clairvoyant. Instead, they would look to see if the form was filled out correctly. In the gap between filling out the form correctly and predicting the impact of a software deployment, we can find a lot of downtime.
To really crank up the rate of change, we have to build safety into the architecture and the organization. We must find tools, platforms, and patterns of interaction that allow independently varying components to be deployed at any time. Microservices are one approach, but not the only one. Other kinds of architecture also allow this evolutionary dynamic.
We must also build teams that are autonomous and skilled. They have to be connected to user experience and business results.
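The chaos injection idea from the third path can be sketched in a few lines. This is not how Chaos Monkey itself works (it terminates real instances in production); it is only a small in-process illustration of the same principle. The wrapper name `with_chaos`, the `InjectedFailure` exception, and the failure rate and latency values are all assumptions made up for this example.

```python
import random
import time
from typing import Any, Callable


class InjectedFailure(Exception):
    """Raised by the chaos wrapper, never by the real dependency."""


def with_chaos(
    func: Callable[..., Any],
    *,
    failure_rate: float,
    max_latency_s: float,
    rng: random.Random,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[..., Any]:
    """Wrap `func` so each call gets a random delay and may fail on purpose."""

    def wrapped(*args: Any, **kwargs: Any) -> Any:
        # Inject latency: wait a random amount up to max_latency_s.
        sleep(rng.uniform(0.0, max_latency_s))
        # Inject failure: fail with probability failure_rate.
        if rng.random() < failure_rate:
            raise InjectedFailure("chaos: simulated dependency failure")
        return func(*args, **kwargs)

    return wrapped


def lookup_price(sku: str) -> float:
    """Hypothetical dependency call used only for the demonstration."""
    return 9.99


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so a run can be repeated
    # Skip real sleeping in the demo so it finishes instantly.
    chaotic_lookup = with_chaos(
        lookup_price,
        failure_rate=0.2,
        max_latency_s=0.3,
        rng=rng,
        sleep=lambda seconds: None,
    )

    served, degraded = 0, 0
    for _ in range(1000):
        try:
            chaotic_lookup("sku-123")
            served += 1
        except InjectedFailure:
            # A resilient caller degrades here instead of crashing.
            degraded += 1
    print(f"served={served} degraded={degraded}")
```

Passing in the random generator and the sleep function keeps the wrapper testable. In a drill you would leave the default `time.sleep` in place so callers feel the latency for real.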
The four paths discussed here are all ways to reach antifragility by embracing change. "Embrace change" was the motto of eXtreme Programming, back in the days when agile development was new and radical. When Kent Beck wrote “eXtreme Programming Explained,” he put "Embrace Change" front and center as the subtitle. They are still wise words.
The old path to change management was to reduce risk by minimizing change. That may have reduced the day-to-day risk, but it caused a huge pileup of existential risk in unmet business needs. We need to move from the traditional risk management perspectives of our legacy architectures to the new proactive approach: minimize risk by maximizing change. The new path takes change and makes it a power, not a weakness.
The new path demands a shift in technologies, tools and mindset. We already have the tools to reduce the cost of building things to the point where you can think about disposable code and decomposing your system into very small pieces. These very small pieces are antifragile because you can take a system apart and recombine the pieces in new and interesting ways.
Next we will talk about why you want to shut off systems and get rid of code at every opportunity.
Read all of Michael Nygard's The New Normal series here.