4 testing takeaways from “Meltdown” (Chris Clearfield & András Tilcsik)

I recently read “Meltdown” by Chris Clearfield & András Tilcsik. It was an engaging and enjoyable read, illustrated by many excellent real-world examples of failure. As is often the case, I found that much of the book’s content resonated closely with testing, and I’ll share four of the more obvious cases in this blog post, viz:

  1. The dangers of alarm storms
  2. Systems and complexity
  3. Safety systems become a cause of failure
  4. The value of pre-mortems

1. The dangers of alarm storms

Discussing the failures around the infamous accident at the Three Mile Island nuclear facility in 1979, the book looks at the situation faced by operators in the control room:

An indicator light in the control room led operators to believe that the valve was closed. But in reality, the light showed only that the valve had been told to close, not that it had closed. And there were no instruments directly showing the water level in the core so operators relied on a different measurement: the water level in a part of the system called the pressurizer. But as water escaped through the stuck-open valve, water in the pressurizer appeared to be rising even as it was falling in the core. So the operators assumed that there was too much water, when in fact they had the opposite problem. When an emergency cooling system turned on automatically and forced water into the core, they all but shut it off. The core began to melt.

The operators knew something was wrong, but they didn’t know what, and it took them hours to figure out that water was being lost. The avalanche of alarms was unnerving. With all the sirens, klaxon horns, and flashing lights, it was hard to tell trivial warnings from vital alarms.

Meltdown, p18 (emphasis is mine)

I often see a similar problem with the results reported from large so-called “automated test suites”. As such suites have more and more tests added to them over time (it’s rare for me to see folks removing tests; doing so is seen as heresy even when those tests may well be redundant), the number of failing tests tends to increase and normalization of test failure sets in. Amongst the many failures, there could be important problems, but the emergent noise makes it increasingly hard to pick those out.

I often question the value of such suites (i.e. those that have multiple failed tests on every run), but there still seems to be a preference for “coverage” (meaning “more tests”, not actually more coverage) over stability. A suite that tells you nothing different whether all of its tests pass or some fail is, to me, pointless and pure waste.

So, are you in control of your automated test suites and what are they really telling you? Are they in fact misleading you about the state of your product?
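To make that triage problem concrete, here’s a minimal sketch of one way to separate vital alarms from background noise: compare a run’s failures against the set of tests known to fail chronically, so that only *new* failures demand attention. The test names, data structures and triage rules here are my own invention for illustration, not anything prescribed by the book.

```python
# Hypothetical sketch: separating new failures from chronic "alarm storm" noise
# in a large automated suite. All names and data are invented for illustration.

def triage(results: dict[str, bool], known_failing: set[str]) -> dict[str, list[str]]:
    """Split a run's failures into new signals and chronic noise.

    results maps test name -> True (passed) / False (failed).
    known_failing is the set of tests that fail on (almost) every run.
    """
    failures = {name for name, passed in results.items() if not passed}
    return {
        "new_failures": sorted(failures - known_failing),      # vital alarms
        "chronic_failures": sorted(failures & known_failing),  # background noise
        "recovered": sorted(known_failing - failures),         # candidates to re-trust
    }

# Example run: test_search fails every time, test_checkout is a fresh failure.
run = {"test_login": True, "test_checkout": False, "test_search": False}
chronic = {"test_search"}
print(triage(run, chronic))
```

Of course, the deeper point of the section stands: a better fix than triaging the noise is to remove or repair the chronically failing tests so the suite goes back to being a clear signal.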

2. Systems and complexity

The book focuses on complex systems and how they are different when it comes to diagnosing problems and predicting failures. On this:

Here was one of the worst nuclear accidents in history, but it couldn’t be blamed on obvious human errors or a big external shock. It somehow just emerged from small mishaps that came together in a weird way.

In Perrow’s view, the accident was not a freak occurrence, but a fundamental feature of the nuclear power plant as a system. The failure was driven by the connections between different parts, rather than the parts themselves. The moisture that got into the air system wouldn’t have been a problem on its own. But through its connection to pumps and the steam generator, a host of valves, and the reactor, it had a big impact.

For years, Perrow and his team of students trudged through the details of hundreds of accidents, from airplane crashes to chemical plant explosions. And the same pattern showed up over and over again. Different parts of a system unexpectedly interacted with one another, small failures combined in unanticipated ways, and people didn’t understand what was happening.

Perrow’s theory was that two factors make systems susceptible to these kinds of failures. If we understand those factors, we can figure out which systems are most vulnerable.

The first factor has to do with how the different parts of the system interact with one another. Some systems are linear: they are like an assembly line in a car factory where things proceed through an easily predictable sequence. Each car goes from the first station to the second to the third and so on, with different parts installed at each step. And if a station breaks down, it will be immediately obvious which one failed. It’s also clear what the consequences will be: cars won’t reach the next station and might pile up at the previous one. In systems like these, the different parts interact in mostly visible and predictable ways.

Other systems, like nuclear power plants, are more complex: their parts are more likely to interact in hidden and unexpected ways. Complex systems are more like an elaborate web than an assembly line. Many of their parts are intricately linked and can easily affect one another. Even seemingly unrelated parts might be connected indirectly, and some subsystems are linked to many parts of the system. So when something goes wrong, problems pop up everywhere, and it’s hard to figure out what’s going on.

In a complex system, we can’t go in to take a look at what’s happening in the belly of the beast. We need to rely on indirect indicators to assess most situations. In a nuclear power plant, for example, we can’t just send someone to see what’s happening in the core. We need to piece together a full picture from small slivers – pressure indications, water flow measurements, and the like. We see some things but not everything. So our diagnoses can easily turn out to be wrong.

Perrow argued something similar: we simply can’t understand enough about complex systems to predict all the possible consequences of even a small failure.

Meltdown, p22-24 (emphasis is mine)

I think this discussion of the reality of failure in complex systems makes it clear that trying to rigidly script out tests to be performed against such systems is unlikely to help us reveal these potential failures. Some of these problems are emergent from the “elaborate web” and so our approach to testing these systems needs to be flexible and experimental enough to navigate this web with some degree of effectiveness.

It also makes clear that skills in risk analysis are very important in testing complex systems (see also point 4 in this blog post) and that critical thinking is essential.

3. Safety systems become a cause of failure

On safety systems:

Charles Perrow once wrote that “safety systems are the biggest single source of catastrophic failure in complex, tightly coupled systems.” He was referring to nuclear power plants, chemical refineries, and airplanes. But he could have been analyzing the Oscars. Without the extra envelopes, the Oscars fiasco would have never happened.

Despite Perrow’s warning, safety features have an obvious allure. They prevent some foreseeable errors, so it’s tempting to use as many of them as possible. But safety features themselves become part of the system – and that adds complexity. As complexity grows, we’re more likely to encounter failure from unexpected sources.

Meltdown, p85 (Oscars fiasco link added, emphasis is mine)

Some years ago, I owned a BMW and, it turns out, it was packed full of sensors designed to detect all manner of problems. I only found out about some of them when they started to go wrong – and they did so much more frequently than the underlying problems they were meant to detect. Sensor failure was becoming an everyday event, while the car generally ran fine. I solved the problem by selling the car.

I’ve often pitched good automation as a way to help development (not testing) move faster with more safety. Putting in place solid automated checks at various different levels can provide excellent change detection, allowing mis-steps during development to be caught soon after they are introduced. But the authors’ point is well made – we run the risk of adding so many automated checks (“safety features”) that they themselves become the more likely source of failure – and then we’re back to point 1 of this post!

I’ve also seen similar issues with adding excessive amounts of monitoring and logging, especially in cloud-based systems, “just because we can”. Not only can these give rise to bill shock, but they also become potential sources of failure in themselves and thereby start to erode the benefits they were designed to bring in diagnosing failures with the system itself.

4. The value of pre-mortems

The “premortem” comes up in this book and I welcomed the handy reminder of the concept. The idea is simple and feels like it would work well from a testing perspective:

Of course, it’s easy to be smart in hindsight. The rearview mirror, as Warren Buffett once supposedly said, is always clearer than the windshield. And hindsight always comes too late – or so it seems. But what if there was a way to harness the power of hindsight before a meltdown happened? What if we could benefit from hindsight in advance?

This question was based on a clever method called the premortem. Here’s Gary Klein, the researcher who invented it:

If a project goes poorly, there will be a lessons-learned session that looks at what went wrong and why the project failed – like a medical postmortem. Why don’t we do that up front? Before a project starts, we should say, “We’re looking in a crystal ball, and this project has failed; it’s a fiasco. Now, everybody, take two minutes and write down all the reasons why you think the project failed.”

Then everyone announces what they came up with – and they suggest solutions to the risks on the group’s collective list.

The premortem method is based on something psychologists call prospective hindsight – hindsight that comes from imagining that an event has already occurred. A landmark 1989 study showed that prospective hindsight boosts our ability to identify reasons why an outcome might occur. When research subjects used prospective hindsight, they came up with many more reasons – and those reasons tended to be more concrete and precise – than when they didn’t imagine the outcome. It’s a trick that makes hindsight work for us, not against us.

If an outcome is certain, we come up with more concrete explanations for it – and that’s the tendency the premortem exploits. It reframes how we think about causes, even if we just imagine the outcome. And the premortem also affects our motivation. “The logic is that instead of showing people that you are smart because you can come up with a good plan, you show you’re smart by thinking of insightful reasons this project might go south,” says Gary Klein. “The whole dynamic changes from trying to avoid anything that might disrupt harmony to trying to surface potential problems.”

Meltdown, p114-118

I’ve facilitated risk analysis workshops and found them to be useful in generating a bunch of diverse ideas about what might go wrong (whether that be for an individual story, a feature or even a whole release). The premortem idea could be used to drive these workshops slightly differently, by asking the participants to imagine that a bad outcome has already occurred and then to come up with ways it could have happened. This might yield the benefit of prospective hindsight as mentioned above. I think this is worth a try and will look for an opportunity to give it a go.

In conclusion

I really enjoyed reading “Meltdown” and it gave me plenty of food for thought from a testing perspective. I hope the few examples I’ve written about in this post are of interest to my testing audience!
