Failure in Complex Systems
After listening to the hosts discuss probability on ATP this week, I was most of the way through writing something that, had I finished it, would have been this Dr Drang post only not nearly as good. (I will confess that my motivation was exactly the same as his: “People believe John”.) In fairness I don’t blame them for getting confused, because probability really is confusing and I’m terrible at it myself. Seeing as the good doctor has already intervened, there’s no need to repeat what he said. But let me add a small coda. Dr Drang wonders why things went off the rails in the first place. He suggests that perhaps “Marco’s confusion about multiplication came from a hazy memory of multiplying the probabilities of the complements of failures”. I have a slightly different answer.
The reasonable thought behind Marco and John’s discussion is that adding elements to a system makes errors more likely in some quicker-than-expected way. The intuition here runs not to probability but to the closely related world of combinations and permutations. Let’s say you’re Casey Liss’s Dad and it’s 1986. You have a phone, a thermostat, a lightbulb, a stereo, a garage door opener, and a home computer. Each of these items has some probability of failing. Assuming the events are independent, the probability of the phone breaking at the same time the thermostat fails is, as Dr Drang explains, very small. A small chance times a small chance is an even smaller chance. The probability of all those devices independently failing at once is absolutely tiny.
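To put rough numbers on the 1986 house, here is a minimal sketch of the joint-failure arithmetic. The per-device probabilities are invented purely for illustration, and the calculation assumes the failures are independent, as above.

```python
from math import prod

# Hypothetical chance that each device fails in some fixed window.
# These numbers are made up for illustration only.
failure_prob = {
    "phone": 0.01,
    "thermostat": 0.02,
    "lightbulb": 0.05,
    "stereo": 0.01,
    "garage door opener": 0.02,
    "home computer": 0.03,
}

def joint_failure(probs):
    """Probability that all of the given independent failures happen together."""
    return prod(probs)

# Phone and thermostat failing together: a small chance times a small chance.
both = joint_failure([failure_prob["phone"], failure_prob["thermostat"]])

# All six devices failing at once: smaller still.
everything = joint_failure(failure_prob.values())

print(f"phone and thermostat: {both:.6f}")
print(f"all six at once:      {everything:.3e}")
```

Whatever numbers you plug in, so long as each is below one, the product only shrinks as you add devices to it.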
Now let’s suppose that you’re Casey Liss and it’s 2015. You have all the same items in your house, but they all talk to one another in your spiffy home automation system. (I’m oversimplifying, obviously. Maybe the garage door opener can’t talk to the lightbulb yet. But you get the idea.) What matters here is not the still-low probability of independent devices jointly failing, but the relatively large number of newly possible device interactions, each with its own chance of going wrong. As Marco remarked, “we make our devices do more and have more devices interacting with each other and with cloud services”. This creates the possibility of a little combinatorial explosion. What used to be six independent devices is now a network of interdependent entities.
In the simplest case, linearly adding devices that can all talk to each other increases the number of possible pairwise interactions quadratically: n devices make for n(n-1)/2 pairs. Five devices make for ten combinations of two-way communication; six devices make for fifteen; seven for twenty-one, and so on. You’re not multiplying probabilities of joint failure, you’re increasing the sheer number of possible interactions in the system. This increases the overall chance that something, somewhere, will go wrong. Moreover, the number of new points of potential failure increases much faster than the one-at-a-time rate you’re adding devices. And while the lightbulb may not yet talk to the garage door, in real systems the proper way to enumerate things that might fail is to count not just the number of pairwise or multiplex interactions within and between physical devices, but the number of distinct modes or processes of interaction that each device can have with itself and all the other devices. From there it’s a short hop to the nightmarish world of unexpected interdependencies, unforeseen conflicts, and cascading failures.
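The pair-counting above is easy to check. Here is a rough sketch that counts the two-way links among n fully connected devices and then, under the strong and purely illustrative assumption that each link independently goes wrong with the same small probability, estimates the chance that at least one of them does. The function names and the 1% figure are mine, not anything from the show.

```python
from math import comb

def pairwise_links(n_devices):
    """Number of distinct two-way links among n fully connected devices."""
    return comb(n_devices, 2)  # equals n * (n - 1) / 2

def p_any_failure(n_devices, p_link):
    """Chance that at least one link misbehaves.

    Assumes every link fails independently with the same probability
    p_link, which is a simplification made for illustration.
    """
    return 1 - (1 - p_link) ** pairwise_links(n_devices)

P_LINK = 0.01  # hypothetical per-link chance of something going wrong

for n in range(5, 9):
    links = pairwise_links(n)
    print(f"{n} devices: {links} links, "
          f"P(at least one problem) = {p_any_failure(n, P_LINK):.3f}")
```

Each new device adds as many links as there were devices already, so the chance of at least one problem climbs faster than the device count does, even though no individual link got any less reliable.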
The organizational sociologist Chick Perrow wrote a book about this once, called Normal Accidents, arguing that in sufficiently complex interacting systems accidents are “unexpected, incomprehensible, uncontrollable and unavoidable”. So whereas back in 1986 Mr Liss, Sr, could buy a new TV and not have to worry about whether it was going to have a bad effect on his phone, Casey might have to wonder whether adding a smart TV to his home network is going to do something bad to his garage door.