I really dislike writing post openings. It feels tedious to define the problem in leading ways that will entice a reader, especially when the important bits are further down. I wish I could just press a button and get a Cory-like opening blurb with some toilsome bits like “so and so is defined by wikipedia as” and a pithy joke. I wanna automate it.

This post is part of a series on automation, the result of many months of research and reading. I may adjust these posts as my research grows. If you've got comments, leave feedback at the bottom!

Thanks to Arijit Mukherji, Franklin Hu, Jay Shirley, Rajesh Raman, and Sam Boyer for their feedback and reviews of these posts.

Automation is a process being performed with minimal, if any, human involvement. DevOps and SRE commonly recommend aggressive automation to do more with less in modern, complex systems. There’s certainly a lot to be gained from automation, but there’s also a downside when this technology replaces human involvement.

Toxic Teammates

Have you ever worked with a stubborn, uncommunicative teammate? Maybe they didn’t contribute to joint projects or did contribute but didn’t share until the end? Perhaps they insisted on dropping everything at exactly noon every day so they could get their favorite sandwich for lunch, work be damned, and left everyone else to deal with the mess.

Such a teammate is often considered toxic and ruinous to teams everywhere. My goal today is to show you how the automation you’re making in your job is generating these “teammates” and wrecking your happiness, reliability and ability to do cool new things.

Toil and Trouble

One of my responsibilities is keeping an API library up to date. Inside this API is a list of acceptable names for cloud provider services. That list changes periodically and is kept in a bit of Java source code as a map. So every few weeks I copy the block from Java into Go and — by hand — rejigger it into Go syntax and commit it. Each time I think about how I could write a parser to do this… or maybe ask the other team to move the definitions to YAML files so I could rid myself of this toilsome work. Sometimes I even typo a definition and release it to the world. “Damn!” I say, reminding myself that if this were YAML and I had a script I could avoid this happening.

This sort of tedium is common in jobs. Despite my whining, I’ve only had problems with this process two or three times in a year. Each time it was obvious to me what had happened, and it was fixed quickly and effectively with very little customer impact. This is because I, as a human, am full of wonder: foresight, imagination, adaptability, and ingenuity. While I make errors on some occasions, I more often create safety.1 When I do make errors, I quickly recognize and correct them using my adaptability and ingenuity. Compare this to the heap of new problems I might bring by introducing new dependencies to a well-understood process.

How Automation Hurts

The point I’m getting to here is that, despite our seemingly genetic distaste for toil, we should be extremely careful about reaching for automation.

Automation has generally been introduced to meet the needs of the process rather than the needs of the people working with the process.2

My plan is to scare you sober by showing all the ways that automation is like that toxic teammate. In a follow-up I’ll help you learn how to avoid these problems.

Automating a process without proper consideration, design, and planning can create technical debt, cause incidents, and undermine all your hard work.

For the rest of this post, imagine a common bit of automation in many orgs: autoscaling. Some sort of latency metric is monitored and, based on thresholds, some other resource like compute is scaled up or down. This replaces a human looking at charts and making a judgement call, or missing an unexpected surge because they were busy watching Tiger King and eating half-gallons of ice cream. Not that I’ve done that.
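To make the running example concrete, here is a minimal sketch in Go of the kind of threshold autoscaler I have in mind. Everything in it is made up for illustration: the Thresholds type, the desiredInstances function, and the numbers. A real one would read from a metrics system and call a cloud API, both of which are left out.

```go
package main

import "fmt"

// Thresholds holds hypothetical latency bounds (in milliseconds) and
// instance limits. The names and values are illustrative assumptions.
type Thresholds struct {
	ScaleUpAboveMs   float64
	ScaleDownBelowMs float64
	MinInstances     int
	MaxInstances     int
}

// desiredInstances returns the instance count for one evaluation tick:
// one more when latency is high, one fewer when it is low, clamped to
// the configured minimum and maximum.
func desiredInstances(current int, latencyMs float64, th Thresholds) int {
	next := current
	switch {
	case latencyMs > th.ScaleUpAboveMs:
		next = current + 1
	case latencyMs < th.ScaleDownBelowMs:
		next = current - 1
	}
	if next < th.MinInstances {
		next = th.MinInstances
	}
	if next > th.MaxInstances {
		next = th.MaxInstances
	}
	return next
}

func main() {
	th := Thresholds{ScaleUpAboveMs: 250, ScaleDownBelowMs: 50, MinInstances: 2, MaxInstances: 20}
	for _, latency := range []float64{300, 120, 10} {
		fmt.Printf("latency=%.0fms -> %d instances\n", latency, desiredInstances(4, latency, th))
	}
}
```

Notice how little the function knows: one number in, one number out. It has no idea why latency changed or what else shares the same compute.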

Imagine our new autoscaling system has worked well for weeks, then a surge in traffic occurs. The autoscaler automation does its job and no human needs to get involved! Sadly, this load consumes all the compute your cloud provider has allotted to you. Tasks across the org begin failing with esoteric error messages as compute grinds to a halt. Since most of us skip error checking for operations that generally succeed, we may not even have error messages!
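Here is a rough Go sketch of that failure, under heavy assumptions. The Account type and ErrQuotaExceeded are hypothetical stand-ins for an org-wide allotment and whatever your provider actually returns; no real cloud API looks like this. It shows the same provisioning call failing twice, once unchecked and once checked.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrQuotaExceeded is a hypothetical error standing in for a provider's
// quota-limit response.
var ErrQuotaExceeded = errors.New("compute quota exceeded")

// Account models a shared, org-wide compute allotment (illustrative only).
type Account struct {
	Quota int
	InUse int
}

// Provision reserves n instances, or fails without reserving anything.
func (a *Account) Provision(n int) error {
	if a.InUse+n > a.Quota {
		return fmt.Errorf("provision %d instances (%d/%d in use): %w",
			n, a.InUse, a.Quota, ErrQuotaExceeded)
	}
	a.InUse += n
	return nil
}

func main() {
	acct := &Account{Quota: 10}

	// The autoscaler reacts to the surge and takes the whole allotment.
	if err := acct.Provision(10); err != nil {
		fmt.Println("autoscaler:", err)
	}

	// Unchecked: another team's job assumes it got compute and will fail
	// later, somewhere else, with a less helpful message.
	_ = acct.Provision(2)

	// Checked: the failure names its cause at the point where it happened.
	if err := acct.Provision(2); errors.Is(err, ErrQuotaExceeded) {
		fmt.Println("batch job:", err)
	}
}
```

Even the checked version only tells the human that the quota is gone. It says nothing about the autoscaler being the one that took it.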

“But Cory”, you say with a smirk, “we have tons of automated things underlying our entire lives and we’re getting by”. Sure, we are. The key difference is definedness. There are some processes which are so well defined, and so unlikely to encounter problems that we’ve been able to free ourselves completely from the toil. For more complex or poorly defined situations, however, human capabilities are still essential.2

Automation Requires More Of Humans

Automation removes the human from involvement in the operation. This is a blessing in reduced fatigue or improved productivity. Unfortunately it’s a curse in situation awareness.

These kinds of second-order effects are common with automation because humans are “out of the loop”. This means additional time is required for responders to get acquainted with the system, how it works, and what can be done about it.3 If the user doesn’t know about the automation or has forgotten its logic, they may end up fighting what seems like inexplicable behavior! This price is paid at a shitty time, as we may be dealing with customer side effects and blowing up the entire org’s productivity by invoking the incident machinery.

This situation leaves us with an irony and a paradox.

Irony: the more complex an automation, the more crucial the human becomes.4 Our autoscaler, meant to improve latency, has instead caused a complex series of second-order failures that one human, or many, must now sort through.

Paradox: our automation was intended to remove the need for humans, but instead we’ve made a new, different joint human-computer doodad.4

Automation Creates New Problems

When we set out to make our autoscaler or any other automation, our goal was to reduce the effort of a task and/or improve its accuracy. This goal is so strong that we generally miss, or don’t bother to imagine, the side effects that come with the benefits. Adoption of any technology, of which automation is a form, increases the need for coordination and creates new situations and new failures. “It changes what is canonical and what is exceptional.”5

Consider our earlier example. Not only do we have new failure modes, we have additional process and state that humans must internalize. Before the autoscaler we had one set of problems, now we have exciting new problems!

Automation Increases Complexity

The productivity gain from automation is tantalizingly quantifiable. The ramifications are frustratingly qualitative. The time spent staring at charts, editing files, and executing changes can be added up in a spreadsheet and celebrated at review time. What will we do with all this extra time?

We create even more complexity, that’s what. We’ll go and automate another thing, or create a new thing that needs automation later. This is a form of The Law of Stretched Systems6:

Every system is stretched to operate at its capacity; as soon as there is some form of improvement, for example, in the form of new technology, it will be exploited to achieve a new intensity and tempo of activity.

We’ll take this newfound free time and permission to make more complexity without realizing we’re going to pay later.

Harkening back to our earlier problems, we can also look to cybernetics for the Law of Requisite Variety, which warns us that a controller (which is what our automation is) must have at least as many states as the system it controls. The combo finisher is Brian Kernighan’s famous admonition that debugging is twice as hard as programming. Can you debug this automation if it’s more complex than its target process? What about when multiple pieces of automation start interacting?

Automation Is Design

Automation usually begins from a point of frustration. Our autoscaler was likely born either from an incident remediation or someone who was sick of staring at charts. Our aforementioned quantitative improvements spur us into action. The autoscaler is only a state machine, right? You’ve written a zillion of those!

The repercussions of adding automation warrant research, user interviews, collecting feedback, and all that other work that isn’t coding. I’m talking about design here.

Design is never neutral, so every change you effect or error you emit benefits from design.7 How will users know the autoscaler has taken action? Will the autoscaler make correct choices when faced with increasing latency and capacity for other functions? Can users disable the autoscaler? These are all essential questions to factor into your design.
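Two of those questions, visibility and an off switch, are easy to sketch in Go. The Autoscaler and Action types below are hypothetical and only one possible answer. The assumption is that every decision gets recorded with a reason, and that a disabled autoscaler still records what it would have done.

```go
package main

import "fmt"

// Action records one decision so people can see what the automation did
// and why. The shape of this record is an illustrative assumption.
type Action struct {
	From, To int
	Reason   string
}

// Autoscaler is a hypothetical controller with an off switch and an
// audit trail.
type Autoscaler struct {
	Enabled   bool
	Instances int
	Log       []Action
}

// Apply moves to the target count and records the reason. When the
// autoscaler is disabled it changes nothing and records the skip.
func (a *Autoscaler) Apply(target int, reason string) {
	if !a.Enabled {
		a.Log = append(a.Log, Action{
			From:   a.Instances,
			To:     a.Instances,
			Reason: "disabled; skipped: " + reason,
		})
		return
	}
	a.Log = append(a.Log, Action{From: a.Instances, To: target, Reason: reason})
	a.Instances = target
}

func main() {
	as := &Autoscaler{Enabled: true, Instances: 4}
	as.Apply(6, "p99 latency above threshold")

	as.Enabled = false // an operator takes manual control
	as.Apply(2, "p99 latency below threshold")

	for _, act := range as.Log {
		fmt.Printf("%d -> %d: %s\n", act.From, act.To, act.Reason)
	}
}
```

The log only helps if someone knows it exists and where to find it, which is a design problem rather than a coding one.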

Automation Reduces Optionality

Humans have created some beautiful (and some despicable) things. A human as part of a system means that system can still be adaptive.1 When we remove humans, we remove this adaptive capability. Yeah, yeah, I know we have ML and AI, but these are, for now, very crude in comparison to humans. Using them is, in effect, even more automation that we must understand. Eep!

To automate a process requires a very specific, fixed set of instructions. Do you understand the process and its ramifications well enough to do that yet? Automating a process requires design and choices, which can reduce the freedom afforded by continued learning, evolution, and adaptation from the human operator.5 Doing this too early can result in shortcomings, bugs, and technical debt.

A human is aware of seasonal differences, like Black Friday, and would temper their actions accordingly. A human would recognize a network outage and not scale the compute down to 0 when the latency metric is 0 or missing. These lessons must be learned before we can rely on automation, lest we realize the repercussions in embarrassing incidents.
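The outage lesson can be written down once you have learned it. This Go sketch covers only the scale-down side and assumes a hypothetical Sample type whose OK field is false when the metric is missing. The rule it encodes is an assumption of mine: a latency of zero or no reading at all means "I don't know", not "everything is fast".

```go
package main

import "fmt"

// Sample is one latency reading. OK is false when the metric is missing.
// The type is an illustrative assumption.
type Sample struct {
	LatencyMs float64
	OK        bool
}

// nextInstances decides whether to scale down. It holds the current
// count whenever the signal cannot be trusted, and never goes below
// the floor.
func nextInstances(current, floor int, s Sample, scaleDownBelowMs float64) (int, string) {
	if !s.OK || s.LatencyMs <= 0 {
		return current, "hold: latency missing or zero, possibly an outage"
	}
	if s.LatencyMs < scaleDownBelowMs {
		if current > floor {
			return current - 1, "scale down: latency is low"
		}
		return current, "hold: already at the floor"
	}
	return current, "hold: latency is not low"
}

func main() {
	samples := []Sample{
		{OK: false},                // metric pipeline is down
		{LatencyMs: 0, OK: true},   // reported, but implausible
		{LatencyMs: 20, OK: true},  // genuinely quiet
		{LatencyMs: 180, OK: true}, // normal load
	}
	for _, s := range samples {
		n, why := nextInstances(4, 1, s, 50)
		fmt.Printf("%+v -> %d (%s)\n", s, n, why)
	}
}
```

The guard is three lines, but someone had to live through the outage, or imagine it, before those lines could exist. Black Friday has no equivalent three lines.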

Automation Is Brittle And Dangerous

You’re still here reading, so you didn’t bail early. Those who did probably think this post is some sort of Luddite position that we should stop or cast off automation. I’m ok with that assessment if it slows engineers down and encourages them to think through when and how to automate something. By all means read this halfway through and talk shit about it, so long as it scares you.

Really, the opposite is true. I’m in awe of our automated accomplishments. My issue is with the wreckage we leave in our wake in the form of half-assed resilience. Automation is incredibly powerful, but so is human capability. Deploying automation too soon can result in a rickety, dangerous foundation that humans prop up with grueling on-call schedules and that leaves customers unhappy.

The next time you feel the urge to automate, instead begin a design document. Better yet, marvel at your own antientropic powers and keep learning so you can write a better design document later.

References