Last week at Tyro we had a fairly serious production issue. Thankfully, the impact was nowhere near as serious as the kind of outages that most Aussie banks have delivered over the last couple of years; our merchants could still accept payments and access their reports, but our backend processing was banked up such that their transactions weren’t appearing on reports for about an hour. That might not sound too serious, but our merchants are accustomed to seeing transactions appear in their reports pretty much instantly, so when our monitoring told us we weren’t delivering on that expectation we considered it a serious incident.
There was lots of good news out of this. Dev and Ops rallied as one team. We fixed the problem, and we managed to fix it without deploying a code patch. We learnt about an important performance restriction in our system, which was fixed the next day, and came away with knowledge we can use to improve going forward. And we managed to get it solved before the last bus on my route for the night.
The bad news was that it took us a long time to get to the good news: it was about nine hours from the first indication of the incident to when we finally executed the winning solution. Looking back, I feel a bit stupid that we didn’t – that I didn’t – solve it in a quarter of that time. All the information we needed to lead us to the solution was staring us in the face, right from the beginning.
How did a team of very smart guys manage to take such a long time to solve something that, in retrospect, was so easily solvable? Did we get caught in a trap? If so, how can we avoid it next time? We held a retrospective after the dust settled and came up with a number of good ideas for improving our response to critical incidents. As I reviewed my own performance on the way home, I identified three weaknesses in my thoughts and actions during the day that I think prevented me from finding the solution quicker. I’ve summarised them (so I can remember them) as Target, Distraction and Method.
We spent a lot of time looking at the wrong things – no doubt about it. It appeared that our system had stopped processing transactions, and we wanted it to start, so we came up with ideas of how to make it start again. One of them worked and it started, but then it stopped again. So we started experimenting with ideas that we thought might prevent it from stopping again. They didn’t work, and this slow learning loop meant the backlog of unprocessed transactions kept growing.
In our strong desire to get our merchants’ transactions processed, we were asking, “How do we get the transaction processing going?”, but the question we should have been asking all along was, “Why did the processing stop?” We had subtly and unwittingly set our targets on the symptom, but we should have targeted the problem.
To say that we spent all of our time trying to fix the symptoms and none of it looking for the problem would be inaccurate. The other software engineers and I did spend time looking through our code, trying to deduce what might be going on. However, I never felt like I got anywhere near being ‘in flow’; there was often someone looking over my shoulder, or asking me questions, or I was asking other people questions, or there was a conversation about the current status going on just within earshot. I would guess that the longest period I spent looking at code all day was five minutes – hardly enough time to formulate one hypothesis and test it by walking through the code, let alone to test a whole suite of theories.
I wrote earlier this week about my unpatented but still pretty effective method for debugging, which basically boils down to: identify untested assumptions that could be the cause and test them. Did I use my tried and tested debugging method when we were dealing with this production issue? Nope. Why not? I’d say it was partly due to the distraction, but I also think there was a pressure that changed my psychology, pushing me into a “fight or flight” response such that I never got around to thinking logically about the problem we were facing.
Post hoc ergo propter hoc (“after this, therefore because of this”)
I guess you want to know what the assumption was that we failed to challenge early on. Well, about half an hour before our transactions stopped processing, there was a small spike in an unrelated event type on the system. The spike resolved itself quickly, so we focused our attention on the pressing transaction problem because we knew the two issues were unrelated. You see what we did there? We assumed that what we thought we knew about our system – that these two processes were not connected – was true, but we never tested it. The truth was, of course, that they were connected, that their connection was the very reason for our transaction processing halt, and if we had looked for that connection at the very beginning we could have resolved our issue much quicker.
Experience by itself teaches nothing
You don’t learn from these things just by being there. You have to take the time out to reflect and assess, then consider alternative realities that might have been nicer (or worse!) had you acted differently. For me personally, I came away with these things that I’d like to do differently the next time I’m dealing with a production incident:
- Target the problem instead of the symptoms
- Take a time-out from all the distraction to think deeply and get into a debugging flow
- Use my normal debugging method, not some untested panic mode variant
- Don’t assume that other unusual behaviour in the system is unrelated
In our team retrospective for this incident, we spent quite a bit of time discussing the ‘distraction’ point above. Our CIO reminded us of the scene in Apollo 13 where the CO2 filter is running out in their lunar module lifeboat and a bunch of engineers are practically locked away in a back room with a problem: figure out how to build a new filter out of a random assortment of space travel paraphernalia. The engineers were removed, both physically and by lack of communication, from the pressure, hubbub and distraction of the command centre so that they could get in flow and solve a problem that needed deep thought and free experimentation.
We decided if it’s good enough for NASA, it’s good enough for us, so the next time we have a production incident our plan is to split into two teams: the Ops team and an engineer or two will continue monitoring production, trying to get more information and experimenting with possible quick fixes, while another group of engineers will be “locked away” to think about the problem with as little distraction as possible. I think this is a great idea and I’m keen to report back how it goes, but don’t hold your breath: this was the first incident of any significant magnitude that we’ve had in about four years.