When the Production Queue Stopped

Unusual pain

Last week at Tyro we had a fairly serious production issue. Thankfully, the impact was nowhere near as serious as the kind of outages that most Aussie banks have delivered over the last couple of years; our merchants could still accept payments and they could access their reports, but our backend processing was banked up such that their transactions weren’t appearing on reports for about an hour. That might not sound too serious, but our merchants are accustomed to seeing transactions appear in their reports pretty much instantly, so when our monitoring told us we weren’t delivering on that expectation we considered it a serious incident.

There was lots of good news out of this. Dev and Ops rallied as one team. We fixed the problem, and we managed to fix it without deploying a code patch. We learnt about an important performance restriction in our system, which was fixed the next day, and we gained knowledge that we can use to improve going forward. And we managed to get it solved before the last bus on my route for the night.

Success… eventually

The bad news was that it took us a long time to get to the good news: it was about nine hours from the first indication of the incident to when we finally executed the winning solution. Looking back, I feel a bit stupid that we didn’t – that I didn’t – solve it in a quarter of that time. All the information we needed to lead us to the solution was staring us in the face, right from the beginning.


My Debugging Secrets Revealed

I’ve always been pretty good at debugging. Until a couple of years ago I’d never thought much about why I find it easy, but once I realised that I didn’t know why I was good at something, I had to know. So I dedicated some time to analysing my own internal, instinctive thought process, and from what I’ve observed it can be reduced to this:
