Last week at Tyro we had a fairly serious production issue. Thankfully, the impact was nowhere near as serious as the kind of outages that most Aussie banks have delivered over the last couples of years; our merchants could still accept payments and they could access their reports, but our backend processing was banked up such that their transactions weren’t appearing on reports for about an hour. That might not sound too serious, but our merchants are accustomed to seeing transactions appear in their reports pretty much instantly, so when our monitoring told us we weren’t delivering on that expectation we considered it a serious incident.
There was lots of good news out of this. Dev and Ops rallied as one team. We fixed the problem, and we managed to fix it without deploying a code patch. We learnt about an important performance restriction in our system that was fixed the next day and gave us knowledge that we can use to improve going forward. And we managed to get it solved before the last bus on my route for the night.
The bad news was that it took us a long time to get to the good news: it was about nine hours from the first indication of the incident to when we finally executed the winning solution. Looking back, I feel a bit stupid that we didn’t – that I didn’t – solve it in a quarter of that time. All the information we needed to lead us to the solution was staring us in the face, right from the beginning.