Notes from YOW! 2013: Michael T. Nygard on ‘Five Years of DevOps: Where are we Now?’

I attended Day 1 of YOW! Sydney 2013 and thought some people might get something useful out of my notes. These aren’t a complete account of every slide, just things I jotted down because they seemed interesting enough to remember or look into further.

[Image: Small child in a field looking into the distance with binoculars, as someone surveying the current state of DevOps probably wouldn't do.]

Michael T. Nygard (@mtnygard) is probably best known as the author of the 2007 book ‘Release It!’, which teaches developers how to look beyond just getting their code working and instead design it from the outset to handle the harsh conditions of production environments. He has since become a DevOps luminary and now works at Cognitect. He spoke at YOW! 2013 about ‘Five Years of DevOps: Where are we Now?’.

Michael started off setting the timeline by pointing out that Chef and Puppet were preceded by CFEngine in about 1993!

He explained how his own experience has contributed to his DevOps insight: he worked as a Dev in Ops for some time, showing the Ops team how to solve some of their problems with Dev-like approaches, but also finding lots of problems with the way Devs created software, which was the chief inspiration for his book.

He discussed how John Willis (@botchagalupe), whom he called “The Deming of DevOps”, tried to distil the approach of DevOps into an acronym and came up with CAMS, standing for: Culture, Automation, Measurement, Sharing.

Michael said he believes that, for a DevOps culture to take over, it needs to be focussed on enablement, not self-protection of teams. He posed these great questions:

What if the Ops team were measured on how fast software can be delivered to Production?

What if the Dev team were measured on the availability of production systems?

How would this change the approach that each team took and the things they prioritised?

He said that the DevOps approach to measurement that has evolved is to measure EVERYTHING so you can figure out what’s interesting after the fact. He suggested even measuring how many pizzas are being ordered on the company’s tab, which seems silly at first, but could be a leading indicator of turnover due to an increase in overtime.

He talked about Donella Meadows’ leverage points in a system and the importance of changing the structure of information flows, by which she means that changing (typically increasing) information flows alone can change behaviour. As a practical example, a wallboard of broken builds may help reduce the occurrence of breakages or the period for which breakages are left unfixed.

He echoed John Allspaw’s (@allspaw) great idea of doing the things that hurt more often so that you’re forced to address the source of pain. We used this to great effect a number of times at Tyro. One of the most significant examples was when we moved from four-weekly deploys back to two-weekly deploys with the intention of making our deploy pains more frequent, which in turn gave us the motivation we needed to automate away the majority of the (manual-step-induced) problems.

He highlighted the need to be careful not to blame people when they just happened to be present when a failure-inducing system hit the accident point.

“If you’re going to introduce DevOps to your org, start with continuous delivery”

Michael emphasised that the Continuous Delivery book was deliberately not called Continuous Deployment, because achieving continuous delivery means addressing all the upstream processes as well, not just the act of deploying.

He then started talking about the state of the art and where the movement needs to be growing next, starting with the advice: “If you’re looking for places to contribute to DevOps, please do not write another configuration tool.”

Answering the title question “Where are we?”, he gave the following rating for some aspects of DevOps:

Automated deployment is well advanced: A-

Automated provisioning, not so much: B

Logging is being addressed really well (Splunk, Logstash, Kibana): A+

Monitoring is really good (Nagios): A+ (From my understanding, I don’t think there are many Ops people in the industry who would agree that monitoring is a solved problem. Have a look through “The State of Open Source Monitoring”.)

Anomaly detection: C

System comprehension: D+  It’s still not easy to view a live topology and see the effects of potential outages on features.

Michael suggested that, if we wanted to start making a technological contribution to the DevOps movement, the last two problems would be really great places to focus.

Something else that’s seen some great leaps recently is “Antifragility”: the practice of improving systems by deliberately injecting random instability into them, e.g. Netflix’s Chaos Monkey. As a simpler example, he pointed out that deployments look like downtime, so doing them more often improves resilience to partial failures.

He brought up the concept that “Development is production”. By this he means that, as you move towards continuous delivery, a continuous flow of value through your system, the systems and tools used for development of the system become as mission critical as the systems and tools running the production system.

He believes “statistical sophistication” is improving (though I imagine people’s ability to say that phrase is not!). As an example, many people now recognise that 95th and 99th percentile response times are more important in determining user experience than average times, and people are getting the gist that they can learn a lot more by looking at distributions than just at summarised numbers.
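
To make the percentile point concrete, here is a small sketch of my own (not something shown in the talk). The response times are invented for illustration, and `percentile` is a simple nearest-rank helper I wrote for the example rather than any particular monitoring tool's definition.

```python
import math
import statistics

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical response times in milliseconds (made up for this example):
# 90 fast requests, 8 sluggish ones and 2 very slow ones.
response_times_ms = [100] * 90 + [800] * 8 + [5000] * 2

mean_ms = statistics.mean(response_times_ms)
median_ms = statistics.median(response_times_ms)
p95_ms = percentile(response_times_ms, 95)
p99_ms = percentile(response_times_ms, 99)

print(f"mean={mean_ms} median={median_ms} p95={p95_ms} p99={p99_ms}")
```

With this made-up data the mean sits at a comfortable-looking 254 ms, while the 95th and 99th percentiles (800 ms and 5000 ms) show what the unluckiest users actually experienced.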

He cited Dan McKinley’s blog post about effective web experimentation, where Dan pontificates on the outcomes of an A/B test that showed a 5% significant change in some metrics, before revealing the two sides of the A/B test had no difference at all. The lessons? If you collect enough noise, some of it will always look like signal. You need to know how you’re going to measure success before you start looking at your stats and trying to build a story.
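
The "noise looks like signal" lesson is easy to reproduce. The sketch below is my own illustration, not Dan's analysis: it runs many A/A comparisons where both sides share the same conversion rate, then counts how often a standard two-proportion z-test at the 5% level calls the difference significant. The visitor counts, conversion rate and seed are all arbitrary assumptions.

```python
import math
import random

def looks_significant(conv_a, conv_b, n, z_crit=1.96):
    """Two-proportion z-test for two arms of equal size n; True if |z| exceeds z_crit."""
    pooled = (conv_a + conv_b) / (2 * n)
    std_err = math.sqrt(2 * pooled * (1 - pooled) / n)
    if std_err == 0:
        return False
    return abs(conv_a - conv_b) / n / std_err > z_crit

def run_aa_tests(experiments, visitors, rate, seed):
    """Count the experiments that look significant even though both arms are identical."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(experiments):
        conv_a = sum(rng.random() < rate for _ in range(visitors))
        conv_b = sum(rng.random() < rate for _ in range(visitors))
        if looks_significant(conv_a, conv_b, visitors):
            false_positives += 1
    return false_positives

# Hypothetical setup: 400 comparisons, 1000 visitors per arm, 10% conversion rate.
false_positives = run_aa_tests(experiments=400, visitors=1000, rate=0.1, seed=2013)
print(f"{false_positives} of 400 identical A/A comparisons looked significant")
```

Roughly one comparison in twenty should come out "significant" purely by chance, which is why the success metric needs to be chosen before looking at the numbers.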

Michael then talked about some things that are not going so well…

He said DevOps is probably near the “Peak of Inflated Expectations”, as named by Gartner’s Hype Cycle, which means the next few years may well see a descent into the “Trough of Disillusionment”.

He said there is starting to be a lot of commercial interest in DevOps, so we should expect to see “Agile Cloud DevOps Edition”s of lots of tools soon.

What might be the take-away message for a lot of people:

“The ‘DevOps group’ is not a thing, and should not be a thing.”

This resonated a lot with my own review of the crux of DevOps: it’s not about creating a new team of people to solve everyone else’s problems, it’s about everyone undergoing a culture change to solve our problems together.

He talked about the “paradox of automation”: that as we automate more, we start losing some of the skills needed when the automation fails. I don’t remember him connecting this in the talk, but one obvious way to address this is through practice, using the antifragility techniques he raised earlier to rehearse handling intentionally-created disasters during normal business hours.

He also talked about the system complexity that arises when automated features start relying on each other, which can lead to strange outcomes. I think he shared this example of two Amazon Marketplace sellers who both had algorithms increasing the price of a textbook to the point where one of them was offering to sell it for $24 million.
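
A feedback loop like that takes surprisingly few steps to run away. Here is a minimal sketch, assuming one seller always prices slightly under the other while the other always prices at a premium over the first. The starting price and both multipliers are hypothetical values I picked for illustration, not the sellers' actual rules.

```python
def simulate_repricing(start_price, undercut_factor, markup_factor, ceiling, max_rounds=1000):
    """Alternate two repricing rules until seller B's price passes the ceiling.

    Returns (rounds, price_a, price_b).
    """
    price_a = price_b = start_price
    for rounds in range(1, max_rounds + 1):
        price_a = undercut_factor * price_b  # seller A prices just under seller B
        price_b = markup_factor * price_a    # seller B prices at a premium over seller A
        if price_b > ceiling:
            return rounds, price_a, price_b
    return max_rounds, price_a, price_b

# Hypothetical numbers: a $35 book, a 0.2% undercut and a 27% markup.
rounds, price_a, price_b = simulate_repricing(
    start_price=35.0,
    undercut_factor=0.998,
    markup_factor=1.27,
    ceiling=24_000_000,
)
print(f"After {rounds} repricing rounds: A=${price_a:,.2f}, B=${price_b:,.2f}")
```

Each rule is sensible on its own. Because the product of the two multipliers is greater than 1, every round compounds the price, and neither algorithm has a sanity check to stop it.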

Michael said that ITIL is not inherently incompatible with DevOps, but ITIL processes sometimes are.

He said some IT groups are feeling under threat from “cloud”, and that can make them defensive against ALL suggestions for changing processes.

He took a few questions and someone asked about the effectiveness of blameless retros. Michael said he had read that the safest hospitals are often the ones that have the highest number of accident reports. This paradox arises because a culture of reporting near misses and dealing with inadequate circumstances early results in the hospital improving its processes before real accidents occur.

Want to learn more?

The slides from Michael’s talk are available on the YOW! website.

Image credit: ‘Kenzie using her binoculars to survey the land’ by Dirk Dallas
