Lately the job boards have been filled with ads that look something like this:
Seeking Senior DevOps Engineer
* Must be able to debug all databases created since 1980
* Be a core contributor to at least 10 open source projects
* Have experience with Go, Java, Python, Ruby, and C#
* Understand the kernel and be able to debug panics at 3AM
* Be willing to participate in the on-call rotation
* Insert some other absurd skill here!
Am I the only one who thinks this is a joke? Hiring one person isn’t going to make a DevOps implementation successful. DevOps is a shift in how work flows to the various parts of your organization, and in who takes responsibility when certain actions occur. It is the next evolution of the cross-functional team.
Why is it so difficult to have success with an idea that seems so simple? At Rally we’ve found that it stems from two issues:
- Development teams are asked to own their services in production, but lack the necessary access to resolve problems.
- Operations teams are very interrupt-driven, so asking them to build tools and systems for other people is never their top priority.
It’s taken us some time to find the right solution to these two problems, and while our solution may not be right for everyone, it’s worth consideration in most engineering organizations.
Iteration 1: Embed a Core Ops Member on a Dev Team
A few years ago, when we started moving towards a Service Oriented Architecture (SOA), the decision was made that development teams should “own” the code they put in production and therefore should be “on-call” in case an outage occurred. Great idea, right? Systems administration isn’t going to know how to fix a problem that keeps happening every night at 2 AM, but the developer who wrote that code will. So let them get the alerts and provide them with the mechanisms to help debug issues as they occur in production.
However, due to organizational constraints, we did not give the developers production access. So we effectively tied their hands behind their backs while asking them to tread water in the open ocean. There was no way this would ever work. We needed to provide certain tools that would help the developers understand what was happening with their application. So we increased our metrics-gathering and logging capabilities and thought all would be well with the world.
At the same time, the team building this first service needed to spin up new hosts and get them configured in the same way as our other production hosts: another task they could not complete. Furthermore, our operations team was busy handling other support cases for our main application (ALM), which were more important than spinning up new hosts for some new service (that wasn’t even production-ready yet).
So our engineering leadership made the decision that a core operations member would go and work with this service team and provide them with the needed system administrator skills to get their application into production. He would help bring up new hosts, get them configured (which was done by hand at the time), set up deployment pipelines, and help debug issues that occurred in production while the application was being dark-launched. This approach worked well. The team found they had a lot more throughput and could experiment without having to bother several other teams to get the work done.
But ultimately this iteration wasn’t tenable. We couldn’t scale having an operations team member embedded on every service team we created. Plus, those teams were focused on delivering features, not solving the real problem at hand: automation and lack of tooling for production.
Iteration 2: Tooling Team, Take One
So we formed a new team that would work on speeding up the delivery lifecycle for the engineering organization. The thought was that by giving this team a very specific goal, they would be less likely to be interrupted by other tasks.
We couldn’t have been more wrong. In fact, this team found that 80% of the work done was interrupt-driven, and therefore we could never accomplish the task we set out to do.
You’re probably wondering: Why was this team interrupted so often if it had such a specific task to perform? Well, part of it was the team make-up. One member of our team was the sole person responsible for maintaining our build infrastructure (the machines where our CI jobs were executed). This meant most of his time was spent debugging issues around that system, instead of helping the team. In hindsight we should have found someone else to own that infrastructure (which we’ve now done), but when we were doing this it was hard to justify pulling someone off another team to maintain the machines when we already had someone.
Another reason we were interrupted was that we pushed out the first iteration of our configuration management solution way too early, and the teams that chose to use it were constantly finding problems. We’d have to drop everything and rush to fix their pipelines, so they could get their builds out to production. This sometimes took days or weeks, depending on the situation. We also spent a ton of time trying to automate the configuration of systems that really didn’t need to be automated in the first place.
Iteration 3: DevOps Team, aka SysAdmin for Hire
During the iteration 2 experiment, we also hired several system administrators in our remote offices to help facilitate the production tasks for the service teams. They worked outside of the core operations team and were often divided among multiple teams (unlike our iteration 1 experiment). This was really unfortunate for the team whose work needed to be performed in production, but whose “DevOps” person was working with another team. Since their tooling support was incomplete, the iteration 2 team often felt blocked.
This had an unintended consequence: each team got automation support for its service, but in a silo very specific to its stack. Their “DevOps” person would help them operationalize that tool, so when they needed to perform a specific task they’d just run a Jenkins job or execute some CLI task. This was great for them but made it hard for other teams to reuse the scripts.
The “DevOps” team members were unable to spend their time building generic tools for their teams (or other teams for that matter) because they were now the point of interrupt for production outages and other tasks. Fighting fires became their full-time job.
Iteration 4: Merge the Tooling Team with Ops
Eventually we realized that the tooling team from iteration 2 was building many of the things that our core operations team had wanted to build for years. So we did what every other engineering department would do: we merged them, and created a new team called Infrastructure Engineering. This team’s goal was to build reproducible infrastructure using tools like Chef and Docker, facilitating the goal of delivering services to production faster. It would ensure that all tasks were automated and had sensible UIs for developer interaction (whether CLI or web).
But here, all we did was take two teams that were already experiencing high interrupt rates and physically relocate them next to each other in the office. We did nothing to address the interrupt-driven lifestyle that had become commonplace among both teams. The utopia they thought would result from the merger quickly faded. After several developers left the team for various reasons, it was time to reevaluate the problems:
- Our core operations team is specifically an interrupt-driven team. They fight the fires that other teams cannot on their own.
- Developers were being alerted or paged when their applications failed in production but didn’t have appropriate access to fix the problems, which caused more interrupts for the core ops team.
- Compliance requirements were going to eat up even more of the core operations team’s time.
- More and more services were coming online and we did not have the correct automation in place to make this easier (spinning up new VMs, for instance).
Iteration 5: The Hard Decision
During our Q1 PSI planning at the beginning of this year, our engineering leadership made a very hard decision. We would divide the Infrastructure Engineering team into three parts: a tooling/platform team, a compliance team, and an interrupt/fire-fighting team. Siloing the interrupt work would allow the other two teams to actually complete the work that was needed by the end of the quarter.
Building a tooling team free of interrupts, and with a focused product (our first iteration of a Platform as a Service, or PaaS), meant we were able to accurately predict and prioritize work in our quarterly PSI planning. We broke down our customer needs and set goals for each iteration on things we wanted to deliver. This brought a renewed sense of passion to the team and we’ve been crushing it ever since.
Now: Why do I feel this is the proper implementation of DevOps, vs. what we were doing before?
Our Developers Are More Efficient
The goal of every engineering organization should be to make its developers more efficient. We did this years ago with the advent of TDD and automated testing. While you may still manually test your applications, the coverage required of your QA personnel is drastically reduced because of the automation I hope you have in place. This makes your entire team more efficient, allowing them to increase their throughput.
Now ask yourself: Are there other areas of automation you wish you had in place that could make your development teams more efficient? This is why you need a tools team. There’s not enough time in the day for your developers to both create the tools they need and crank out features. So you have to decide: which would you rather have?
Your Ops Team Has Other Problems to Solve
Your operations team is a fountain of knowledge that’s been shaped and molded over years of midnight pages and one-too-many weekend alerts. They possess crucial information about the state of your infrastructure, and ideas to make it better. Tap into that knowledge and allow your tooling team to build operation tools that help them automate their day-to-day workflow so they can focus on building you a better system.
Your Ops and Development Teams Should Already Be Communicating
These teams should be focused on real problems: How can I effectively scale my application? Do we have enough bandwidth for a given service? What happens when this service increases the database load? While tooling can solve some of the problems your development teams face, it’s often not enough. Your development teams should be working closely with operations to solve application and system problems that are occurring in your environments. This is the value you want them delivering.
There are probably people on your development and operations teams who are passionate about building these tools. Talk with them and find out what they would do to help your organization, because I bet they go home at night and think about these problems. You should be harnessing this energy, and this is how you get started. Find a few things you can do quickly that will provide immediate value to your teams, and let them work on these. Then watch how the effect cascades and how those tools speed up other areas of your development cycle.