Episode 1: Creating a roadmap

2nd January 2019

With the introductions of the series out of the way, let’s set some goals for deploying on Friday with confidence. Regardless of your current situation (however great or bad it may be), in the end our goal is something that looks like this:

Main goal

I want to deploy something on Friday and have a weekend without the pager going off.

That’s concrete enough for a high-level goal, but if we want to turn it into something actionable and measurable, we should be more specific. So let’s refine this goal by adding a few sub-goals. Did you also spot the big assumption in it? Keep reading if you didn’t.

Goals

  1. I want to deploy something on Friday.
  2. In case of a problem, I want to be able to detect and fix it before the end of the day.
  3. In case of a problem, and it can’t be fixed before the end of the day, I want to be able to roll back.
  4. In case of a problem, I do not want my entire customer base to be affected by it.
  5. In case of a problem, I do not want any other parts of my organisation affected by it.

The breakdown

Let’s break these goals down one by one. Goal #1 is pretty straightforward, though by itself not very useful. You can probably already do this (but don’t want to because of all the fallout). The more interesting things start to happen at goal #2. There are quite a few things packed into that single sentence:

  • It assumes things will fail.
  • It assumes we’re able to detect a problem on very short notice and act upon it.
  • It assumes we’re able to get a second release out before the end of the day. And maybe more than one if needed.

Goal #3 certainly doesn’t make things easier:

  • It assumes we can roll back to a previous version in a very short time frame. This includes scenarios where there was a database or API change. It also assumes the fix from goal #2 didn’t work out (hint: making a system robust and resilient is about designing it for failure). A minimal sketch of one way to keep the previous version ready to switch back to follows below.
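
To make that a bit more concrete, here is a minimal sketch of the classic releases-directory-plus-symlink pattern for keeping the previous (-1) version around, so that switching back is just flipping a pointer. The paths and helper names are illustrative assumptions, and this only covers the deployment half: database changes still need to be backwards compatible (expand/contract style) for the previous version to keep working against the current schema.

```python
# Sketch only: keep every deployed release on disk and point a 'current'
# symlink at the live one, so rollback is a single atomic pointer flip.
import os
from pathlib import Path

RELEASES = Path("/opt/myapp/releases")   # hypothetical: one subdirectory per deployed version
CURRENT = Path("/opt/myapp/current")     # symlink pointing at the live release


def deploy(version: str) -> None:
    """Point 'current' at a new release directory; the previous one stays on disk."""
    target = RELEASES / version
    if not target.is_dir():
        raise FileNotFoundError(f"release {version} has not been unpacked yet")
    _switch(target)


def rollback() -> None:
    """Flip 'current' back to the most recent release that is not currently live."""
    live = CURRENT.resolve()
    candidates = sorted(
        (d for d in RELEASES.iterdir() if d.is_dir() and d != live),
        key=lambda d: d.stat().st_mtime,
        reverse=True,
    )
    if not candidates:
        raise RuntimeError("no previous release available to roll back to")
    _switch(candidates[0])


def _switch(target: Path) -> None:
    # Build the new symlink next to the old one and rename it into place:
    # the rename is atomic, so there is never a moment without a 'current'.
    tmp = CURRENT.with_name(CURRENT.name + ".tmp")
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()
    os.symlink(target, tmp)
    os.replace(tmp, CURRENT)
```

The point is not this particular script, but the property it gives you: the -1 version is always one atomic operation away.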

Goal #4 doesn’t make things any easier either:

  • It assumes we can deploy, in production, to only a fraction of our customer base, instead of a big-bang scenario. This is not necessarily about preventing issues, but about limiting the blast radius should an issue occur that we didn’t think of. A small sketch of what that routing decision could look like follows below.
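
As an illustration of what ‘only a fraction of the customer base’ could mean in code (the 5% figure, the names and doing this in the application itself are all assumptions; in practice this decision usually lives in a load balancer, service mesh or feature-flag system), here is a sketch of deterministic percentage-based exposure:

```python
# Sketch only: route a deterministic slice of customers to the new version,
# so a bad release only ever hits a fraction of them.
import hashlib

CANARY_PERCENTAGE = 5  # expose the new version to roughly 5% of customers


def bucket(customer_id: str) -> int:
    """Map a customer id deterministically onto a 0-99 bucket."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def serve_canary(customer_id: str) -> bool:
    """The same customer always gets the same answer, so their experience is stable."""
    return bucket(customer_id) < CANARY_PERCENTAGE


if __name__ == "__main__":
    sample = [f"customer-{i}" for i in range(1000)]
    exposed = sum(serve_canary(c) for c in sample)
    print(f"{exposed} of {len(sample)} sample customers would see the canary")
```

Hashing the customer id keeps the decision sticky: the same customer consistently sees either the old or the new version while the canary is running.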

Goal #5 is downright tough:

  • It assumes we can isolate the rest of the chain from any issues we might be causing. Depending on your situation, that could be extremely tricky, but probably not impossible. Again, this is not about preventing issues per se, but about limiting the blast radius if there is an issue that nobody anticipated. A small sketch of the contract-testing idea that helps with this follows below.
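
One technique that helps with exactly this is consumer-driven contract testing, which comes back as topic #5 further down. The sketch below shows only the bare idea in plain Python; the endpoint, field names and payload are hypothetical, and a real setup would more likely use a dedicated tool such as Pact.

```python
# Sketch only: a consumer publishes which fields (and types) it relies on,
# and the provider's test suite checks its response shape against that
# contract before a release goes out.

# Contract as published by a hypothetical consumer of our /orders endpoint.
ORDERS_CONTRACT = {
    "id": str,
    "status": str,
    "total_cents": int,
}


def satisfies(contract: dict, payload: dict) -> list:
    """Return a list of contract violations (an empty list means compatible)."""
    problems = []
    for field, expected_type in contract.items():
        if field not in payload:
            problems.append(f"missing field '{field}'")
        elif not isinstance(payload[field], expected_type):
            problems.append(
                f"field '{field}' is {type(payload[field]).__name__}, "
                f"expected {expected_type.__name__}"
            )
    return problems


def test_orders_response_honours_consumer_contract():
    # In a real pipeline this response would come from the provider service
    # under test; here it is hard-coded to keep the sketch self-contained.
    response = {"id": "42", "status": "shipped", "total_cents": 1999, "extra": True}
    assert satisfies(ORDERS_CONTRACT, response) == []
```

If a change on our side would break a field a consumer depends on, a test like this fails in our pipeline instead of in their production.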

In any case, if we rewrite these goals into something more actionable, we end up with the following:

  1. Decrease our time-to-production for a single release to the point where we can deploy multiple times per day.
  2. Improve our ability to detect any issues with our software or the environment it runs on.
  3. Create a rollback capability so we can always go to the -1 version, regardless of database and/or API changes.
  4. Create the ability to roll out our software in production in a gradual way, rather than big-bang.
  5. Limit/mitigate the impact of our failures on systems that depend on our application.

Things will fail due to changes not initiated by us

Up until now, we’ve only really talked about deploying our own software. Our main goal even specifically mentions not having the pager go off after we deploy something. Remember that big assumption I mentioned at the beginning? Could things fail without us deploying anything? Can the pager go off even if we don’t deploy anything? Yes it can! And if you run in a fairly complex environment, chances are that something failing because of a change initiated outside your team is more likely than something failing because of your own deployment. So we’ll add this to our set of goals.

Things will fail even when nobody changes anything

So what if one of our dependencies deploys something and breaks our application? What if there’s an OS patching exercise over the weekend and our application no longer starts, or becomes non-functional in some other way? What if nothing is deployed by anyone at all and something still fails and causes an outage (e.g. a memory or other resource leak)? Can we protect against that? Yes, in a lot of cases we can! And in the cases where we can’t, we should at least get a proper alert.

Assume everything will fail at some point

Making your own deployments boring and a non-event is only half of the puzzle; you also need to isolate yourself from failures of your dependencies. Note that the word used here is ‘dependencies’ and not ‘applications’: networks can fail, hardware can fail, and patches and updates to said networks and hardware can fail. Everything in your architecture can fail quite hard without anybody deploying anything, and it’s not necessarily always software. So not deploying on Friday is definitely not a guarantee that nothing will fail and you won’t get a call in the middle of Saturday night. We need to guard against failures not by designing against them, but by designing for them.

Things will fail in ways you never imagined

Things will fail. Things will fail in the most spectacular ways possible. Things will fail in ways you could never have imagined. That’s why observability is such an important topic, as is having a defensive, robust and resilient architecture. If something fails, we want to know as quickly as possible and we want the blast radius to be limited, so that even if there is an issue we didn’t think of, the impact stays small.

Back to the roadmap. Our complete (at least for now) roadmap of goals for deploying on Friday with confidence looks like the list below.

Revised set of goals

  1. Decrease our time-to-production for a single release to the point where we can deploy multiple times per day.
  2. Improve our ability to detect any issues with our software or the environment it runs on.
  3. Create a rollback capability so we can always go to the -1 version, regardless of database and/or API changes.
  4. Create the ability to roll out our software in production in a gradual way, rather than big-bang.
  5. Limit/mitigate the impact of our failures on systems that depend on our application.
  6. Limit/mitigate the impact of other systems’ failures on our application.

Now let’s add some actions to each of our goals so we can complete the roadmap and really kick off this blog series! For the sake of clarity I’ll give each action/topic a clear name and description.

Topics/actions per goal

  1. Time-to-production: Reduce the complexity and size of our releases. Automate the entire CI/CD pipeline as much as possible. Eliminate waste and wait times in the pipeline.
  2. Observability: Define key metrics and KPIs for every feature you release. Implement these metrics in a monitoring solution of your choice and have a visible dashboard for them. Set up automated notification and alerting.
  3. Rollbacks: Implement fully automated rollback testing in your CI/CD pipeline to verify backwards compatibility of database and API changes.
  4. Canary releases: Implement fully automated canary releases for your application and test these in your CI/CD pipeline.
  5. Consumer-driven contract testing: Implement fully automated contract-based testing in your CI/CD pipeline, both upstream and downstream. Explicitly cover failure and no-response scenarios in these tests.
  6. Resilience: Implement circuit breakers and/or bulkheads on every dependency. Define fallbacks for every outgoing call/message. Assume everything will fail at some point. Automate the testing of failure scenarios. A minimal circuit-breaker sketch follows this list.
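
To make topic #6 a bit more tangible, here is a minimal hand-rolled circuit breaker with a fallback. It is a sketch only: the thresholds, names and the recommendations example are made up, and in a real system you would typically use an existing resilience library rather than rolling your own.

```python
# Sketch only: open the circuit after N consecutive failures, answer from a
# fallback while it is open, and allow a trial call after a cool-down period.
import time


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        # While the circuit is open, skip the dependency entirely and answer
        # from the fallback, until the cool-down allows a trial call again.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()
            self.opened_at = None  # half-open: let one call through

        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()

        self.failures = 0  # a success closes the circuit again
        return result


# Usage sketch: wrap every outgoing call and always define a fallback.
recommendations_breaker = CircuitBreaker()


def fetch_recommendations(customer_id: str) -> list:
    def remote_call() -> list:
        # Stand-in for a real HTTP/RPC call to a hypothetical recommendations service.
        raise TimeoutError("pretend the recommendations service is down")

    # Fallback: an empty list keeps the page rendering instead of erroring out.
    return recommendations_breaker.call(remote_call, fallback=lambda: [])


if __name__ == "__main__":
    print(fetch_recommendations("customer-1"))  # -> [] instead of an exception
```

The important part is the habit it encodes: every outgoing call has a fallback, and a misbehaving dependency gets cut off instead of dragging our application down with it.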

Let me start by saying that there is no right or wrong order in which to work through these topics. Depending on your situation, one topic might be more important to pick up than another, and it might be different again for another team or organisation. Also note that many of these topics are not just about throwing technology at a problem; they also involve changes in the way of working, team composition and culture. This blog series will focus mostly on the technology side, but I’ll always provide context about the other areas, and in many instances they deserve a post of their own.

My problem is not on this list

Last but not least: this is not an exhaustive list. Depending on your situation, some of these topics may not apply at all, and there may be topics missing from this list that are more important for you to solve right now. However, increasing your development speed while also improving resilience and observability should put anyone in a better position than they are in today.

Up next

For the next post I’ll start with #2: Observability. It’s important to start by getting visibility into how your application is behaving in production as it is today. You can’t improve from situation A to situation B if you don’t know where situation A is, or whether you’re moving in the right direction. Having KPIs, metrics, SLIs and SLOs in place is essential to being able to improve. We’ll talk about theory, process and technology, with a special focus on specific solutions and code. To give you a taste, a tiny SLI/SLO example follows below.
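
As a small teaser of the kind of numbers involved (all figures below are made up purely for illustration), this is roughly what an availability SLI, its SLO and the remaining error budget boil down to:

```python
# Sketch only: an availability SLI measured against an SLO, and the error
# budget that is left over a given period. The numbers are invented.
TOTAL_REQUESTS = 1_000_000
FAILED_REQUESTS = 420
SLO_TARGET = 0.999  # 99.9% of requests should succeed

sli = (TOTAL_REQUESTS - FAILED_REQUESTS) / TOTAL_REQUESTS      # 0.99958
allowed_failures = TOTAL_REQUESTS * (1 - SLO_TARGET)           # 1000 failed requests allowed
error_budget_left = 1 - FAILED_REQUESTS / allowed_failures     # 58% of the budget remains

print(f"SLI: {sli:.4%}  (target {SLO_TARGET:.1%})")
print(f"Error budget remaining: {error_budget_left:.0%}")
```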