Welcome to 2019, and welcome to my new website: deployonfriday.com!
So what’s this all about then?
Last year, I noticed the increased popularity of the hashtag #NoDeployFridays on Twitter. Apparently there are a lot of developers out there who are afraid to deploy their software at the end of the work week, or simply don’t feel confident or empowered enough to do so.
And probably with good reason too! Nobody wants to spend the weekend fixing issues when they could be doing something more useful (a.k.a. anything other than work) with their free time, even when they’re on-call.
And guess what: I totally agree with that! If your organisation is not capable of delivering working software (we’ll get to what that means later on in this series) continuously, you should not be deploying on Friday.
But wait, isn’t this site called ‘Deploy on Friday’?
While I agree that you should not be releasing software when you really shouldn’t, I do not agree with the proposed approach of ‘well, let’s just stop delivering software altogether then’. Here are a few reasons why:
- Saying that you are able to deploy 0% of your software on Friday is a capitulation. It’s essentially saying that 100% of your software is not good enough for release. I can see some cases where it’s too risky, but 0%? Not even 25%? Or 10%? Is your software really that bad?
- It’s a sliding scale. First you’ll stop deploying on Friday, but if you don’t improve your culture, processes and technology, your quality will simply slide further, and some time after that you won’t be deploying on Thursday either. And so on and so on. You’ll end up with the infamous Patch Tuesday strategy (or worse). Not exactly continuous delivery or continuous deployment, is it? Shouldn’t we be going the opposite way?
- It should not take 24 hours or longer after a release to know if it’s causing massive (we’ll get to what ‘massive’ means later on as well) issues or not. Why can’t problems be spotted straight away or soon after? And maybe on a limited set of customers instead of all of them?
- It should not take several hours (or more) of coordinated manual work to roll back a release or provide a fix if needed. Do conference calls with 50 people across continents and time zones sound like an efficient way of working?
- It should not be an on-call person’s job to complete missing features, add new ones or rewrite architecture to be more robust. The only goal is to get availability back up to acceptable levels (again, we’ll need to define what’s acceptable). The rest can wait until the next business day.
- You should not be afraid of your own work! You should be proud of it and confident it works (or if it doesn’t, you can fix it quickly)! A deploy should be boring, a non-event.
- And last, but certainly not least: not deploying yourself does not mean you don’t get called. Any hard dependency you have that fails will cause the pager to buzz. So other people deploying/patching on Friday will screw up your weekend too unless you do something about it. And no, the solution is not to halt the entire IT department on Friday.
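To make the ‘limited set of customers’ idea from the list above concrete: that’s essentially what a canary release does. Here’s a minimal, hypothetical sketch in Python (the hashing scheme, user id and percentages are illustrative, not a prescription) of deterministically routing a small slice of users to a new version:

```python
import hashlib

def in_canary(user_id: str, canary_percent: int) -> bool:
    """Deterministically bucket a user: the same user always lands in
    the same bucket, so the canary group stays stable across requests."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < canary_percent

# Send 10% of users to the new version; everyone else stays on stable.
version = "v2-canary" if in_canary("user-42", 10) else "v1-stable"
```

If the canary misbehaves, only that slice of users is affected, and you dial the percentage back to zero instead of rolling back the world.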
While this list gives us a good start to identify what is wrong today (symptoms), it doesn’t tell us the reasons why just yet. And we really need to know and understand the reasons why before we can start fixing some of the problems. It’s tempting to think that we can fix all of these by throwing some technology at it, but not all of these problems are technical in nature, and throwing technology at a non-technical problem fixes absolutely nothing. If the problem is in your process you need to fix your process, if it’s in your culture you need to fix the culture, and so on.
One very big disclaimer: even after implementing everything that’s going to be mentioned in this series, you still won’t be able to deploy 100% of your software on Friday. But it’s not going to be 0% either. As a matter of fact it’s going to be a lot closer to 100% than it’s going to be to 0% so it’s going to be a massive improvement. It’s not a binary thing, especially when dealing with legacy software with a very long history.
So what are some of the issues then?
Let’s take a look at some of the reasons why your releases are giving you headaches. Keep in mind that depending on your own context and organisation some of these might not apply (and there might be additional ones that aren’t covered here). Any of these sound familiar?
- Your software and releases are big and/or complex.
- Your software development lifecycle is long.
- Your software has many dependencies and no fall-back if one of them breaks, causing an immediate outage.
- Your software causes outages in its dependencies.
- The platform you deploy onto has frequent outages.
- You work with or rely on outdated technology.
- There’s a lot of technical debt.
- There are no KPIs or metrics in place to measure if your software is working in production.
- There are no KPIs or metrics in place to measure the effectiveness of features.
- A large part of your deployment process is manual, or waiting time.
- There are no automated rollbacks.
- There is no testing of backwards compatibility of contracts, APIs or database changes.
- There is testing, but it is mostly manual.
- There’s CI, but no CD.
- Outages are never evaluated.
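A couple of these points (production metrics, automated rollbacks) combine naturally: if a release is watched by a metric, the rollback decision itself can be automated. A hedged sketch, assuming your monitoring system can be polled for an error rate (the 5% threshold and the `samples` feed are illustrative):

```python
def watch_release(samples, threshold=0.05):
    """Decide whether to keep a release based on a stream of error-rate
    samples polled from monitoring. Crosses the threshold once -> roll back."""
    for rate in samples:
        if rate > threshold:
            return "rollback"  # hand off to your automated rollback here
    return "keep"

# Healthy release: the error rate stays low during the observation window.
print(watch_release([0.01, 0.02, 0.01]))  # -> keep
# Bad release: the error rate spikes, so we roll back automatically.
print(watch_release([0.01, 0.20]))        # -> rollback
```

No 50-person conference call needed; the decision is made in seconds, on data.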
- Your team is not self-sufficient/empowered.
- Your team spends a lot of time waiting for other teams.
- Your team has no (or limited) say in the contents of the backlog.
- Technical and/or functional decisions are made by people outside the team, and those people are not on-call either.
- Developers are not considered to be product stakeholders/owners.
- You don’t speak to customers on a regular basis.
- Your organisation works based on projects, not products.
- Your organisation punishes failures.
- Focus is on individual tasks rather than coordinated team efforts.
Whoa, that’s a lot to chew on!
Yes it is! But as we like to say at the company I currently work for: you don’t eat an elephant in one sitting! So let’s start by changing one small thing at a time. Big-bang scenarios rarely work, so let’s actively avoid those. Let’s also not focus on things that are not within our span of control, at least not right away. Focus on the things that you could implement by yourself and in your team today if you wanted to.
So how does this site help?
In every post, we’ll cover one real-world topic/problem with getting software into (and out of!) production quickly, reliably and with quality. We’ll dive into the why of each problem first and then state what’s necessary to fix it, as well as how to measure success (or failure). We’ll focus mostly on the developer side of things, though we’ll touch on some of the hard bits in the process/culture department as well.
We are going to cover topics such as observability and monitoring, rollback capabilities, API and database compatibility checks, canary releases, automated (contract) testing, fully (or almost fully) automated CI/CD, circuit breakers, 12-factor apps, design for failure, etc.
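To give a small taste of one of those topics: a circuit breaker stops you from hammering a dependency that’s already down, which is exactly the ‘hard dependency fails and your pager buzzes’ scenario from earlier. A minimal sketch (the thresholds are illustrative; real libraries such as resilience4j or Polly offer more states and configuration):

```python
import time

class CircuitBreaker:
    """After `max_failures` consecutive failures the circuit 'opens' and
    calls fail fast for `reset_after` seconds instead of hitting the
    broken dependency again."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```

The point is not this particular implementation, but the behaviour: your service degrades gracefully (and pages less) instead of queueing up doomed calls.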
Short story: we’re going to make things small, measurable, robust, resilient and with limited blast radius. Both for greenfield and existing applications. Pretty soon you will feel confident enough to deploy at least something on Friday. It can be as small and as insignificant as you want. As long as it’s not nothing. Because delivering nothing is not acceptable.
First stop: monitoring and observability. Because if we don’t know how our software is doing in production, there’s a good chance we’ll screw up our weekend when deploying on Friday, and that’s not what we want! We want to #DeployOnFriday with confidence! (Or any other day/time of the week for that matter)