Unbreakable or Mr. Glass?
The year 2000 saw the release of M. Night Shyamalan’s movie Unbreakable, featuring Bruce Willis as the protagonist (David Dunn, a.k.a. Mr. Unbreakable) and Samuel L. Jackson as the antagonist (Elijah Price, a.k.a. Mr. Glass). In this movie, David possesses superhuman powers that make him almost indestructible, while Price is the exact opposite, with bones that fracture at even the most delicate touch.
Design for failure
In software or systems design, more often than not our systems are like Mr. Glass: fragile (easy to break). Systems are often designed assuming that the design was correct and everything will work as expected. But reality is different. It’s not a matter of if but of when things will fail. In order for us to design systems that are resilient (hard to break) we need to embrace the fact that failure will occur and design for failure, not against it. After all, we don’t want our systems to break after we deploy on Friday, do we? We want them to stay up and even when there are issues they should be of a priority that can wait until Monday.
A story of massive failure
The best way to describe fragility is with a real-world scenario of massive failure. This is not some hypothetical made-up scenario; this scenario actually played out in a complex enterprise environment a couple of years ago and gives a good understanding of what it means for an ecosystem to be fragile and what it means when issues have a very large blast radius.
The failing architecture
Let’s take a look at the high level architecture diagram below for an e-commerce shop setup with multiple channels and points of sale.
There’s the web shop, a typical JEE web application that is used to sell products and services. This JEE application fetches and updates customer information from a customer service and creates and updates orders through an order service. The same setup is duplicated for the physical shop (an actual store, yes, these still exist!). The order service ultimately calls a delivery service to have the physical goods delivered to the customer’s delivery address. Customer support also relies on the same services, in case customers decide to call customer support rather than visit the website or the physical store to place an order. All connections in this diagram are HTTP (either REST or SOAP) connections, except the connections to the delivery service, which is a file export/import. Note that this is just a slice of the overall application landscape; in reality there were many more connections and channels, but for the sake of simplicity and readability these are omitted.
Looks simple enough, right? What could possibly go massively wrong? Let’s take a look at the chain of events.
Chain of events
One day at 9:00 in the morning, a new product was launched in the web shop and customers were lining up to buy it. This caused the average load to increase to 50x the normal load. The web shop by itself was able to handle this load, but then something unexpected happened. Not only did the web shop become completely unavailable, a large number of other applications in the ecosystem started to fail as well, leading to massive downtime, customers not being able to order and employees not being able to do their work. So what happened?
Let’s take a look at what went wrong. In the webshop, for every incoming request (1-thread-per-request model) a connection was opened to the customer service (no thread pools or semaphores in place). So incoming requests were transferred from the webshop to the customer service at a 1:1 ratio. The customer service was not designed to handle an online load like that (it was designed to be an internal system and dimensioned accordingly), and it became unresponsive. Combined with the lack of timeout settings on the side of the webshop, this led to two problems.
Because the webshop was opening connections to the customer service and those connections were waiting indefinitely for a response that never came, the number of available threads for handling customer requests in the webshop was exhausted fairly quickly (all the threads were consumed, waiting on a response from the customer service). This led to unavailability not just of the order flow, but of the entire webshop (all the CMS pages were served from the same application). So along with the customer service, the webshop also went down completely and became entirely unresponsive.
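Missing timeouts were central to this failure: a thread that waits forever is a thread lost. As a minimal sketch of the fix (using the JDK’s built-in HttpClient rather than the JEE stack from the story, and with a made-up URL), every outgoing call gets both a connect timeout and a response deadline, so a stalled downstream system costs a few seconds instead of a thread:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

public class TimeoutConfig {

    // Never open a connection without a bound on how long we are willing to wait for it.
    public static HttpClient client() {
        return HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2)) // fail fast if the service won't even accept the connection
                .build();
    }

    // Each request additionally gets a response deadline.
    public static HttpRequest request(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(5)) // give up instead of waiting indefinitely for a response
                .GET()
                .build();
    }

    public static void main(String[] args) {
        System.out.println(client().connectTimeout().orElseThrow());
        System.out.println(request("http://customer-service.internal/customer/42").timeout().orElseThrow());
    }
}
```

The exact values are illustrative; the point is that with a deadline in place, a dead customer service produces fast failures in the webshop instead of a slowly filling thread pool.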
But the pain did not end there. Because the customer service was down, every other application that depended on it started failing as well. And none of these applications had any safeguards either. So people trying to use the mobile app because the web shop was down: failure. People calling customer support to order there: failure. People in the physical stores trying to order: failure. In the end, the entire order process in every channel went down. And not just the order process: with the customer service being a critical dependency for a lot of other processes, every single customer-facing process went down. All of them.
Note: the order service and delivery service stayed up. Not because these were architected in a better way, but simply because nobody was able to progress far enough into the order flow to actually place an order. And no orders = no revenue. Ouch!
Because this was a high-end product with a very high customer value, this meant several million in missed revenue within an hour. Not only that, but the product was subscription based and competitors had similar offerings. So the customers lost were lost to a competitor, probably forever. On top of that, the disturbance caused multiple hours of outage on about 20-25 systems in total: not just the order flow, but also CRM, ERP, reporting and all kinds of other processes. Everything just came to a full stop.
Now that’s the kind of outage that gets both the CEO and the CIO standing at your desk in person. Imagine the panic. Imagine this being a Friday morning (it actually was!). So, could this mess have been averted? With hindsight the answer is of course always yes. In reality these massive failures are never the result of just one thing going wrong in an organisation; there are multiple causes piled on top of each other in a perfect storm. So there’s no point in pointing fingers; the only thing to do is analyse what really went wrong and then fix it. And that’s the hard part, because the issue kind of fixed itself after the load went back to normal and the systems were restarted. So all was good now, right? Yes. Until the next time, because nothing was really fixed, so the problem will happen again and again until the system is made less fragile by adapting every single component in it. Luckily there are a few patterns that can help with this. They won’t prevent localised problems, but they will prevent those problems from spreading throughout your whole ecosystem.
Limiting blast radius
The goal here is to make sure that problems that occur local to an application stay local. We want to limit the blast radius of a problem. One application going down is bad enough, but if that application in turn pulls down your entire (or part of your) ecosystem then that’s unacceptable. In our massive failure example one application took down the entire ecosystem which means that the blast radius was massive. Like an explosion going off it caused a shockwave that progressed through the ecosystem, toppling systems one by one. It would be nice if we could somehow stop the shockwave in its tracks and not impact the rest of the ecosystem.
Two patterns come to the rescue: the bulkhead and the circuit breaker. The first prevents your application’s downtime from affecting other applications; the second prevents other applications’ downtime from affecting yours.
Bulkhead, real world
What’s a bulkhead then? The term originates from ship construction. Wikipedia gives the following definition.
The hull of a ship is compartmentalized using upright walls called bulkheads to reduce floodability.
Imagine the ship without these walls. If the front of the ship gets punctured (by, let’s say, an iceberg) the entire ship will flood, and the ship, cargo and any passengers will all be lost. Not good. But with bulkheads in place, only one compartment will be flooded. Whatever is inside that compartment is of course lost, but the rest of the ship is safe. It can continue its journey and most of the cargo and passengers will be spared. So the problem doesn’t go away, but it’s confined to only a small part of the system.
Bulkhead, in software
In software, a bulkhead acts in pretty much the same way. It’s a virtual wall erected between components so that the downtime of one component will not affect other components. If there had been a bulkhead between the webshop and the customer service in our massive failure story above, the customer service would not have died, because the load would have been isolated inside the webshop.
So what does that mean from a technical perspective? Let’s assume our webshop is running on Tomcat and has a connection pool of 250 connections. Our customer service can handle 10 connections before failing. In the unguarded situation, we would open up a connection for every incoming thread, which would kill the customer service whenever there are more than 10 concurrent incoming connections. But if we put a thread pool or semaphore in between with the ability to handle 10 requests, then we can prevent the system from dying.
Note: this will of course result in errors for any incoming requests that exceed the limit of 10! Once the pool or the semaphore has reached its limit, it will simply fail. Is this great? Maybe not, but now you’re still able to serve requests up to the limits of your ecosystem’s capacity while keeping that ecosystem alive. Without a bulkhead this would have immediately led to unavailability.
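To make the mechanism concrete, here is a minimal semaphore-based bulkhead in plain Java. This is a sketch of the idea only, not the resilience4j implementation, and the class and method names are invented for illustration:

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

public class SemaphoreBulkhead {

    private final Semaphore permits;

    public SemaphoreBulkhead(int maxConcurrentCalls) {
        // One permit per allowed concurrent downstream call.
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    // Run the downstream call only if a permit is free; otherwise fail fast with the fallback.
    public <T> T execute(Supplier<T> call, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get(); // bulkhead full: reject immediately instead of queueing up threads
        }
        try {
            return call.get();
        } finally {
            permits.release(); // always hand the permit back, even if the call throws
        }
    }

    public static void main(String[] args) {
        // Mirror the scenario above: at most 10 concurrent calls reach the customer service.
        SemaphoreBulkhead bulkhead = new SemaphoreBulkhead(10);
        String result = bulkhead.execute(
                () -> "customer data",                      // the guarded downstream call
                () -> "fallback: let the user type it in"); // the fail-fast path when the bulkhead is full
        System.out.println(result);
    }
}
```

With this in front of the customer service call, the 250 Tomcat threads can never push more than 10 concurrent requests downstream; request number 11 gets the fallback immediately instead of a hung connection.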
What’s also great is that this opens up the possibility of detecting the failure and implementing a backup path or a nice error message. E.g. in the webshop scenario: when fetching customer information failed, we let the user enter the details themselves instead of having them prefilled. The resulting order of course needed extra checking, so we flagged it so it would drop into a different queue. So instead of simply failing we provided a different (though less optimal) path (maybe with a bit of manual work to check orders, but at least we captured the orders, and that means revenue!).
To summarise: bulkheads allow you to isolate issues in individual components and prevent them from spreading to other applications. They do not prevent any additional problems, but they do allow you to implement secondary or error paths without sinking your entire ecosystem (fail fast, don’t fail slow while blocking resources). This makes your systems more resilient (and robust).
Circuit breaker, real world
A circuit breaker is another term from the physical world, more specifically electronics. Let’s go to Wikipedia again for a real-world definition.
A circuit breaker is an automatically operated electrical switch designed to protect an electrical circuit from damage caused by excess current from an overload or short circuit. Its basic function is to interrupt current flow after a fault is detected. Unlike a fuse, which operates once and then must be replaced, a circuit breaker can be reset (either manually or automatically) to resume normal operation. — Wikipedia
You probably have several circuit breakers inside the junction box in your house. If any device in your house short-circuits or overloads, a circuit breaker will flip and interrupt the current, thus preventing human injury and possible fires. An example: last year’s summer in the Netherlands was very dry, with almost no rain. I had recently moved into a new house with a few sockets outside in the yard. One of them was not waterproof, but I did not know that until the rain started again in the fall. A loud POP! and the power went out. I checked the junction box and noticed one of the circuit breakers had flipped. After correcting the issue (I got rid of the faulty socket) I flipped the switch back on and all was well. Without a circuit breaker this could have led to fire, injury or worse.
Circuit breaker, in software
Physical circuit breakers keep electrical problems small and prevent them from escalating and burning down your house or killing people. Which is awesome, but how does this work in software? In software a circuit breaker can be used to interrupt a flow between components.
In our massive failure example the customer service died and as a result the customer support application started to fail as well (as did several other applications). With a circuit breaker in place, the customer support application would have been able to detect that the downstream system was having issues and interrupt the flow. This would most likely lead to reduced availability and maybe even errors, but the system would not go down, opening up the possibility of choosing an alternative path. E.g. maybe it’s possible to fetch customer information from some other system or from a local cache. In our massive failure scenario there were other sources of customer data available (of varying quality), as well as the ability to re-enter the customer details in the customer support application and postpone order validation until the systems came back up. This would not have been a 100% up scenario, but it would have allowed capturing the order (and thus revenue!) and deferring processing to a more appropriate time.
So how does that work in software? Every outgoing request is wrapped and measured on whether it was successful or not. Successful in this case means a 2xx HTTP response, delivered on time (based on a timeout). If the number of failing requests over a certain time frame exceeds a certain threshold, the circuit breaker trips and either an error is thrown or an alternative path is selected. The circuit will remain open for a certain amount of time (because it makes no sense whatsoever to keep sending requests to a system that’s already in trouble!) and will then close, allowing requests again. Then the cycle repeats: if the downstream system is still down, the circuit will open again, and so on.
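As a rough illustration of that cycle, here is a hand-rolled, deliberately simplified circuit breaker in plain Java. This is a sketch of the mechanism only, not how resilience4j implements it (the real thing adds a half-open state, sliding windows, and much more), and all names are made up; time is passed in as a parameter purely to keep the sketch easy to test:

```java
public class SimpleCircuitBreaker {

    private final int windowSize;           // number of calls to measure over
    private final double failureThreshold;  // fraction of failures that trips the breaker, e.g. 0.5 = 50%
    private final long openDurationMillis;  // how long to stay open (blocking) before letting traffic retry

    private int calls = 0;
    private int failures = 0;
    private long openedAt = -1;             // -1 means the circuit is closed (requests flow normally)

    public SimpleCircuitBreaker(int windowSize, double failureThreshold, long openDurationMillis) {
        this.windowSize = windowSize;
        this.failureThreshold = failureThreshold;
        this.openDurationMillis = openDurationMillis;
    }

    // Ask before every outgoing request whether it may proceed.
    public synchronized boolean allowRequest(long now) {
        if (openedAt >= 0) {
            if (now - openedAt < openDurationMillis) {
                return false; // circuit is open: fail fast, don't bother the struggling downstream system
            }
            openedAt = -1;    // open duration elapsed: close the circuit and start a fresh window
            calls = 0;
            failures = 0;
        }
        return true;
    }

    // Record the outcome of each request (success = 2xx response, within the timeout).
    public synchronized void record(boolean success, long now) {
        calls++;
        if (!success) {
            failures++;
        }
        if (calls >= windowSize && (double) failures / calls >= failureThreshold) {
            openedAt = now; // too many failures in the window: trip the breaker
        }
    }
}
```

For example, with `new SimpleCircuitBreaker(4, 0.5, 20_000)`, three failures out of four calls trip the breaker; for the next 20 seconds every request fails fast, after which traffic is allowed through again.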
Circuit breaker, summary
To summarize: a circuit breaker allows you to insulate yourself from downstream issues and prevents other systems from bringing your application down. Like bulkheads, circuit breakers do not prevent or fix any additional issues, but they do give you the ability to provide an alternative path or a decent error message (again: fail fast, don’t fail slow while blocking resources).
Putting it all together
It’s perfectly possible to combine the bulkhead and circuit breaker patterns. Hystrix does this, for instance, and it’s also possible with resilience4j. This makes it possible to have two-way protection on every single one of your downstream calls. Implementing this on all your downstream calls will greatly increase both the robustness and resilience of your applications, significantly reducing the chance of you being paged on Saturday because you deployed something on Friday!
That’s enough theory for now, let’s get to some code. In this example there are two Spring Boot applications: one that provides a REST interface (let’s call this one the provider) and one that consumes it (let’s call this the consumer). The consumer will use the excellent resilience4j libraries to implement the bulkhead and circuit breaker. The source code can be found on GitHub.
In the provider there are three methods implemented in the RestController:
- Method with normal operation
- Method that is slow
- Method that throws an error
The slow method is deliberately slowed down so we can more easily test the bulkhead. The error method simply returns an HTTP 500. This method is used to test the circuit breaker. The application will run at port 8081.
In the createBulkHead(..) method a custom BulkheadConfig is built using a maximum number of concurrent calls of 1. This makes it very easy for us to test the bulkhead locally with a small number of connections. In your production environment this number will be different (at least, I hope so). With the BulkheadConfig a Bulkhead is created, some logging methods are added to its EventPublisher, and it is then stored as an instance variable.
In the createCircuitBreaker(..) method a similar pattern is followed. A CircuitBreakerConfig is created with a failure threshold of 50% and a duration of 20 seconds for the open state. Again, these numbers will probably be different in your production environment, but they make it really easy to test locally. With the CircuitBreakerConfig a CircuitBreaker is created, some logging methods are added to its EventPublisher, and it is then stored as an instance variable.
The RestController gets three methods. The okay() method simply calls a RestTemplate, unchecked whatsoever.
The bulkhead() method calls the /slow method of the provider via the Bulkhead that was created during construction. It does so by decorating the RestTemplate call. The result is a function that is passed to Try.of(..). This will either succeed, in which case the expected result is returned (“The message was ” concatenated with whatever was returned by the RestTemplate call), or it will fail with an exception, in which case the text “This is a bulkhead fallback” will be returned. As you can see this is fairly straightforward configuration. It’s not an annotation like Hystrix, but it’s very clear what’s going on here: no hidden thread pools (resilience4j uses semaphores) or other under-the-hood things to configure or take into account, just plain code.
The circuitbreaker() method calls the provider via the CircuitBreaker that was created during construction. The pattern followed here is exactly the same as in the bulkhead() call: decorate your RestTemplate call, then use the .recover(..) method of Try.of(..) to configure a fallback.
Testing the bulkhead
With the code set up, it’s time for some testing. My weapon of choice is good old ApacheBench, a.k.a. ab (or, in the case of https, abs). This is a command-line tool, like curl, that allows us to specify a number of requests but also a concurrency level, which is just the thing we need to test a bulkhead.
First let’s test with a concurrency level of 1. We expect every single request in this case to succeed, since this is the maximum concurrency level we’ve configured in our bulkhead. We’ll send ten requests. The command to execute is ‘ab -c1 -n10 http://localhost:8080/bulkhead‘. When checking the stdout of the consumer application, we see that all requests are being permitted.
Now let’s try with a concurrency of ten. We execute the command ‘ab -c10 -n10 http://localhost:8080/bulkhead‘. Now we see that actually 2 requests are permitted and 8 are rejected.
I’m not 100% sure why it’s 2 instead of 1, but all requests are fired within the scope of 1 second (the minimum duration of the /slow call), and if we increase the maximum concurrency to 2 it will permit 3, when we set it to 5 it will permit 6, and so on. So I assume this is either a bug, intended behaviour or a mistake on my end. For the example it doesn’t really matter.
It’s obvious that adding the bulkhead is preventing our application from overloading the system it’s calling and that it executes the fallback in case the threshold is crossed. You can check output visually by sending a large number of requests with ab, e.g. ‘ab -c10 -n10000 http://localhost:8080/bulkhead‘ and then open up the same URL in a browser while it’s running: you’ll see the fallback message.
When we send the same request when ab is not running, we get the correct message again, demonstrating that the bulkhead is working correctly.
Testing the circuit breaker
We want to verify a couple of things in this case. Notice that the /circuitbreaker URI on the consumer has a flag. If shouldFail is set to true the ‘/error‘ URI on the provider is called, otherwise it will call ‘/‘. This allows us to open/close the circuit breaker by using the flag.
In this case we’ll want to verify that the circuit breaker opens when there are a lot of errors, stays open for 20 seconds, and then closes again if requests are succeeding (otherwise it opens right back up). I’m disregarding the half-open state in this case, but feel free to read up on how it works in the documentation (links at the bottom).
First we’ll send a bunch of requests with ‘shouldFail‘ set to true. This should open up the circuit breaker as we can see by executing ‘ab -c1 -n100 http://localhost:8080/circuitbreaker?shouldFail=true‘ This will yield output such as below.
All calls are failing, and once the threshold has been passed, the circuit breaker will go into an open state and calls will be denied instead and execution goes into the fallback. See the output below.
Note that if we now directly execute ‘ab -c1 -n10 http://localhost:8080/circuitbreaker?shouldFail=false‘ (with shouldFail set to false), but within the time limit of the open circuit duration (remember, we set this to 20 seconds) we still get the denied response. Since the circuit breaker is in an open state, it will not allow any requests to go through. If we wait until enough time has passed and we try again, we get the output below.
Again, if you run these tests and open up a browser window alongside with the working URI (http://localhost:8080/circuitbreaker?shouldFail=false) you can visually verify that when the circuit breaker is closed you’ll see the correct message and when the circuit breaker is open you’ll get the fallback message.
This demonstrates that our circuit breaker is working properly. As stated, in a real-life situation you’ll use different settings and numbers, and you’ll always need to fine-tune your settings to your specific situation in production; out-of-the-box numbers might work against you and not help at all!
There’s a lot to digest here. In short, we demonstrated how bulkheads and circuit breakers can isolate individual components in an ecosystem and thus limit blast radius of issues to single components rather than blowing out half your ecosystem. Maybe it will not completely prevent you from being paged on Saturday, but at least it won’t be by the CEO. 😉
However, there are some caveats. Bulkheads, while simpler than circuit breakers, only limit the load from a single application to another one. But what if we scale instances? Or if there are multiple applications using that service? How do you determine values for your concurrency levels then? There are ways to solve this, but that’s for another article. Fortunately, libraries like resilience4j make monitoring easy (especially when combined with Spring Boot), so you can always monitor, test and refine your settings. I also recommend reading the Hystrix documentation, as it has a lot of good pointers.
Further reading material
Feedback is always appreciated, either by commenting here or on Twitter or LinkedIn! This was a fairly large article with lots of information, so feel free to hit me up in case of questions or when I made massive failures in the article/code.