When C-3PO told Han that the possibility of successfully navigating an asteroid field was very small, Han replied, “Never tell me the odds.” He then proceeded to successfully navigate the asteroid field.
While this is of course fiction/fantasy, there could be a lesson in there with regard to designing reliable software systems. We often like to think that just because something has a very small chance of happening, it will never happen and thus never cause any major issues. “This will fail once in a million requests? That will never happen!” But is that really true? Let’s look at an example below.
Playing with numbers
Let’s assume an e-commerce website that takes orders from customers, and let’s assume a failure occurs once every million requests. Pretty small chance, right? Let’s see what this means in the scope of an average website. Let’s assume we have a website that’s pulling a modest 50 requests per second on average. That’s 3,000 requests per minute. 180k requests per hour. 4.32 million requests per day.
Uhm, wait a minute! That thing that was supposed to have a very small chance of happening is actually happening about 4 times per day! Roughly 30 times per week! Around 130 times per month! Nearly 1,600 times per year! Now let’s say that this failure is a failing order, i.e. a customer ordering something on your website and, for whatever reason, not receiving the goods they ordered. All of a sudden this ‘tiny problem that never happens’ means nearly 1,600 additional calls or emails to your support desk every year, each requiring human intervention. And those are not just expensive; they might lead to your customer deciding to go to the competition and spend their hard-earned money there. Oops!
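If you want to play with the numbers yourself, here is a minimal sketch of the arithmetic. The 50 requests per second and the one-in-a-million failure rate come from the example above; the function name, the 30-day month and the 365-day year are just assumptions made for illustration.

```python
# Figures from the example: 50 requests/second, 1 failure per million requests.
REQUESTS_PER_SECOND = 50
FAILURE_RATE = 1 / 1_000_000
SECONDS_PER_DAY = 24 * 60 * 60


def expected_failures(days: float) -> float:
    """Expected number of failures over a window of `days` days."""
    requests = REQUESTS_PER_SECOND * SECONDS_PER_DAY * days
    return requests * FAILURE_RATE


# Window lengths are assumptions: a 30-day month and a 365-day year.
for label, days in [("day", 1), ("week", 7), ("month", 30), ("year", 365)]:
    print(f"expected failures per {label}: {expected_failures(days):.1f}")
```

Multiplying the unrounded 4.32 failures per day gives slightly higher totals than starting from a rounded-down 4 per day. Try plugging in your own traffic and failure rate.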
The lesson: never assume that something won’t happen just because it has a very small chance of happening, or that it won’t cause massive failures (maybe even cascading failures) when it does.
When dealing with large volume, or with small volume over a long period of time, that thing with massive impact that had a really, really small chance of happening? It is practically guaranteed to happen. As a matter of fact, it’s probably happening all the time. Design your systems accordingly!
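To put a rough number on “practically guaranteed”, here is a small sketch of the chance of seeing at least one failure in n requests. It assumes every request fails independently with the same probability p, which is a simplification (real failures often come in bursts); the function name is made up for this example.

```python
import math


def prob_at_least_one_failure(p: float, n: int) -> float:
    """Chance of at least one failure in n requests.

    Assumes each request fails independently with probability p.
    Computes 1 - (1 - p)**n via log1p/expm1, which stays accurate
    when p is tiny and n is huge.
    """
    return -math.expm1(n * math.log1p(-p))


p = 1 / 1_000_000          # one failure per million requests
per_day = 50 * 24 * 60 * 60  # 4.32 million requests at 50 requests/second

for label, n in [("1 million requests", 1_000_000),
                 ("one day", per_day),
                 ("one week", 7 * per_day)]:
    print(f"{label}: {prob_at_least_one_failure(p, n):.4%}")
```

Under these assumptions, a million requests already gives you about a 63% chance of hitting the failure at least once, a single day of traffic puts it at close to 99%, and after a week the difference from 100% is purely academic.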