I love conversations around High Availability. I have often had discussions in my classes and in my consulting engagements around the definition of availability. It is a confusing topic. There are typically several definitions, but the main two are:
- The services must be available when users need to consume them. If the users are not online, it is OK for everything to be down for maintenance, and that time won’t count against the availability numbers. Scheduled maintenance windows are usually exempt as well, since users are told in advance that the systems will be down.
- The services must be available all the time.
I tend to lean towards the first definition. However, I certainly understand that more and more businesses are now 24/7, and even those that are not customer-facing around the clock still have 24/7 needs because of their many automated processes. If you think about it, even backup times are production times. We can’t have systems down during the backup windows or our backups will fail, and we will be exposed to significant risk until we can get a current backup.
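To make the difference between the two definitions concrete, here is a minimal sketch in Python. The function name `availability` and all of the numbers (a 30-day month, 240 minutes of planned maintenance, 10 minutes of unplanned outage) are hypothetical values chosen for illustration, not figures from any real SLA.

```python
def availability(total_minutes, unplanned_down, planned_down=0.0, exempt_planned=False):
    """Return availability as a fraction between 0 and 1.

    exempt_planned=True  -> first definition: planned maintenance is
                            excluded from the measured period entirely.
    exempt_planned=False -> second definition: every minute counts,
                            so planned maintenance is downtime too.
    """
    if exempt_planned:
        measured = total_minutes - planned_down
        down = unplanned_down
    else:
        measured = total_minutes
        down = unplanned_down + planned_down
    return (measured - down) / measured


# Hypothetical month: 30 days, 4 hours of planned maintenance,
# 10 minutes of unplanned outage.
MONTH_MINUTES = 30 * 24 * 60
first_definition = availability(MONTH_MINUTES, 10, 240, exempt_planned=True)
second_definition = availability(MONTH_MINUTES, 10, 240, exempt_planned=False)

print(f"First definition:  {first_definition:.4%}")
print(f"Second definition: {second_definition:.4%}")
```

With the same outage history, the first definition always reports a number at least as high as the second, which is why it matters to agree on the definition before anyone quotes a number of nines.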
Today, I posted on my Facebook account, and the post started the following thread. I removed a couple of comments, but I thought these were perfect:
Russ Kaufmann Based on there being around 30,000 commercial flights per day in the world, if airlines met the 99.999% standard, there would be 109 crashes per year.
Matthew Roche Only if you define “failure” as “crash.”
Russ Kaufmann I define it as “down time” during production (in the air) times.
Matthew Roche It seems to me that a more fair approach would be to consider significant (with this term being defined by the SLA) delays and cancellations as being “down time” in this context, because the service being provided is not available.
Russ Kaufmann Sorry, that is not how we measure availability. Either a system is available, or it isn’t. The services must be available during production hours.
Perhaps we need to focus on there being approximately 5,000 commercial airliners in the air at any one time and run our numbers on them. That would significantly reduce the expected failures at 99.999% availability.
I have never talked to a C-Level officer and said, “Yeah, but the stock trading software was back up and running within the terms of the SLA, so you can’t say there was any unavailability.”
That won’t fly. 🙂
Matthew Roche So can I paraphrase you as saying that the only time a flight is unavailable is when the plane has crashed?
Russ Kaufmann If I am consuming a flight (on board), then yes. After all, anything less isn’t a failure where the service is unavailable.
OK, maybe that is too stringent. How about adding loss of control of the aircraft to the list? I would also add cancellations per your earlier statement, as that would mean that the flight is unavailable.
Matthew Roche There we go. So “if airlines met the 99.999% standard, there would be 109 crashes, cancellations or comparable service interruptions per year.” Is that accurate? Is that what you’re trying to say?
Russ Kaufmann You took all of the fun out of it. LOL
Matthew Roche I took 99.999% of the fun out of it.
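The arithmetic in the thread can be checked with a short Python sketch. The flight and aircraft counts are the rough figures quoted in the comments. The function names are made up for illustration, and treating five nines as either a per-flight failure rate or a fraction of time is an assumption about how to read the numbers, not an aviation standard.

```python
FLIGHTS_PER_DAY = 30_000   # rough figure quoted in the thread
AIRCRAFT_IN_AIR = 5_000    # rough figure quoted in the thread
FIVE_NINES = 0.99999


def expected_failures_per_year(flights_per_day, availability, days=365):
    """Per-flight reading: each flight 'fails' with probability 1 - availability."""
    return flights_per_day * days * (1 - availability)


def unavailable_at_any_instant(aircraft_in_air, availability):
    """Time-based reading: expected number of airborne aircraft 'down' right now."""
    return aircraft_in_air * (1 - availability)


def downtime_minutes_per_year(availability, days=365):
    """Minutes per year that a single service may be down at this availability."""
    return days * 24 * 60 * (1 - availability)


per_flight = expected_failures_per_year(FLIGHTS_PER_DAY, FIVE_NINES)
at_once = unavailable_at_any_instant(AIRCRAFT_IN_AIR, FIVE_NINES)
minutes = downtime_minutes_per_year(FIVE_NINES)

print(f"Failed flights per year (per-flight reading): {per_flight:.1f}")
print(f"Aircraft unavailable at any instant:          {at_once:.2f}")
print(f"Allowed downtime per service per year:        {minutes:.2f} minutes")
```

The per-flight reading gives roughly 109.5 failures per year, which matches the figure of 109 in the opening comment. The time-based reading gives about 0.05 aircraft unavailable at any instant, and a little over five minutes of downtime per year for a single service.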