Every CIO/CTO I have spoken to over the past 12 months wants 100% availability. Why? Because they want to sleep like a baby every night, knowing that their systems will continue to operate even through inevitable system failures. They also want their teams to be able to focus on making functional improvements rather than wasting time firefighting the latest issue.
Why is it that my Linux server at home, with one power supply, one network interface and one hard disk, has had 100% availability over the past 3 years? Why is it that many business-critical systems with N+1 fault tolerance struggle to reach even 99.9% uptime?
Of course there are lots of reasons: demand profile, number of changes, human error, and just plain luck that the single hard drive in that machine has not failed yet.
However, it often appears that complex ‘high availability’ solutions are not necessarily more reliable.
The truth is that there is a difference between our perception of reliability and theoretical or calculated reliability. We perceive the simple, single server as reliable on the basis that it has not yet failed. As we add more hardware and software to build more complex high availability solutions, however, it sometimes looks like we make matters worse. That’s because we have added more physical components that could fail and more software that needs to be properly configured and managed.
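To put rough numbers on "calculated reliability", here is a minimal sketch of the textbook availability arithmetic. It assumes independent failures, and every figure in it (the 99.9% server, the two failover-layer values) is a made-up illustration rather than a measurement of any real system.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60


def series(*availabilities):
    """Availability of a chain where every component must be up."""
    return math.prod(availabilities)


def parallel(*availabilities):
    """Availability of redundant components where one survivor is enough.

    Assumes failures are independent, which real systems rarely guarantee.
    """
    return 1 - math.prod(1 - a for a in availabilities)


def downtime_minutes_per_year(availability):
    """Expected minutes of downtime per (non-leap) year."""
    return (1 - availability) * MINUTES_PER_YEAR


# Hypothetical figures, chosen only to illustrate the arithmetic.
single_server = 0.999                                  # one box, no redundancy
redundant_pair = parallel(single_server, single_server)

# The failover layer (cluster software, load balancer, its configuration)
# sits in series with the pair: if it is down, the service is down.
well_run_failover = 0.9999
badly_run_failover = 0.998

ha_well_run = series(redundant_pair, well_run_failover)
ha_badly_run = series(redundant_pair, badly_run_failover)

for name, a in [
    ("single server", single_server),
    ("HA, well-run failover", ha_well_run),
    ("HA, badly-run failover", ha_badly_run),
]:
    print(f"{name}: {a:.6f} ({downtime_minutes_per_year(a):.1f} min/year)")
```

Under these assumptions the redundant pair is far better than one server on paper, yet the whole solution is only as good as the extra layer that ties it together: with the badly run failover layer, the "high availability" design comes out worse than the single box.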
The answer lies in looking at the problem with a time dimension. When complex systems are first installed, they often have teething issues: configuration and set-up mistakes, software or patch level problems, new hardware that dies after a short time in use, or operators who do not yet fully understand how to run the new setup. Soon after any new installation there is often a dip in reliability, in some cases making the whole service worse than it was before! Eventually, with hard work and perseverance, the issues get sorted and the system becomes stable. You can speed up the route to stability by employing experts who have done it before or by building on reference platforms that are vendor tested, but in most cases the bedding-in period will not be zero.
No, complexity is not the enemy of reliability. We just need to make sure we have allowed for the bedding-in period and found a way to run in parallel before pressing ‘go’.
The real question is: Will you be forced to throw a system into live service before it is fully stable?