A CEO's guide to Disaster Recovery
A few years ago, I got a call at 3:01am. My guess was that lottery winners only get informed during office hours, so I was expecting bad news! Sure enough, our primary data centre site had gone off-line due to a major power failure. The failure had taken down both the primary and backup power feeds, so recovery was going to take some time. That primary site had been built as a fault-tolerant, high-availability (HA) setup with dual servers. However, in this case the entire data centre had failed, so we were beyond what normal HA could handle. It is in this type of scenario that people normally call on their Disaster Recovery (DR) setup.
The company ran critical systems for thousands of customers, so of course we had a plan. We failed over to our second data centre site, which was automatically kept up to date using database synchronisation. All would have been fine, but then two hard disk failures took down one of the servers at that site. Thankfully the second server at that site continued to operate, so we kept all systems online for customers. However, the only things standing between us and customer-facing data loss were that final working server and the extra backup server I had set up as additional protection for DR. Even with that safety net, it was a stressful night.
This was a well-architected setup with multiple levels of fault tolerance, but circumstances had blown through several layers in quick succession:
- Primary site server A - taken out due to data centre power failure
- Primary site server B - taken out due to the same data centre power failure
- Secondary site server A - taken out due to two hard disk failures
- Secondary site server B - continued to operate
- Backup server C - continued to operate
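The list above can be expressed as a small Python sketch that shows how thin the remaining margin was. This is only an illustration: the `Server` type, the site labels and the helper function names are assumptions made for this example, not part of any real monitoring tool. In particular, the location of backup server C is not stated, so its site label is a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str
    site: str      # illustrative label only
    online: bool


# State of each server on the night described above.
servers = [
    Server("Primary A", "primary", online=False),      # data centre power failure
    Server("Primary B", "primary", online=False),      # same power failure
    Server("Secondary A", "secondary", online=False),  # two hard disk failures
    Server("Secondary B", "secondary", online=True),
    Server("Backup C", "backup", online=True),         # site label is a placeholder
]


def surviving_copies(servers):
    """Names of the servers that still hold a live copy of the data."""
    return [s.name for s in servers if s.online]


def failures_until_data_loss(servers):
    """How many more server failures would leave no copy of the data at all."""
    return len(surviving_copies(servers))


print(surviving_copies(servers))          # ['Secondary B', 'Backup C']
print(failures_until_data_loss(servers))  # 2
```

Five servers across several layers had been reduced to two. Counting the copies that are still alive, rather than the number of servers originally bought, is a simple way to see how close a setup is to its worst case.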
Imagine sitting down with your PR or communications team to draft a statement to customers. Which of these two scenarios would you rather not face:
A) a brief service outage; or
B) a brief outage, but having lost some customer transactions…irrecoverably!
It’s one thing to have a service interruption, but quite another to face the prospect of losing customer data. Whether IT forms part of your product or service, or is a support function to your main business, customers quite rightly need to be able to trust your business. A service interruption might be annoying for customers, but losing their data will certainly hurt your customer retention numbers.
If DR were simply an insurance policy, the payback would be the time it saves you in getting systems back up and running. Once you have considered the cost of damage to your reputation or brand, though, you might reassess the risk and business impact of failure. When planning DR expenditure, think this through: make sure you fully understand the potential risks and are comfortable that your level of protection is appropriate to them. It’s easy to believe you have this covered because you spent the money on a backup server, but make sure it is protecting you against your worst-case scenario.