As an Enterprise matures, an inescapable question arises: how do we recover from a partial (or major) Cloud Provider outage, and how do we keep our customers online?
In this post, I will outline some of the basic principles of Disaster Recovery (DR). Unfortunately, there are two hard truths: every DR Plan is tailored to a particular company, and building one takes a lot of development time.
Be Clear on Goals
This may sound simple and obvious, but it is actually a very important decision which will drive the architecture and costs.
Here are some of the key questions:
- What is the magnitude of the Disaster we are designing for?
  - A regional US outage? A complete outage all across the US? Multiple countries?
  - Define the type of service interruptions the system is meant to withstand.
- What needs to stay online?
  - In the case of an outage, what is the critical functionality that must stay up?
  - Are there any functions which can be offline for some time? If so, for how long?
- Which parts of the application have regional data?
  - Ultimately, what makes DR difficult are the pieces which are tied to a region. These are typically data stores, e.g., filesystems, objects, databases, etc.
  - The less dependent you are on regional data, the easier it is to implement DR.
The questions above seem straightforward and obvious, but the answers matter: the requirements for DR have a huge effect on the cost and complexity of the DR implementation.
I can tell you from experience that these questions have generated countless hours of debate!
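One way to end the debate is to write the answers down in a structured form. The following is only a minimal sketch: the class names, field names, and the example services (`checkout`, `reporting`) are hypothetical and are not taken from any particular tool or company.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ServiceRequirement:
    """Answers to the key questions for one service (all fields are illustrative)."""
    name: str
    critical: bool             # must this stay up during an outage?
    max_offline_hours: float   # how long it may be offline; 0 for critical services
    regional_data: List[str] = field(default_factory=list)  # region-bound data stores


@dataclass
class DRGoals:
    disaster_scope: str        # the magnitude of disaster we design for
    services: List[ServiceRequirement]

    def critical_services(self) -> List[str]:
        """Names of the services that must stay online."""
        return [s.name for s in self.services if s.critical]

    def by_regional_dependency(self) -> List[ServiceRequirement]:
        """Services ordered by number of regional data stores, most first."""
        return sorted(self.services, key=lambda s: len(s.regional_data), reverse=True)


# Hypothetical example values, for illustration only.
goals = DRGoals(
    disaster_scope="single US region outage",
    services=[
        ServiceRequirement("checkout", True, 0, ["orders database"]),
        ServiceRequirement("reporting", False, 24,
                           ["object storage", "analytics database"]),
    ],
)
```

Here `goals.critical_services()` lists what must stay up, and `goals.by_regional_dependency()` puts the services with the most regional data first, which is usually where the DR effort will concentrate.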
Choose a DR Strategy
There are basically two types of DR Strategy: Active-Passive and Active-Active.
Active-Passive: In this strategy, two copies of the software stack are deployed, typically in two different geographic regions. One is “Active” or “hot”, i.e., accepting traffic and servicing the customer. The other copy is “Passive” or “cold” and is essentially a minimal configuration on standby.
In the event of a disaster, the customer traffic is diverted from the Active site to the Passive site. At this point, the Passive site scales up to full capacity and becomes the “Active” site. The other site is now “Passive”.
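The failover sequence can be sketched as a simple role swap. This is a toy model under stated assumptions, not a real implementation: the `Site` class, the capacity constants, and the region names are hypothetical, and a real failover would also involve DNS or load balancer changes and data replication.

```python
from dataclasses import dataclass

FULL_CAPACITY = 10     # hypothetical instance count for a site serving all traffic
STANDBY_CAPACITY = 1   # hypothetical minimal configuration for the standby site


@dataclass
class Site:
    region: str
    role: str        # "active" or "passive"
    instances: int   # capacity the site is configured for


def fail_over(active: Site, passive: Site) -> Site:
    """Promote the passive site and demote the failed one; return the new active site."""
    passive.instances = FULL_CAPACITY    # scale the standby up to full capacity
    passive.role = "active"              # it now receives the customer traffic
    active.role = "passive"              # the failed site becomes the standby
    active.instances = STANDBY_CAPACITY  # target size once the failed site recovers
    return passive


# Hypothetical regions, for illustration only.
east = Site("us-east", "active", FULL_CAPACITY)
west = Site("us-west", "passive", STANDBY_CAPACITY)
new_active = fail_over(east, west)
```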
Active-Active: In this strategy, two copies of the stack run simultaneously and both accept customer traffic. If one site suffers an outage, all the traffic is directed to the surviving site. At this point, the system is running in a degraded mode.
Once the Cloud Provider recovers, the other site comes back online and the system is fully recovered.
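The Active-Active routing behaviour can be illustrated with a small function that splits traffic evenly across healthy sites. This is a simplified sketch: the function name and the site names are hypothetical, and real traffic managers typically use weighted or latency-based policies rather than an even split.

```python
from typing import Dict


def route_traffic(site_health: Dict[str, bool]) -> Dict[str, float]:
    """Return the share of traffic per site, split evenly across healthy sites."""
    healthy = [name for name, ok in site_health.items() if ok]
    if not healthy:
        raise RuntimeError("no healthy site available")
    share = 1.0 / len(healthy)
    # Unhealthy sites receive no traffic; the survivors absorb all of it.
    return {name: (share if ok else 0.0) for name, ok in site_health.items()}


normal = route_traffic({"us-east": True, "us-west": True})
degraded = route_traffic({"us-east": False, "us-west": True})
```

In the degraded case the surviving site carries the entire load, which is why it needs enough headroom (or the ability to scale) to absorb the traffic of the failed site.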
Most technologists would prefer an Active-Active configuration, but it is generally more complex.
In practice, most DR is Active-Passive.
Practical Challenges of Disaster Recovery
Design
A system which can survive a Cloud Provider outage needs to be purposefully built. Retrofitting DR into an existing product can be done, but it is very complicated and time consuming.
If you are building a new software stack, be clear on the DR requirements and build them in from Day 1.
Fortunately, today Cloud Providers and third parties are providing data services specifically designed for DR. If possible, use them!
Testing
Just like any other emergency system, a DR Plan needs to be tested to make sure it works as expected. This is one of the main reasons defining a disaster is important: it helps drive the testing of the DR Plan.
At one of my prior companies, a DR drill was held on a Saturday once a year. During the drill, an outage was simulated and all critical applications were tested to ensure they stayed online.
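A drill of this kind can be sketched as a small harness: simulate the outage, run a health check for each critical application, and always restore the environment afterwards. Everything here is hypothetical (the function name, the check names, and the in-memory `state` standing in for a real environment); a real drill would call actual health endpoints and outage tooling.

```python
from typing import Callable, Dict, List


def run_dr_drill(
    critical_checks: Dict[str, Callable[[], bool]],
    simulate_outage: Callable[[], None],
    restore: Callable[[], None],
) -> List[str]:
    """Simulate an outage, run every check, and return the applications that failed."""
    simulate_outage()
    try:
        failed = []
        for name, check in critical_checks.items():
            try:
                ok = check()
            except Exception:
                ok = False  # a check that crashes counts as a failure
            if not ok:
                failed.append(name)
        return failed
    finally:
        restore()  # always end the simulated outage, even if a check blows up


# Toy environment: one app survives the outage, one depends on the primary site.
state = {"primary_up": True}
checks = {
    "checkout": lambda: True,                  # served from either site
    "reporting": lambda: state["primary_up"],  # only works while the primary is up
}
failed = run_dr_drill(
    checks,
    simulate_outage=lambda: state.update(primary_up=False),
    restore=lambda: state.update(primary_up=True),
)
```

In this toy run, `failed` contains only `reporting`, which is exactly the kind of finding a drill is meant to surface before a real outage does.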
Cost
Ultimately, designing and testing for DR is very expensive in terms of time and resources.
That’s why it is important to clearly define which services are critical and therefore justify this additional expense.
Finally, if you are running a DR environment, there are additional Cloud costs for the secondary environment.
Final Thoughts
The reality of today’s connected world is that customers are very demanding and expect 24/7 service. For the most part, Cloud Providers are very reliable, but they do have outages, which ultimately affect your customers.
Retrofitting DR is very complex and there is typically some degraded service until the Cloud Provider recovers.
For new applications, there is an opportunity to take advantage of technologies which were specifically designed for DR.
Utilizing these technologies will make DR much easier and provide resilient services to your customers.

