Federal IT pros must be prepared for every situation. For most, a database disaster is not a matter of “if,” but rather “when.” A database disaster (a crash, a failure, and the subsequent potential loss of valuable data) is going to happen at some point.
The solution is simple: have backups. The first step to surviving any database disaster is having a good backup available and ready.
Too often, agencies assume having a backup is enough. That is simply not the case. While it’s the first step in successfully recovering from a database disaster, it’s certainly not the only step. Agencies should have in place a robust, comprehensive plan that starts with the assumption that a database disaster is inevitable and builds in layers of contingencies to ensure quick data recovery and continuity of operations.
Let’s look at the building blocks of that plan.
Defining the Plan
There can be a lot of confusion surrounding the terminology used when creating a backup and recovery plan. So much, in fact, that it’s well worth the space to review definitions. The following are some of the terms you should be certain and comfortable with before creating your plan:
High Availability (HA): This essentially means “uptime.” If your servers have a high uptime percentage, then they are highly available. This uptime is usually the result of building out a series of redundancies regarding critical components of the system.
Disaster Recovery (DR): This essentially means “recovery.” If you are able to recover your data, then you have the makings of a DR plan. Recovery is usually the result of having backups that are current and available.
(Worth mentioning: HA is not the same as DR. They are two very different things and are often used to complement one another. More on this later.)
Recovery Point Objective (RPO): This is the point in time to which you can recover data if there is a disaster and data is lost, as part of an overall continuity of operations plan. It defines an acceptable amount of data loss, expressed as a period of time. For example, if you establish an RPO of 15 minutes, then you must ensure that a backup takes place at least every 15 minutes. The key here is to establish a number based on actual data, and potential data loss, rather than simply deciding that 15 minutes sounds like a good number. Do the research and set your RPO accordingly.
Recovery Time Objective (RTO): This is the maximum amount of time allowed for you to recover data. For example, let’s say you perform log backups every five minutes. If it takes hours for you to recover lost data because you have to apply all those log backups, that slow recovery defeats the purpose of the tight RPO those log backups were meant to provide.
It is important to note that the continuity of operations and recovery plan should include both a recovery point objective (RPO) and a recovery time objective (RTO).
Estimated Time to Restore (ETR): It is not uncommon to see ETR and RTO used interchangeably; however, they are two different things. ETR estimates how long it will take to restore your data. This estimate will change as your data grows in size and complexity. Therefore, ETR is the reality upon which the RTO should be based. Think of comparing RTO and ETR as you might compare projected uptime versus actual uptime. While those numbers should be similar (or the same), they may be quite different, especially as your infrastructure grows in size and complexity.
Remember to check often to verify the ETR for your data is less than the RTO before disaster strikes. Otherwise, you will not be as prepared as you should be.
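As a rough sketch of the checks described above, the Python below tests two things: whether any gap between consecutive backups exceeds the RPO, and whether an estimated restore time fits inside the RTO. The function names, timestamps, and restore durations are all hypothetical, and the ETR formula (one full restore plus each log backup applied in turn) is a simplifying assumption. Real figures should come from timed restore tests.

```python
from datetime import datetime, timedelta


def largest_backup_gap(backup_times):
    """Return the longest interval between consecutive backups."""
    ordered = sorted(backup_times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return max(gaps, default=timedelta(0))


def meets_rpo(backup_times, rpo):
    """True if no gap between consecutive backups exceeds the RPO."""
    return largest_backup_gap(backup_times) <= rpo


def estimate_etr(full_restore, log_count, per_log_restore):
    """Assumed model: one full restore plus every log backup applied in turn."""
    return full_restore + log_count * per_log_restore


def meets_rto(etr, rto):
    """True if the estimated time to restore fits within the RTO."""
    return etr <= rto


# Hypothetical figures for illustration only.
start = datetime(2024, 1, 1, 8, 0)
backups = [start + timedelta(minutes=5 * i) for i in range(13)]  # log backup every 5 minutes
rpo = timedelta(minutes=15)
rto = timedelta(minutes=30)
etr = estimate_etr(
    full_restore=timedelta(minutes=20),
    log_count=12,
    per_log_restore=timedelta(minutes=1),
)

print("RPO met:", meets_rpo(backups, rpo))
print("ETR:", etr, "| RTO met:", meets_rto(etr, rto))
```

With these made-up numbers the backup schedule satisfies the RPO, yet the estimated restore time overruns the RTO, which is the situation the ETR check is meant to catch before a disaster rather than during one.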
Knowing Your Plan
Now that we know the lingo, let’s think about the plan itself. Some may consider replication all that’s necessary for successful recovery. It’s not that simple. As I stated earlier, agencies should have a robust, comprehensive plan that includes layers of contingencies.
Remember that HA is not the same as DR. For an example of why that is, let’s look at replication, a common piece of technology that is often used as both an HA and DR solution.
Take this scenario: You have a corruption at one site. The corruption is immediately replicated to all the other sites. That’s your HA in action. How you recover from this corruption is your DR. The reality is, you are only as good as your last backup.
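The scenario can be illustrated with a toy model. The Python below is only a sketch: the `ReplicatedDatabase` class, the site names, and the record key are hypothetical, and real replication is far more involved. It shows the one point that matters here: replication copies every write, good or bad, while a backup preserves an earlier state.

```python
import copy


class ReplicatedDatabase:
    """Toy model: every write is copied to all sites at once (the HA layer)."""

    def __init__(self, site_names):
        self.sites = {name: {} for name in site_names}

    def write(self, key, value):
        # Replication does not judge the data; good and bad writes both propagate.
        for data in self.sites.values():
            data[key] = value

    def take_backup(self):
        # A backup is a point-in-time copy kept apart from the live sites (the DR layer).
        return copy.deepcopy(next(iter(self.sites.values())))

    def restore(self, backup):
        # Recovery overwrites every site with the backed-up state.
        for name in self.sites:
            self.sites[name] = copy.deepcopy(backup)


db = ReplicatedDatabase(["site_a", "site_b", "site_c"])
db.write("case_42", "valid record")
backup = db.take_backup()

db.write("case_42", "\x00corrupted\x00")  # corruption reaches every site
all_corrupt = all(d["case_42"] != "valid record" for d in db.sites.values())

db.restore(backup)  # only the backup can undo it
all_restored = all(d["case_42"] == "valid record" for d in db.sites.values())

print("corrupt everywhere:", all_corrupt)
print("restored everywhere:", all_restored)
```

Anything written after the backup was taken would be lost in the restore, which is why the age of the last backup sets the limit on what can be recovered.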
Another reason for due diligence in creating a layered plan: what if your RPO and RTO agreements are no longer (or never were) compatible?
For example, perhaps your RPO states that your agency database must be put back to a point in time no more than 15 minutes prior to the disaster, and your RTO is also 15 minutes. That means that if recovery takes the full 15 minutes and restores the database to a point 15 minutes prior to the disaster, up to 30 minutes (and maybe more) will separate the last recoverable data from the moment operations resume.
A 30-minute gap rather than the expected 15 can make a dramatic difference in operations. Research, layers, and coordination between those layers are critical for a successful backup and recovery plan.
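The arithmetic in the RPO and RTO example above can be sketched in a few lines of Python. The function name `worst_case_gap` and the 15-minute tolerance are hypothetical, chosen only to mirror the example. The sketch simply adds the two objectives to show the span between the last recoverable data and the return to service.

```python
from datetime import timedelta


def worst_case_gap(rpo, rto):
    """Worst-case span between the last recoverable data and resumed operations."""
    return rpo + rto


rpo = timedelta(minutes=15)        # data may be up to 15 minutes stale
rto = timedelta(minutes=15)        # recovery may take up to 15 minutes
tolerated = timedelta(minutes=15)  # hypothetical figure stakeholders expect

gap = worst_case_gap(rpo, rto)
within_expectation = gap <= tolerated

print("worst-case gap:", gap)
print("within expectation:", within_expectation)
```

Checking the combined figure against what stakeholders actually expect is one way to notice that the two objectives were agreed separately and never reconciled.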
A final point to consider in your disaster recovery plan is cost. Prices rise considerably as you try to narrow the RPO and RTO gaps. As you approach zero downtime and zero data loss, your costs skyrocket, particularly when large volumes of data are involved. Uptime is expensive.
Cost is the primary reason some agencies settle for a less-than-robust plan, and why others sometimes decide that (some) downtime is acceptable. Sure, downtime may be tolerable to some degree, but not having backups or a DR plan in place for when your database will fail? That is never acceptable.