Disaster Recovery: Zero to Hero

According to Wikipedia’s textbook definition, Disaster Recovery, aka DR…

…involves a set of policies, tools and procedures to enable the recovery or continuation of vital technology infrastructure and systems following a natural or human-induced disaster. Disaster recovery focuses on the IT or technology systems supporting critical business functions, as opposed to business continuity, which involves keeping all essential aspects of a business functioning despite significant disruptive events. Disaster recovery can therefore be considered as a subset of business continuity.

Proper Preparation, Planning and Practice!

You may or may not realize it, but I have spent the majority of my professional career solely focused on being one of the DR ‘guys.’ Just the thought of Disaster Recovery brings back many, many memories of when the team and I used to trudge off to the Sungard MegaCenter in Philadelphia, PA (ohhhhh good ole Broad Street in Philly) every few months to “practice” DR. These tests were very expensive to conduct, but they were extremely valuable. How many organizations today actually practice failure scenarios? In our case, the tests usually lasted 24, 36 or even 48 hours. Each go-around, Senior Leadership of the business would meet with IT Leaders to determine which applications were:

  1. Mission Critical
  2. Not yet tested
  3. Previously tested, but deemed a failure

The initial takeaway: DR isn’t something which just happens. It requires proper preparation, planning and practice!! As I mentioned, in my case these test exercises lasted up to 48 hours. Believe me when I tell you, it was a sweat to get things up and running within the given period of time, and in some instances we failed. It is better to fail and learn from mistakes during practice exercises than to fail on game day.

Oh the good ole days…

Looking back at how operations were conducted just 7 or 8 years ago, things were so wrong – but were they really wrong? Hindsight is always 20/20…Recovery was not achieved by simply failing over from one location to another. Most operations – yes, even Fortune 500 companies – would be, at best, recovering from tape. Yup, TAPE! A realistic Recovery Time Objective for tape is measured in days!!! When that Tornado, Category 5 Hurricane with wind speeds clocking well over 100 MPH, or Forest Fire struck, for all intents and purposes most if not all organizations would have been completely crippled. I’ve only mentioned natural disasters…what about the Zombie Apocalypse? Or worse, the time late on a Friday afternoon when that on-site support engineer from your favorite 3-Letter vendor accidentally removed several disk shelves from the production array and then proceeded to place the disks back in the tray incorrectly (the story went something like that)?!? Yeah, you guessed it – Tape to the rescue….or not to the rescue?

This all sounds so wrong – but was it? Mind you, this was around the 2010 – 2012 time frame. At this point, just like most organizations, we too were still figuring out if this virtualization thing was right or not. I can actually remember back to my first GSX server experience, which ran on a Dell tower box underneath my colleague’s desk, and no one knew it was actually running VMs…shhhhh, don’t tell.

When recovery is measured in days not hours, is it really truly disaster recovery? In these failure scenarios, does your organization know where to begin restoring? In your current company, do you know exactly which applications are most critical to the business and EXACTLY how to recover them? Chances are pretty good that you don’t. Are there documented runbooks listing the specific services which need to be brought online, in which order, and whom to contact to validate an application when recovery is completed?
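To make the runbook question concrete, here is a minimal Python sketch of how a recovery order could be derived from documented dependencies. Everything in it is hypothetical: the service names, the contact addresses, and the `RunbookEntry` and `recovery_order` names are illustrations only, not part of any particular DR product. It assumes Python 3.9 or later for the standard-library `graphlib` module.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class RunbookEntry:
    service: str                 # hypothetical service name
    validator: str               # who confirms the service works after recovery
    depends_on: list[str] = field(default_factory=list)


def recovery_order(entries):
    """Return service names so that every dependency comes before its dependents."""
    graph = {entry.service: set(entry.depends_on) for entry in entries}
    return list(TopologicalSorter(graph).static_order())


# Hypothetical runbook: names and contacts are made up for illustration.
RUNBOOK = [
    RunbookEntry("order-portal", "app-owner@example.com", ["database", "dns"]),
    RunbookEntry("database", "dba-oncall@example.com", ["storage"]),
    RunbookEntry("dns", "network-oncall@example.com"),
    RunbookEntry("storage", "storage-oncall@example.com"),
]

if __name__ == "__main__":
    contacts = {entry.service: entry.validator for entry in RUNBOOK}
    for step, service in enumerate(recovery_order(RUNBOOK), start=1):
        print(f"{step}. bring up {service}, then validate with {contacts[service]}")
```

A circular dependency makes `static_order()` raise `graphlib.CycleError`, which is a useful thing to discover on paper rather than in the middle of a recovery.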

One fact remains constant…Downtime Costs Money! Many studies have been conducted through the years which focus on measuring the actual costs an organization faces when systems go offline. It’s easy to see that some companies would not be able to survive a disaster without a proper Disaster Recovery plan in place.

The Virtualization and Cloud Era

A proper Disaster Recovery plan is crucial to the overall Business Continuity Plan or Strategy which every organization needs to have in place. Much like the way applications and services were completely flipped on their heads, Virtualization and now Cloud have completely flipped the way companies look at architecting their DR strategies…or have they?

It’s a well-known fact of life: Virtualization has made the overall management of application resources (i.e., Virtual Machines) significantly better than when those same applications were deployed on bare-metal hardware. One would think that with updated server methodologies would come updated Backup and Disaster Recovery methodologies as well, and in some regards they have. No longer do we deploy agents within a Guest VM to perform backup. Over the course of the last few years, pretty much every single backup vendor on the planet has figured out how to conduct host-based, image-level VM backups. Sure, specific vendors have their nuances in this regard, but for all intents and purposes backup is easier than it was before. Unfortunately, the concept of Disaster Recovery has not received that same level of attention. DR is still complex, slow and unreliable, and in many extreme cases organizations don’t even have a secondary site to perform DR to. What about Cloud-based apps and services, how do you DR those?

If you’re waking up in the morning, looking in the mirror and you don’t have a DR Plan – you best be getting on that project ASAP. It’s not a matter of if…it is a matter of when. Circling back to our Wikipedia definition…

…DR involves a set of policies, tools and procedures to enable the recovery or continuation of vital technology infrastructure.

Enter Current Day Disaster Recovery…

Times have significantly changed. Computer Systems and the data they serve provide the life-blood of virtually every organization on the planet, all the way from Joe’s Bait & Tackle Shop to the more obvious…Netflix, Amazon, Macy’s, Airline Carriers – spanning across every vertical – Hospitals, Police, Fire, Emergency Dispatch. Hey man, when I dial 911 the person on the other end of the phone better be able to get the EMT out STAT!

Senior Executives require the assurance that when Disaster strikes, the business’ computer systems can be recovered in a fast, reliable and simple fashion. Proper DR planning, documentation and execution need to be addressed. Frequent tests must be conducted to ensure that when things do go wrong, systems can be recovered within the desired amount of time.
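One way to keep score after each test is to compare what was measured against what the business asked for. The sketch below is only an illustration: the `DrillResult` fields, the application names and every number are made up, and it assumes the business has already agreed on a recovery time target (RTO) and a recovery point target (RPO) for each application.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class DrillResult:
    application: str
    rto: timedelta            # agreed target: maximum time to recover
    rpo: timedelta            # agreed target: maximum tolerable data loss window
    recovery_time: timedelta  # measured during the test
    data_loss: timedelta      # measured: age of the newest data that was recovered


def failed_targets(result):
    """Return the names of the targets this test result missed."""
    missed = []
    if result.recovery_time > result.rto:
        missed.append("RTO")
    if result.data_loss > result.rpo:
        missed.append("RPO")
    return missed


# Hypothetical results from a single DR test; all values are invented.
DRILLS = [
    DrillResult("order-portal", timedelta(hours=4), timedelta(minutes=15),
                recovery_time=timedelta(hours=3), data_loss=timedelta(minutes=10)),
    DrillResult("reporting", timedelta(hours=24), timedelta(hours=1),
                recovery_time=timedelta(hours=30), data_loss=timedelta(hours=4)),
]

if __name__ == "__main__":
    for drill in DRILLS:
        missed = failed_targets(drill)
        status = "PASS" if not missed else "FAIL (" + ", ".join(missed) + ")"
        print(f"{drill.application}: {status}")
```

Applications that miss a target are natural candidates for the next test, in the same spirit as retesting anything previously deemed a failure.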

So, how can you get started today? Start by meeting with Business and Application owners regularly. Do not be afraid to ask questions, and begin documenting the most critical applications within an Application Runbook. Like most IT concepts, technology is a single piece of the puzzle and, as I have shared, the DR puzzle has many pieces: business stakeholders, applications, frequent DR testing and proper documentation. Performing these steps puts you in the best position to execute when it counts. In summary, this boils down to a few key points to consider whenever building out a DR Strategy:

  • Prevention
  • Detection
  • Correction

To wrap things up, a successful Disaster Recovery Strategy meets the business’ Recovery Point and Recovery Time requirements, allowing everyone to sleep better at night knowing that when things go awry, it’ll be alright.

Let’s keep the conversation going. Share your thoughts in the comments section below.