I would like to keep this as an open forum for discussing various DR (disaster recovery) scenarios, so that everyone can pitch in and share their experiences of setting up a DR site. That way we can all learn from one another.
Let me start with the textbook definition of a DR occurrence:
“An occurrence that disrupts the functioning of an Organization resulting in loss of data, loss of personnel, loss of business or time”
Actual DR planning and design involves a lot of thought about the nature of the business and the impact on it if the primary site goes down.
Several factors need to be considered when establishing a DR site. It depends on the organization's type of business and the things it depends on, e.g. vendor services, telecom links, material availability, etc. The choice of DR site should also take into account the political, geographical, natural, human and other risks associated with its location. For example, a software development company that is heavily dependent on international telecom links cannot have its DR site in a rural area where telecom vendors cannot provide data and voice links. Another organization, e.g. a manufacturing company, could probably place a DR site with some essential equipment anywhere that has an electrical supply and transport facilities.
It makes business sense to have the DR site located at an acceptable distance from the main site from a logistics perspective. If essential services have to start rapidly within hours or a business day from an alternative location, the DR site should be located reasonably near your main site to avoid long travel and associated logistics problems. The time to travel to a DR location is a key factor in deciding where it can be located.
There are many potential disruptive threats that can occur at any time and affect the normal business process. We should consider a wide range of potential threats and include the results of our deliberations in the DR plan. Each potential environmental disaster or emergency situation should be examined. The focus should be on the level of business disruption that could arise from each type of disaster.
The decision on where to host the DR site, or on when the DR site should be made active with all services available to business users, is not one individual's call. A DR plan must be created by involving the various departments within the organization, as the DR activity itself depends on various kinds of users. Before creating a plan, every organization must classify its functions in terms of priority and impact.
It’s not just about DR site planning: a proper process document should also be in place, defining which users will connect to the DR site and how, and how end users will be notified that services are available there.
Hosting a DR site is not enough; proper monitoring must be in place to track data and configuration changes against the primary site. Otherwise, when the DR site is activated during an actual disaster, users may not find the last data they entered into the system, or may be unable to log in to their portal because of a configuration mismatch. Once the DR site is implemented, IT often has no visibility into, or confidence about, whether the deployed solution meets the business's RPO (recovery point objective) and whether services will be available within the stipulated RTO (recovery time objective) during an actual disaster.
In order to avoid such situations, regular drills should be planned (perhaps once a quarter) and proper monitoring tools should be installed at the DR site.
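As a rough illustration of the kind of check a monitoring tool might run, here is a minimal Python sketch. It assumes you can obtain the timestamp of the last change replicated to the DR site and a flat key/value view of each site's configuration; the function names, timestamps and settings are all hypothetical and not taken from any particular product.

```python
from datetime import datetime, timedelta, timezone


def rpo_breached(last_replicated_at: datetime, now: datetime, rpo: timedelta) -> bool:
    """True when the data that would be lost right now exceeds the agreed RPO."""
    return (now - last_replicated_at) > rpo


def config_drift(primary: dict, dr: dict) -> dict:
    """Return settings whose values differ, as {key: (primary_value, dr_value)}."""
    keys = primary.keys() | dr.keys()
    return {
        key: (primary.get(key), dr.get(key))
        for key in sorted(keys)
        if primary.get(key) != dr.get(key)
    }


# Hypothetical values, for illustration only.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_sync = datetime(2024, 1, 1, 11, 40, tzinfo=timezone.utc)

# The DR copy is 20 minutes behind, so a 15-minute RPO is breached.
print(rpo_breached(last_sync, now, rpo=timedelta(minutes=15)))  # True

primary_cfg = {"auth_mode": "ldap", "db_version": "12.4"}
dr_cfg = {"auth_mode": "ldap", "db_version": "12.2"}
print(config_drift(primary_cfg, dr_cfg))  # {'db_version': ('12.4', '12.2')}
```

Running checks like these on a schedule, and not only during the quarterly drill, is one way to get early warning that the DR site has fallen behind.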
Some organizations set up DR with a manual switchover of the infrastructure at the DR site, and use native database replication technologies for DB data. For application switchover and monitoring there are solutions such as Sanovi, which may cost the company a fortune but are very effective.
There are also host-based solutions for Wintel and Unix environments that provide fully automated replication and switchover/failover of services, from infrastructure to application, at the DR site.
Organizations that can afford an RPO/RTO of a few days prefer low-cost replication solutions, such as taking backups at the primary site and sending the tapes to the DR site for restore. Failover of the infrastructure and applications is then a manual process.
Such solutions should be chosen and implemented very carefully, and the choice depends entirely on the nature of the organization's business.
DR situations are those that affect physical facilities or the environment, or the health, welfare or safety of personnel or the public, and that affect business operations, due to:
- Earthquakes or other natural catastrophes
- Terrorist attacks
- Riots
- Strikes, etc.
DR can be broken into:
- Business Planning & preparation
- Business Systems & Technology preparation
- Incident Response Planning
What to protect in a DR? A disruption in IT infrastructure can put a customer's business offline for several days, when even a few hours of system downtime can critically harm an organization. There are hundreds of data protection and service availability solutions on the market, but the challenge for organizations is to decide what needs to be protected and in which order services should be made available.
Below I list what a DR site depends on, and what BCP (business continuity planning) should therefore cover so that critical data is protected and business-affecting services are made available:
- Business Functions - Functions which provide products or services
- Critical Support functions - Functions without which the Business functions cannot work (e.g. Facilities, IT)
- Corporate level support functions – Functions required for effective operations of Business Functions (e.g. HR, Finance)
- Most important Resource: Personnel – Although there are other critical resources, the actual product or service in most Organizations depends on actions performed by, and decisions made by people
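Deciding the order in which services come up at the DR site can be sketched as a dependency-aware priority ordering. The Python example below is only an illustration under stated assumptions: the service names, their dependencies and the priority numbers (lower means more urgent) are hypothetical, and a real plan would derive them from the business classification described above. It uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+) so that nothing is started before the things it depends on.

```python
import heapq
from graphlib import TopologicalSorter


def recovery_order(depends_on: dict, priority: dict) -> list:
    """Order services so dependencies come first; among the services that are
    ready to start, pick the one with the lowest priority number."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    ready = []
    order = []
    while sorter.is_active():
        for service in sorter.get_ready():
            heapq.heappush(ready, (priority[service], service))
        _, service = heapq.heappop(ready)
        order.append(service)
        sorter.done(service)
    return order


# Hypothetical services: each maps to the set of services it depends on.
depends_on = {
    "network": set(),
    "database": {"network"},
    "billing": {"database", "network"},
    "payroll": {"database", "network"},
    "hr_portal": {"network"},
}
# Hypothetical business priorities (1 = bring up first).
priority = {"network": 1, "database": 1, "billing": 2, "payroll": 3, "hr_portal": 4}

print(recovery_order(depends_on, priority))
# ['network', 'database', 'billing', 'payroll', 'hr_portal']
```

The priorities need not be static: around month end, for example, the payroll entry could be given a lower number so that it is recovered ahead of billing.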
HA (High Availability) Solutions – Not Actual DR:
A high availability setup for various IT components at the primary site cannot be equated with a real DR site. Customers, however, expect a DR site to make services available just as HA does. Financial organizations look for both a near-line DR and an actual DR site, for data protection and fast recovery of services, as even a minute of data is crucial for them.
But service availability requirements differ between verticals. It’s always a challenge for service providers to help customers understand what kind of DR setup would best suit their business.
Scenarios that are not an actual DR site:
- IT equipment failure at the DC site – services fail over to redundant hardware, which may take less than 30 minutes, after which they are available to business users again.
- Data protection – keeping tape backups and recovering once the original is lost. Not actual DR.
- Near-line DR (NLDR) – RPO = ‘ZERO’ (location < 30 km)
– These setups are mainly implemented for organizations that cannot afford to lose even a minute of data, especially financial firms, investment banks, etc.
– Typically, this setup is a 1:1 mapping of the infrastructure and services between the primary site and the NLDR site.
– It is almost like having a high availability solution for the complete primary site, rather than for a single equipment or service failure.
The actual DR site falls under the criteria where a Service Provider will take care of the Infrastructure and Business needs and will provide Services/Business continuity from a different site.
- RPO is never equal to ‘0’, and the site hosting the protected infrastructure should be more than 50 km away (typically a Tier 3 DC in a different geographic location)
- Complexities / Challenges:
– Recovery of business applications – which departments' applications should be brought up first.
– Prioritization by time of the month – payroll, taxation, billing transactions, etc.
– Under-provisioned DR – only a limited load is transferred to the DR site.
– Connectivity
– Human Behavior
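The distance and RPO rules of thumb given above for near-line DR and an actual DR site can be written down as a small Python helper. This is only a sketch of those two rules as stated in this post; the function name is hypothetical, and anything that fits neither rule (for example a site 40 km away) is deliberately left unclassified rather than guessed at.

```python
def classify_site(distance_km: float, rpo_minutes: float) -> str:
    """Classify a secondary site using the rules of thumb from this post:
    near-line DR is under 30 km with RPO = 0, an actual DR site is more
    than 50 km away with RPO > 0."""
    if distance_km < 30 and rpo_minutes == 0:
        return "near-line DR"
    if distance_km > 50 and rpo_minutes > 0:
        return "DR"
    return "unclassified"


print(classify_site(distance_km=20, rpo_minutes=0))   # near-line DR
print(classify_site(distance_km=80, rpo_minutes=15))  # DR
print(classify_site(distance_km=40, rpo_minutes=5))   # unclassified
```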
The DR infrastructure should be up within minutes, but applications or services may still not be available to business users. Hence the RTO is hard to pin down, as users may take more time to resume business operations.
Whenever an organization starts discussing DR site hosting, the first thing that comes up is RPO/RTO. Service providers agree and commit to the requested RPO/RTO values, but there are always various caveats. Solution architects design the best solution they can to achieve the requested values and sell it on its uniqueness, but almost all service providers follow the same concept when designing a DR solution. Only the components included vary, not the concept.
Sample DR process flow:
To conclude, the DR site should be planned by involving the different departments in the organization that directly or indirectly affect the business. The DR process flow may vary from one organization to another, but the concept remains the same. Most importantly, it’s not just about DR site planning: a proper process document should also be in place, defining which users will connect to the DR site and how, and how end users will be notified that services are available there.


Nicely written, brother!
Thanks. I hope we can discuss more such scenarios.
This is a really good article for beginners. Explained in simple language. Keep it up!!
Very well written. Seems like a good learning resource and checklist in layman's terms. I would like to know more about the DR technologies that can be used for hybrid environments.
Regards,
Francis