Information Technology Disaster Recovery Plan: Case Study
Adnan Omar, David Alijani and Roosevelt Mason, Academy of Strategic Management Journal
Webster's Dictionary defines a disaster as "a sudden calamitous event bringing great damage, loss, or destruction; a sudden or great misfortune or failure." In a contemporary IT context, a disaster is an event that shuts down a computing environment for more than a few minutes, often for several hours, days or even years. A disaster can wipe out a company's normal business day or even its entire IT infrastructure. While not different in kind from other outages, an outage of a company's IT infrastructure spreads over a wider area and affects more components. It is no longer a question of whether disaster will occur: it will. Thus, establishing reliable disaster recovery (DR) capabilities is critical to ensuring that an organization will be able to survive significant events. Understanding when to initiate DR procedures during an event is critical to achieving expected DR outcomes (BEC Technologies, 2008).
The devastation wrought by Hurricane Katrina in New Orleans forced businesses and universities to seriously reevaluate their DR Plans. Many businesses could not operate because they did not have plans in place. Entire IT infrastructures were crippled by the flooding that resulted from the storm, and many organizations did not have a DR site outside the affected area, which left them without a way to immediately move forward. Disaster recovery is becoming an increasingly important aspect of enterprise computing. As devices, systems, and networks become ever more complex, there are increasingly more things that can go wrong. Consequently, recovery plans have also become more complex (Toigo, 2002). A disaster recovery plan establishes how a company or organization can reinstate its IT systems and services after a significant large-scale interruption.
The principles of Disaster Recovery and Business Continuity Planning are quite straightforward: creating a remote DR center is the first step in developing a well-organized plan, and this will directly affect the recovery capabilities of the organization. The contents of the plan should follow a logical sequence and be written in a standard and understandable format. Effective documentation and procedures are extremely important in a disaster recovery plan. In the wake of Hurricane Katrina, Houston Community College (HCC) in Houston, Texas, has played a pioneering role in developing a DR plan, and continues developing its systems for the future. The objective of this study is to discuss the Information Technology Disaster Recovery Plan at HCC.
Statement of Problem
The flooding that resulted from Hurricane Katrina in New Orleans, Louisiana, in 2005 compelled most businesses and universities in vulnerable areas to reevaluate their Disaster Recovery plans. Several businesses were crippled because they lost records and information spanning several years. Concerned about their ability to operate if disasters of similar magnitude recurred, managements developed DR plans that describe the IT mechanisms for the purpose of bringing a functioning system back online (Robb, 2005). These plans also help organizations reinstate their IT systems and services after a significant large-scale interruption with a minimal time lag. Without a DR plan in place most businesses run the risk of crippling data loss.
DR solutions are expensive, but with a little planning and foresight DR does not have to be an all-or-nothing proposition (Robb, 2005). Costs can vary tremendously, depending on the needs of each organization, the assessment of the threat, and the level of security one seeks. For most small to midsize institutions, one of the most affordable DR solutions would consist of a remote location for storing tape backups. For larger organizations, however, this method would be unacceptable, although the larger potential for data loss and the steep costs could compel managements to scale down DR plans so that they protect only the most critical applications.
Statement of the Objective
Modern organizations recognize that success increasingly depends on their ability to provide information on demand to customers, suppliers, employees, and other stakeholders. This reality has forced managers to seek better ways to protect their information assets and to prepare for quick recovery in case of a disaster. A business's definition of a disaster varies according to its business model, geographic location, and other factors. System, component, network, and application failures that result in downtime, data loss, and serious financial impact can also be defined as a disaster, whether or not caused by a natural disaster (Greyzdorf, 2007).
However it may define disaster, an organization owes its customers the ability to continue business after a disaster (Lewis, 2006). HCC management realized that the tape backup system in use during Hurricane Katrina would not help in the event of a disaster. The HCC management goal was to create a Disaster Recovery Plan that would have the IT infrastructure operational within 12 hours after a disaster and the business operations up and running within 24 hours at minimum cost.
REVIEW OF LITERATURE
In a moderately short time frame (roughly half a century) disaster response and recovery has been approached from a number of diverse perspectives, including structural functionalism (Bates & Peacock, 1987), and conflict and symbolic interaction (Nigg, 1995). In the process, it has also evolved through at least two distinct paradigms: hazard and vulnerability. The earliest body of research, which began to emerge as recognizable research literature in the 1960s, framed the environment as an agent of disaster or hazard. Accordingly, risk and disaster are embedded within the natural environment, technology, or the built environment. Inherent to this paradigm is the conviction that individuals, businesses and communities are victims of extreme events and dependent on outside or professional assistance for their recovery. Later research acknowledged the role of social vulnerability, manifested through preexisting social structures and conditions, in the explanation of the impacts and responses to disasters. Like the hazard paradigm, this perspective discounts the capacity of local communities to respond appropriately and constructively to disaster. More recently, as disaster research evolved along with other social science research, researchers are acknowledging the importance of both the hazards and vulnerability paradigms (Tootle, 2007).
Most researchers now recognize that both environmental and social processes affect the impacts of disaster and the disaster recovery process. However, Meta Group research (2003) found only 20% of Global 2000 organizations have effective business continuity plans (BCP) to help them in the case of disaster. The study shows this lack of preparation is due to the fact that many of them have failed to distinguish between BCP and DR plans. In fact, an adequate BCP should include human resources, facilities, management as well as the executive board (Susanto, 2003). Meta Group (2003) also said that IT as one of the main business functions should be managed along with other components as an overall BCP. Kenneth Hewitt explains the failure of many organizations to establish a viable BCP by suggesting that the voices of active participants in the recovery process are mostly missing in the long-term recovery process (Tootle, 2007).
Since the terrorist attacks on September 11, 2001, more enterprises have moved BCP from a "complementary" to a "compulsory" status for the organization. However, many are still unable to correctly identify all the points that should be considered. BCP is often regarded as the IT organization's DR plan. This assumes that business continuity can be simply guaranteed by having a good backup system for the computer network. While this assumption is not unreasonable, since IT systems are gaining ever greater prominence in the overall structure of many corporations, the following issues related to BCP should be considered (Susanto, 2003):
* Premises and geographic issues: When a disaster has destroyed or damaged an organization's premises, the company manager's first task is to find an emergency site to continue operations while the original site is being repaired. This ensures that the operations are not completely shut down. Selecting an emergency site requires a careful plan which stresses that minimal time is spent allocating temporary authority to initiate recovery action (Savage, 2002). Geography can sometimes cause uncertainty about an organization's ability to continue. Compromise between accessibility and safety is often necessary: while a backup site near the original site means high accessibility, it may be less secure in the event of a bigger disaster (Susanto, 2003).
Suggested solution: Analyse business needs and explore the possibility of establishing multiple backup sites. More resources can be dedicated to data communication channels, i.e., developing a backup site at a distance that can communicate as fast as the one next door.
* IT issues: The importance of IT issues has made many people confuse BCP with IT recovery (DRP). Fortunately, many businesses are well prepared for this issue. Some have allocated a certain amount within their budget specifically for IT and the DRP system. The plan should also include details of a communication method, network infrastructure, and third-party vendors; all should be carefully documented within BCP along with external storage/data. An appropriate strategy determined by the significance of IT matter within the business can be chosen from the following: full replication, vendor parallel/semi-parallel or relocation. Many third-party vendors will provide protection ranging from the widely-used tape backup system to state-of-the-art "high availability solutions" which will simultaneously backup all incoming and outgoing data from the system (Susanto, 2003).
Suggested solution: A business should choose the appropriate IT recovery plan; this is not always the expensive "top of the line" option. A well-suited DR plan will maximize all IT resources and capabilities to allow a business to survive a disaster. This plan should be reviewed and adjusted as the business grows (Susanto, 2003).
* Customer service issues: It is important to keep customers informed about effects of the disaster which may cause the disruptions of production or service, and the progress of recovery; customers must also be informed as soon as the business is back online. Keeping customers informed will build trust in the business relationship (Susanto, 2003).
Suggested solution: A customer is the most important entity in any business. Provide them with honest information and immediate solutions, and keep them informed with the progress of business recovery (Susanto, 2003).
* Human resources issues: A major disaster will most immediately affect the employees and possibly their families. It is the organization's responsibility to keep employees informed, and to organize the handling of the process, including emergency contact information and keeping in touch with all the employees. Bigger organizations are now also concerned about having all their senior management located on the same floor or even in the same building (Susanto, 2003). For them, losing a little convenience is not as unpleasant as losing the whole business structure. Splitting resources across more than one location will enable the company structure to survive even if one building is destroyed.
Suggested solution: The solution may vary depending on the size of the business. Multiple-site businesses should consider splitting company resources across different locations or different parts of the building/office (Susanto, 2003).
* Documents: BCP should also consider the existence of crucial documents, including the BCP itself, printed stationery, emergency contact details, and document location and accessibility. These ensure that inbound and outbound communication can be initiated soon after the disaster to avoid worse outcomes (Susanto, 2003).
Suggested solution: Create an offsite storage area where the company can keep extra stationery, i.e., letterhead, business cards, etc., as well as copies of the BCP document for emergencies (Susanto, 2003).
Disaster Recovery: Business Continuity Plan (BCP) Life Cycle
No organization can have complete control over its business environment. It is therefore essential for companies to have a business continuity management (BCM) capability, in case of crisis or disaster. BCP is a complex plan and sometimes causes confusion about when, where and how to start developing it. The Business Continuity Institute (BCI) has developed a roadmap which is called BCM/BCP life cycle (Susanto, 2003).
Forrester Research and the Disaster Recovery Journal have partnered to field a number of market studies in business continuity and disaster recovery in order to gather data for company comparison and benchmarking and to guide research and publication of best practices and recommendations for the industry. This is the first study, and it is focused on gathering a baseline of company DR preparedness (Balaouras, 2007).
The BCI principles and frequently asked questions have been drawn together to create the BCP life cycle, an interactive process tool to guide the implementation of an effective BCP process. There are six stages in the BCM life cycle process, as shown in Figure 1.
[FIGURE 1 OMITTED]
Each organization needs to assess how to apply the 'good practice' contained within the guidelines to its unique situation. It must ensure that the BCM competence and capability meets the nature, scale and complexity of the business, and reflects the individual culture and operating environment (BCI, 2003).
The primary objective of DR planning is to protect the organization in the event that its operations or computer services are rendered unusable. Preparedness is the key. The planning process should minimize the disruption of operations and ensure some level of organizational stability and an orderly recovery after a disaster. A DR plan is a comprehensive statement of consistent actions to be taken before, during and after a disaster. The plan should be documented and tested to ensure the continuity of operations and the availability of critical resources in the event of a disaster. Most businesses depend heavily on technology and automated systems, and their disruption for even a few days could cause severe financial loss and threaten survival. The continued operations of an organization depend on management's awareness of potential disasters, its ability to develop a plan to minimize disruptions of critical functions and the capability to recover expediently and successfully. HCC chose a solution based on Oracle's Data Guard to administer its DR plan. With Data Guard's features, HCC is able to use database log shipping to maintain a primary and a standby database.
The production environment for HCC consists of a 3-tier PeopleSoft System and a Vignette Application, all running on Oracle databases sitting on UNIX operating platforms. The systems running are PeopleSoft (Finance, Human Resources and Campus Solutions), each having a web server, application servers and process schedulers to make up the PeopleSoft 3-tier environment. The operating system platforms are UNIX Solaris and Windows 2000. Figure 2 below is an example of a PeopleSoft 3-tier environment with the process scheduler for reporting, application server and web server:
[FIGURE 2 OMITTED]
As shown in Figure 2, every component connects to the Oracle database. At HCC, the databases are Oracle versions 9i and 10g. The databases reside on a Solaris operating system running on the UNIX platform. The database is the most important piece for ensuring little to no data loss when replicating to the remote site. In the main data center the production database is referred to as the primary production database, and the production database in the remote data center is referred to as the secondary production database or standby database. This setup allows for multiple DR centers, which can also have multiple remote production database copies.
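A quick way to confirm which role a given database currently holds is to query V$DATABASE; a minimal sketch:

```sql
-- DATABASE_ROLE reports PRIMARY or PHYSICAL STANDBY for this instance,
-- making it easy to confirm the role of each site before any switch.
SELECT name, database_role FROM v$database;
```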
Remote DR Center: CyrusOne
The DR data center is usually at a location that is considered safe. The HCC data center building was built to withstand rain and wind, and is equipped to cope with power outages. The remote DR center that HCC uses is called CyrusOne, a stand-alone, single-tenant building that protects systems from many natural and man-made causes of outages. Each component of the CyrusOne datacenter is designed to ensure maximum availability in all conditions.
CyrusOne specializes in cutting-edge DR configurations and solutions, with facilities designed for full protection and compliance. The hardware located at this datacenter duplicates the production hardware; to ensure that performance is not compromised, the hardware in both datacenters should be identical.
Replication is the copying of data from one system to another system. The end result is two consistent and equally workable data sets in multiple physical locations. The primary database is the online production database that is used for everyday business. The primary database is located in the main data center. The standby database is the offline production database used to duplicate production data and it is located at CyrusOne, the remote DR center (see Figure 3).
Oracle application Data Guard is used to help manage data replication. HCC has also created a manual management process for data replication. The process was designed using Oracle's Data Guard log shipping. Log shipping allows high availability of the data to the remote site with limited loss of data. This method also allows recoveries to be performed independently of the database location, which means that if the primary database crashes for any reason, the standby database can recover data and transfer it to the primary database when it becomes available. The recovery method is designed to allow for load balancing in all situations of a highly available database, during normal processing, takeover and online self-repair.
Oracle Log shipping occurs at the database level rather than at the server level, which means that if something happens to the server it does not affect the database. One of the disadvantages of Oracle log shipping is that it is network dependent (see Figure 3). As a result, there is some latency involved from the primary to standby database. The network infrastructure plays a major part in the speed and size of the logs being shipped. Because of the uncertainty of the network traffic, Data Guard is set up to ship the log files based on size and time of the last log shipped. This means that the log shipping parameter is set to ship logs when the log size reaches 20 kb or the time when the last switch was greater than 30 minutes. This also helps control data loss. With these parameters set in Data Guard, the management expectation of having only 30 minutes of data loss in the case of a disaster is being achieved. There are no built-in failover capabilities for log shipping, which means that some downtime has to be incurred in the switch to the standby server. This also means that a maximum of 30 minutes of data loss may occur.
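As a hedged sketch of how such a time ceiling can be enforced on the primary: Oracle's ARCHIVE_LAG_TARGET parameter forces a log switch after the given number of seconds, while the size trigger is governed by the online redo log file size. The STBY service name below is illustrative, not HCC's actual configuration.

```sql
-- Force a log switch at least every 30 minutes (value is in seconds),
-- bounding potential data loss at the standby to roughly half an hour.
ALTER SYSTEM SET ARCHIVE_LAG_TARGET = 1800 SCOPE = BOTH;

-- Ship each archived log to the standby as it is produced; STBY is an
-- illustrative net service name defined in tnsnames.ora.
ALTER SYSTEM SET LOG_ARCHIVE_DEST_2 = 'SERVICE=STBY' SCOPE = BOTH;
```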
[FIGURE 3 OMITTED]
Testing the DRP is the key to ensuring its success. HCC has two system-level testing dates per year to ensure that the data replication is accurate. The testing plan includes interrupting the production system and connecting to the remote data center. The network is re-routed to the remote site and all applications are activated at the remote data center.
Configuring the Primary Database
To set up Oracle log shipping, the following parameters need to be configured in the primary database:
* Archive log mode--This allows every transaction in the database to be captured in the log files. To set up a remote standby database, archive log mode has to be turned on.
* Force Logging--This forces the database to record every transaction that happens in the database. If a structural change happens to the database, force logging will make the database write it into the log files. Without force logging, a structural change on the primary database would require the standby database to be rebuilt.
* Networking Components--This allows the primary database to communicate with the standby database. In the tnsnames file located on the primary server, create a net service name that the primary database can use to connect to the standby database. Also on the primary host server, create a net service name that the standby database, when running on the primary host, can use to connect to the primary database when it is running on the standby host.
* Initialization Parameters--Most of the configuration options for primary and standby servers are implemented with a normal database creation for any Oracle instance. Since the primary and standby host servers are identical (memory, disks, CPU, etc.), the initialization files for both databases are almost identical, with the exception of four key parameters that must be set on the primary database for a successful standby database configuration.
These parameters allow the production database to be configured for replication. A standby database can then be created from a backup of the primary production database. Using either a cold or hot (online) backup of the primary database, a standby database is created to resemble the primary database. A key step is to also copy all of the archived redo logs from the primary database in order to bring the standby database to a consistent state.
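The primary-side preparation described above can be sketched as a SQL*Plus session. This is a sketch under the assumptions of a typical Oracle 9i/10g Data Guard setup; the STBY service name and the exact parameter choices are illustrative, not HCC's actual values.

```sql
-- Enable archive log mode (requires a clean restart into MOUNT).
SHUTDOWN IMMEDIATE;
STARTUP MOUNT;
ALTER DATABASE ARCHIVELOG;

-- Force redo logging even for operations requested with NOLOGGING,
-- so no change on the primary is lost to the standby.
ALTER DATABASE FORCE LOGGING;
ALTER DATABASE OPEN;

-- Typical primary-side Data Guard parameters (illustrative values);
-- LOG_ARCHIVE_DEST_2 ships archived logs to the STBY net service name.
ALTER SYSTEM SET LOG_ARCHIVE_DEST_2 = 'SERVICE=STBY' SCOPE = BOTH;
ALTER SYSTEM SET LOG_ARCHIVE_DEST_STATE_2 = 'ENABLE' SCOPE = BOTH;
```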
Configuring the Standby Database
There are two types of standby database, logical and physical. The main difference between the two is that the logical standby database can be opened read/write, even while it is in apply mode. This is mostly used to run reports so that there is no additional load on the production server. A physical standby database is an exact copy of the primary database. Oracle uses the primary database's redo logs to recover the physical standby database, which can only be opened in read-only mode. Once the log files are received, the managed recovery process automatically applies the transactions to the database. Oracle log shipping requires the following parameters to be set in the standby database:
* Archive log mode--This allows every transaction in the database to be captured in the log files. This is not required on the standby database.
* Networking Components--This allows the standby database to communicate with the primary database. On the standby host server, create a net service name that the standby database can use to connect to the primary database. Also on the remote host server, create a net service name that the primary database, when running on the remote host, can use to connect to the standby database when it is running on the primary host. This is only needed on the remote host if the roles of primary and standby will be switched.
* Initialization Parameters--This file will not exist on the standby server. It can be created from a copy of the primary server's file. The four parameters that need to be set on the standby database and differ from the primary database are:
* FAL_CLIENT--defines the standby database itself, the client that requests missing archive logs
* FAL_SERVER--defines the primary database, the server from which missing archive logs are fetched
* LOG_ARCHIVE_DEST_2--defines the location on the remote server to place the log files
* LOG_ARCHIVE_DEST_STATE_2--defines the log file states
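A hedged sketch of how these four standby-side parameters might be set, with illustrative PRIM and STBY net service names (not HCC's actual values):

```sql
-- FAL_SERVER points at the primary (the fetch archive log server),
-- while FAL_CLIENT names this standby itself.
ALTER SYSTEM SET FAL_SERVER = 'PRIM' SCOPE = BOTH;
ALTER SYSTEM SET FAL_CLIENT = 'STBY' SCOPE = BOTH;

-- The second archive destination is used only after a role switch,
-- when this site becomes the primary and ships logs back to PRIM.
ALTER SYSTEM SET LOG_ARCHIVE_DEST_2 = 'SERVICE=PRIM' SCOPE = BOTH;
ALTER SYSTEM SET LOG_ARCHIVE_DEST_STATE_2 = 'ENABLE' SCOPE = BOTH;
```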
[FIGURE 4 OMITTED]
Once the standby database has been created and configured, it has to be put in listening mode, also referred to by Oracle as "STANDBY MODE" (see Figure 4). This can be done using the following commands from a database prompt:
* SQL> startup nomount
ORACLE instance started.
* SQL> alter database mount standby database;
Database altered.
At this point the creation of the standby database is complete. The next step is to synchronize the standby database with the primary database by applying the log files as they are shipped to the remote server. To do this, execute the following command:
* SQL> alter database recover managed standby database disconnect from session;
Once everything has been set up, verify that database modifications are being successfully shipped from the primary database to the standby database. This can be done by checking the existing archived redo logs on the standby database. Archive a few logs on the primary database, and then check the standby database again to make sure they have been shipped to the standby server.
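This check can be made concrete with a pair of queries, assuming a physical standby:

```sql
-- On the primary: force a log switch, then note the latest sequence.
ALTER SYSTEM SWITCH LOGFILE;
SELECT MAX(sequence#) FROM v$archived_log;

-- On the standby: confirm the same sequence has arrived and been applied.
SELECT MAX(sequence#) FROM v$archived_log WHERE applied = 'YES';
```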
Implementing Failover Operations
Remember, there are no built-in failover capabilities for Oracle log shipping in a disaster situation, which means that some downtime must be incurred. The failover operation transitions the standby database to the primary role in response to a failure or disaster on the primary database. During a failover operation, the standby database becomes the primary database and the old primary database is considered lost. This is usually the case when the primary database is unavailable and there is no possibility of restoring it to service within a reasonable amount of time. A failover recovers whatever data was propagated to the standby database before the primary database became unavailable. Once a standby database has been activated, it cannot be returned to standby recovery mode, because an implicit RESETLOGS is performed upon activation. A few steps need to be completed to bring the standby database online. These include:
* Verify that the Primary Database is unavailable
* Cancel managed recovery
* Activate standby database
* Restart database
* Open Database
* Add the Temporary Tablespace
* Modify the TNSNAMES file so that outside connections can be made
Once the database is open, make sure that all transaction logs from the primary database have been restored on the standby server. Once the database has been brought online, all applications will need to change their connections.
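The failover steps above can be sketched as the following SQL*Plus session on the standby; the temporary tablespace name and file path are illustrative assumptions:

```sql
-- 1. Cancel managed recovery on the standby.
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;

-- 2. Activate the standby as the new primary. This performs an implicit
--    RESETLOGS; the old primary cannot be re-synchronized afterwards
--    without a rebuild.
ALTER DATABASE ACTIVATE STANDBY DATABASE;

-- 3. Restart and open the database in its new role.
SHUTDOWN IMMEDIATE;
STARTUP;

-- 4. Recreate the temporary tablespace (name and path are illustrative).
ALTER TABLESPACE temp ADD TEMPFILE '/u01/oradata/temp01.dbf'
  SIZE 512M REUSE;
```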
A survey conducted by Forrester/Disaster Recovery Journal in October 2007 revealed that 45% of respondents spend less than $500,000 annually on disaster recovery, whereas 20% spend between $500,000 and $1.49 million. Exactly how closely these numbers correlate with the annual budgets of the companies is unknown. Spending, however, generally increases with company revenue, which is not surprising; the higher a company's revenue, the more that company is willing to spend to protect revenue from probable causes of operational downtime (Forrester Research and Disaster Recovery Journal, 2007).
HCC has a student and staff population of approximately 300,000, and a total annual operating budget of $213,132,222, of which approximately $7,725,700 is allocated to IT operations. Of the IT allocation, HCC spends $576,000 annually on disaster recovery (see Table 1). Thus, on disaster recovery, HCC spends $1.92 annually per student and staff member, which is approximately 7.46% of its annual IT budget, and 0.27% of its total annual budget. While these figures are consistent with conservative budgetary principles, it must be noted that as a not-for-profit educational institution, the budgeting priorities of HCC may not always translate to a fruitful strategy at a for-profit institution.
Disasters are unavoidable and come with the likelihood that important data maintained by business and universities will be irretrievable. Identification of critical data and a clear plan for recovering and restoring the data is essential. Establishing the necessary steps to take place after a disaster will allow businesses and universities to enter a disaster state with confidence and direction.
On September 13, 2008, Houston was ravaged by Hurricane Ike, the third most destructive hurricane in U.S. history. With winds gusting to 100 mph, the 600-mile-wide storm caused widespread power outages. The power outage at HCC disrupted the college's primary data center. However, HCC's system was equipped to cope with such a situation; the college immediately shifted the Internet Protocol (IP) address of the primary data center to the secondary data center, and operations thus continued with minimum disruption.
HCC's implementation of the Disaster Recovery Plan using Oracle Data Guard at the CyrusOne data center ensured that the goal of replicating the data and IT processes and procedures of critical applications was met. The DRP implementation helped reach the goal of having the IT infrastructure operating within 12 hours after a disaster, and day-to-day business operating within 24 hours after a disaster. With Oracle Data Guard, the amount of data loss is limited; the data replication capability is more efficient, less expensive, and better optimized for data protection and disaster recovery than traditional tape backup solutions. The successful testing results have prompted management to look into implementing a Business Continuity Plan for the entire college system.
Balaouras, S. (2007). The State of DR Preparedness, Forrester Research and the Disaster Recovery Journal. Retrieved April 5, 2008, from the World Wide Web: https://h30406.www3.hp.com/campaigns/2007/events/dora/images/webinar-slides-webcast3.pdf
Bates, L.F., & Peacock, G. (1987). Disasters and social change. p.227 in R.R. Dynes, B.de Marchi, and C. Pelanda (Eds.) Sociology of Disaster: Contribution of Sociology to Disaster Research. Milano: Franco Angeli.
BEC Technologies. (2008). When should each aspect of disaster recovery be initiated? Retrieved April 17, 2008, from the World Wide Web: http://www.bectechs.com/media/DRdecisionmodel.pdf.
Business Continuity Institute (BCI). (2002). BCM: A strategy for business survival. Retrieved March 13, 2008, from the World Wide Web: http://www.thebci.org/BCI%20GPG%20-%20Introduction.pdf
Business Continuity Institute website. (2003). Good Practice Guide to Business Continuity Management. Retrieved March 14, 2008, from the World Wide Web: www.thebci.org.
Forrester Research and Disaster Recovery Journal. (2007). The State of DR Preparedness. Retrieved February 4, 2008, from the World Wide Web: http://www.drj.com/index.php? Itemid=159&ed=10&id=794&option=com_content&task=view
Greyzdorf, N. (2007). NetApp: Achieving Cost--Effective Disaster Recovery Readiness November 2007.
Lewis, S. (2006). Getting your Disaster Recovery plan going--(Without 'Destroying" Your Budget!). Retrieved February 4, 2008, from the World Wide Web: http://www.edwardsinformation.com/ articles/GETTING%20YOUR%20DISASTER%20RECOVERY%20PLAN%20GOING.asp.
Nigg, M. J., (1995). Disaster Recovery as a Social Process. Retrieved October 8, 2007 from the World Wide Web: http://www.udel.edu/DRC/Preliminary_Papers/.
Robb, D. (2005). Computerworld: Disaster Recovery: Are you ready for trouble? Retrieved November 18, 2007 from the World Wide Web: http://www.computerworld.com/securitytopics/ security/recovery/story/0,10801,101249p3,00.html
Robb, D. (2005). Affording Disaster Recovery. Retrieved March 3, 2008, from the World Wide Web: www.cioupddate.com
Savage, M. (2002). Business continuity planning, Work Study. Vol. 51, No. 5; pp.254-261.
Susanto, L. (2003). Business Continuity / Disaster Recovery Planning. Retrieved June 5, 2008, from the World Wide Web: http://www.susanto.id.au/papers/bcdrp10102003.asp
Toigo, W. J., (2002). Disaster Recovery Planning: Preparing for the Unthinkable, 3rd Edition, Retrieved May 1, 2008, from the World Wide Web: http://www.disastercenter.com/Rothstein/cd651.htm
Tootle, D. (2007). Disaster Recovery in Rural Communities: A Case Study of Southwest Louisiana. Retrieved February 7, 2008, from the World Wide Web: http://www.ag.auburn.edu/auxiliary/srsa/pages/ Articles/SRS%202007%2022%202%206-27.pdf
Adnan Omar, Southern University at New Orleans
David Alijani, Southern University at New Orleans
Roosevelt Mason, Houston Community College
Table 1: Cost estimates for data backup and storage per annum for HCC

Parameter                                                      HCC
Backup Tapes                                                   $15,000
Offsite Data Storage                                           $5,000
Hardware maintenance                                           $6,000
Software maintenance                                           $500,000
Software Purchase for safeguarding the data                    $50,000
Number of employees and students the data storage can serve    300,000
Total                                                          $576,000