Microsoft Multi-Site Failover Cluster for DR & Business Continuity

Not every organisation looses millions of dollar per second but some does. An organisation may not loose millions of dollar per second but consider customer service and reputation are number one priority. These type of business wants their workflow to be seamless and downtime free. This article is for them who consider business continuity equals money well spent. Here is how it is done:

Multi-Site Failover Cluster

Microsoft Multi-Site Failover Cluster is a group of Clustered Nodes distribution through multiple sites in a region or separate region connected with low latency network and storage. As per the diagram illustrated below, Data Center A Cluster Nodes are connected to a local SAN Storage, while replicated to a SAN Storage on the Data Center B. Replication is taken care by a identical software defined storage on each site.  Software defined storage will replicate volumes or Logical Unit Number (LUN) from primary site in this example Data Center A to Disaster Recovery Site B. Microsoft Failover cluster is configured with pass-through storage i.e. volumes and these volumes are replication to DR site. In the Primary and DR sites, physical network is configured using Cisco Nexus 7000. Data network and virtual machine network are logically segregated in Microsoft System Center VMM and physical switch using virtual local area network or VLAN.  A separate Storage Area Network (SAN) is created in each site with low latency storage. Volumes of pass-through storage are replicated to DR site using identical size of volumes.

image

                                     Figure: Highly Available Multi-site Cluster

image

                           Figure: Software Defined Storage in Each Site

 Design Components of Storage:

  • SAN to SAN replication must be configured correctly
  • Initial must be complete before Failover Cluster is configured
  • MPIO software must be installed on the cluster Nodes (N1, N2…N6)
  • Physical and logical multipathing must be configured
  • If Storage is presented directly to virtual machines or cluster nodes then NPIV must configured on the Fabric Zones.
  • All Storage and Fabric Firmware must up to date with manufacturer latest software
  • An identical software defined storage must be used on the both sites 
  • If a third party software is used to replicate storage between sites then storage vendor must be consulted before the replication. 

Further Reading:

Understanding Software Defined Storage (SDS)

How to configure SAN replication between IBM Storwize V3700 systems

Install and Configure IBM V3700, Brocade 300B Fabric and ESXi Host Step by Step

Application Scale-out File Systems

Design Components of Network:

  • Isolate management, virtual and data network using VLAN
  • Use a reliable IPVPN or Fibre optic provider for the replication over the network
  • Eliminate all single point of failure from all network components
  • Consider stretched VLAN for multiple sites 

Further Reading:

Understanding Network Virtualization in SCVMM 2012 R2

Understanding VLAN, Trunk, NIC Teaming, Virtual Switch Configuration in Hyper-v Server 2012 R2

Design failover Cluster Quorum

  • Use Node & File Share Witness (FSW) Quorum for even number of Cluster Nodes
  • Connect File Share Witness on to the third Site
  • Do not host File Share Witness on a virtual machine on same site
  • Alternatively use Dynamic Quorum

Further Reading:

Understanding Dynamic Quorum in a Microsoft Failover Cluster

Design of Compute

  • Use reputed vendor to supply compute hardware compatible with Microsoft Hyper-v
  • Make sure all latest firmware updates are applied to Hyper-v host
  • Make manufacture provide you with latest HBA software to be installed on Hyper-v host

Further Reading:

Windows Server 2012: Failover Clustering Deep Dive Part II

Implementing a Multi-Site Failover Cluster

Step1: Prepare Network, Storage and Compute

Understanding Network Virtualization in SCVMM 2012 R2

Understanding VLAN, Trunk, NIC Teaming, Virtual Switch Configuration in Hyper-v Server 2012 R2

Install and Configure IBM V3700, Brocade 300B Fabric and ESXi Host Step by Step

Step2: Configure Failover Cluster on Each Site

Windows Server 2012: Failover Clustering Deep Dive Part II

Understanding Dynamic Quorum in a Microsoft Failover Cluster

Multi-Site Clustering & Disaster Recovery

Step3: Replicate Volumes

How to configure SAN replication between IBM Storwize V3700 systems

How to create a Microsoft Multi-Site cluster with IBM Storwize replication

Use Cases:

Use case can be determined by current workloads and future workloads plus business continuity. Deploy Veeam One to determine current workloads on your infrastructure and propose a future workload plus business continuity.  Here is a list of use cases of multi-site cluster.

  • Scale-Out File Server for application data-  To store server application data, such as Hyper-V virtual machine files, on file shares, and obtain a similar level of reliability, availability, manageability, and high performance that you would expect from a storage area network. All file shares are simultaneously online on all nodes. File shares associated with this type of clustered file server are called scale-out file shares. This is sometimes referred to as active-active.

  • File Server for general use – This type of clustered file server, and therefore all the shares associated with the clustered file server, is online on one node at a time. This is sometimes referred to as active-passive or dual-active. File shares associated with this type of clustered file server are called clustered file shares.

  • Business Continuity Plan

  • Disaster Recovery Plan

  • DFS Replication Namespace for Unstructured Data i.e. user profile, home drive, Citrix profile

  • Highly Available File Server Replication 

How to configure SAN replication between IBM Storwize V3700 systems

The Metro Mirror and Global Mirror Copy Services features enable you to set up synchronous and asynchronous replication between two volumes between two separate IBM storage, so that updates are made by an application to one volume in one storage systems in prod site are mirrored on the other volume in anther storage systems in DR site.

  • The Metro Mirror feature provides a synchronous-replication. When a host writes to the primary volume, it does not receive confirmation of I/O completion until the write operation has completed for the copy on both the primary volume and the secondary volume. This ensures that the secondary volume is always up-to-date with the primary volume in the event that a failover operation must be performed. However, the host is limited to the latency and bandwidth limitations of the communication link to the secondary volume.
  • The Global Mirror feature provides an asynchronous-replication. When a host writes to the primary volume, confirmation of I/O completion is received before the write operation has completed for the copy on the secondary volume. If a failover operation is performed, the application must recover and apply any updates that were not committed to the secondary volume. If I/O operations on the primary volume are paused for a small length of time, the secondary volume can become an exact match of the primary volume.

Prerequisites:

  1. Both systems are connected via dark fibre, L2 MPLS or IPVPN for replication over IP
  2. Both systems are connected via fibre for replication over FC.
  3. Both systems have up to date and latest firmware.
  4. Easy Tier and SSD installed in both systems.
  5. Remote Copy license activated in both systems.
  6. Volumes are identical in Prod and DR SAN.

Configure Metro Mirror in IBM v3700 Systems

Step1: Activate License

Log on to IBM V3700>Settings>System>Licensing

Activate Remote Copy and Easy Tier License.

    image

Step2: Configure Ethernet Ports & iSCSI in Production SAN and DR SAN

Both systems will communicate via management network but volume will be replicated via Ethernet if remote copy is configured to use replication over Ethernet. This step is necessary for Metro Mirror over Ethernet. Skip this step if you are using FC.

Log on to Production IBM v3700 systems. Settings>Network>Ethernet Ports. Right Click on Node1 Port 2> Configure Copy Group1 and Copy Group2. Assign IP address, Enable iSCSI, Select Copy Group1. Repeat to Create Copy Group2.

image

image

Repeat the step to configure Copy Groups in DR SAN.  
Note: TCP/IP assigned in DR SAN can be from same subnet of production SAN or can be different than production subnet as long as both subnets can communicate with each other.  
Step3: Create Partnership in Prod & DR SAN
Log on to Production V3700>Copy Services>Partnerships>Create Partnership>Add NetBIOS Name and Management IP of DR SAN

clip_image001

Fully Configured Indicates that the partnership is defined on the local and remote systems and is started.

image 

Initial synchronization bandwidth is 2048MBps but once I take the DR storage to DR site. I will modify that to 1024MBps. Initial Synchronization will take place in prod site. You can have your own bandwidth specification.

Log on to DR V3700>Copy Services>Partnerships>Create Partnership>Add NetBIOS Name and Management IP of Prod SAN

clip_image002

Step4: Create Relationships between Volume

log on to Prod SAN>Copy Services>Remote Copy>Add Relationship>Select Metro Mirror

image

image

image

Specify DR SAN as Auxiliary Systems where DR Volume is located. IBM should have used word “Master/Slave” or “Prod/DR systems”.   

image

Add identical volume>Click Next.

image

image

Step5: Monitor Performance and Copy Services

image

image

Consistency group states

Consistent (synchronized) – The primary volumes are accessible for read and write I/O operations. The secondary volumes are accessible for read-only I/O operations.

Inconsistent (copying) – The primary volumes are accessible for read and write I/O operations, but the secondary volumes are not accessible for either operation. This state is entered after the startrcconsistgrp command is issued to a consistency group in the InconsistentStopped state. This state is also entered when the startrcconsistgrp command is issued, with the force option, to a consistency group in the Idling or ConsistentStopped state.

The background copy bandwidth can affect foreground I/O latency in one of three ways:

  • If the background copy bandwidth is set too high for the intersystem link capacity, the following results can occur:
    • The intersystem link is not able to process the background copy I/Os fast enough, and the I/Os can back up (accumulate).
    • For Metro Mirror, there is a delay in the synchronous secondary write operations of foreground I/Os.
    • For Global Mirror, the work is backlogged, which delays the processing of write operations and causes the relationship to stop. For Global Mirror in multiple-cycling mode, a backlog in the intersystem link can congest the local fabric and cause delays to data transfers.
    • The foreground I/O latency increases as detected by applications.
  • If the background copy bandwidth is set too high for the storage at the primary site, background copy read I/Os overload the primary storage and delay foreground I/Os.
  • If the background copy bandwidth is set too high for the storage at the secondary site, background copy write operations at the secondary overload the secondary storage and again delay the synchronous secondary write operations of foreground I/Os.
    • For Global Mirror without cycling mode, the work is backlogged and again the relationship is stopped

image

image

Once volumes are synchronized you are ready to integrate storage with System Center 2012 R2.

Further Readings:

SAN Replication based Enterprise Grade Disaster Recovery with ASR and System Center

What’s New in 2012 R2: Cloud-integrated Disaster Recovery

Understanding IT Disaster Recovery Plan

Install and Configure IBM V3700, Brocade 300B Fabric and ESXi Host Step by Step

Understanding IT Disaster Recovery Plan

Disaster recovery (DR) involves a set of policies and procedures to enable the recovery or continuation of vital technology infrastructure and systems following a natural or human-induced disaster. Disaster recovery focuses on the IT or technology systems supporting critical business functions, as opposed to business continuity, which involves keeping all essential aspects of a business functioning despite significant disruptive events. Disaster recovery is therefore a subset of business continuity. Wiki reference

A disaster recovery plan (DRP) is a documented process or set of procedures to recover and protect a business IT infrastructure in the event of a disaster. Such plan, ordinarily documented in written form, specifies procedures an organization is to follow in the event of a disaster.

Given organizations’ increasing dependency on information technology to run their operations, a disaster recovery plan, sometimes erroneously called a continuity of operations plan (COOP), is increasingly associated with the recovery of information technology data, assets, and facilities.

Disaster Recover VS Business Continuity – Are you mixing up?

Business Continuity is different than a Disaster Recovery but linked together. Business Continuity is a practice enterprise adopt to protect business from a complete failure and wait for recovery. By adopting business continuity, you can be assured that your business will continuity to run in the event of disaster; until all systems are recovered from disaster.

There are many, many businesses that fail after a disaster without a BC plan. DR will get your hardware, software and apps back up and running, but without a business continuity plan to keep your company going during the recovery process, you might not have a reason to recover those items. BC involves your finances, your personnel, your emergency plans and everything else that is a necessity to keep going and serving.

An example of business continuity is that all your corporate inbound and outbound email will come and go via third party cloud based smart host that will store all your email up to 15~30 days but deliver inbound/outbound email straight away to your corporation which means in the event of disaster you will receive and send email from any devices that has internet connectivity. Once systems is restored, cloud based smart host will sync with on premise Exchange Server.

Disaster Recovery Terminology in Alphabetic Order

Alert – Notification that a potential disruption is imminent or has occurred; usually includes a directive to act or standby.

Application Recovery – The component of Disaster Recovery that deals specifically with the restoration of business system software and data after the processing platform has been restored or replaced.

DR Site – A site held in readiness for use during/following an invocation of business or disaster recovery plans to continue urgent and important activities of an organization.

DR Work Area – Recovery environment complete with necessary infrastructure (desk, telephone, workstation, and associated hardware and equipment, communications, etc)

Backlog – The amount of work that accumulates when a system or process is unavailable for a long period of time. This work needs to be processed once the system or process is available and may take a considerable amount of time to process.
A situation whereby a backlog of work requires more time to action than is available through normal working patterns. In extreme circumstances, the backlog may become so marked that the backlog cannot be cleared.

Backup – A process by which data (electronic or paper-based) and programs are copied in some form so as to be available and used if the original data from which it originated is lost, destroyed or corrupted.

Business Continuity – The strategic and tactical capability of the organization to plan for and respond to incidents and business disruptions in order to continue business operations at an acceptable predefined level.

Checklist Tool to remind and /or validate that tasks have been completed and resources are available, to report on the status of recovery. A list of items (names or tasks etc.) to be checked or consulted.

Contingency Plan An event specific preparation that is executed to protect an organization from certain and specific identified risks and/or threats.

Continuous Availability A system or application that supports operations which continue with little to no noticeable impact to the user. For instance, with continuous availability, the user will not have to re-log in, or to re-submit a partial or whole transaction.

Data Backups The copying of production files to media that can be stored both on and/or offsite and can be used to restore corrupted or lost data or to recover entire systems and databases in the event of a disaster.

Data Center Recovery- The component of Disaster Recovery which deals with the restoration of data center services and computer processing capabilities at an alternate location and the migration back to the production site.

Data Recovery- The restoration of computer files from backup media to restore programs and production data to the state that existed at the time of the last safe backup.

Database Replication- The partial or full duplication of data from a source database to one or more destination databases.

Disaster- Situation where widespread human, material, economic or environmental losses have occurred which exceeded the ability of the affected organization (2.2.9), community or society to respond and recover using its own resources. Source: ISO 2.1.11

Disaster Recovery- The process, policies and procedures related to preparing for recovery or continuation of technology infrastructure, systems and applications which are vital to an organization after a disaster or outage.

Hot site- An alternate facility that already has in place the computer, telecommunications, and environmental infrastructure required to recover critical business functions or information systems.

Impact- The effect, acceptable or unacceptable, of an event on an organization. The types of business impact are usually described as financial and non-financial and are further divided into specific types of impact.

Incident- An event which is not part of standard business operations which may impact or interrupt services and, in some cases, may lead to disaster.

Network Outage- An interruption of voice, data, or IP network communications.

Off-Site Storage Any place physically located a significant distance away from the primary site, where duplicated and vital records (hard copy or electronic and/or equipment) may be stored for use during recovery.

Outage- The interruption of automated processing systems, infrastructure, support services, or essential business operations, which may result, in the organizations inability to provide services for some period of time.

Recovery- Implementing the prioritized actions required to return the processes and support functions to operational stability following an interruption or disaster.

Replication– Copying a point of time, structured or unstructured data from between site(s)

Risk- Potential for exposure to loss which can be determined by using either qualitative or quantitative measures.

Recovery Point Objective- A recovery point objective, or “RPO”, is defined by business continuity planning. It is the maximum tolerable period in which data might be lost from an IT service due to a major incident. The RPO gives systems designers a limit to work to. Wiki Reference

Recovery Time Objective – The recovery time objective (RTO) is the targeted duration of time and a service level within which a business process must be restored after a disaster (or disruption) in order to avoid unacceptable consequences associated with a break in business continuity. Wiki Reference

Service Level Agreement (SLA)- A formal agreement between a service provider (whether internal or external) and their client (whether internal or external), which covers the nature, quality, availability, scope and response of the service provider. The SLA should cover day-to-day situations and disaster situations, as the need for the service may vary in a disaster.

System Recovery- The procedures for rebuilding a computer system and network to the condition where it is ready to accept data and applications, and facilitate network communications.

Validation Script- A set of procedures within the Business Continuity Plan to validate the proper function of a system or process before returning it to production operation.

Workaround Procedures- Alternative procedures that may be used by a functional unit(s) to enable it to continue to perform its critical functions during temporary unavailability of specific application systems, electronic or hard copy data, voice or data communication systems, specialized equipment, office facilities, personnel, or external services.

Developing a DR Strategy

Regarding disaster recovery strategies, ISO/IEC 27031, the global standard for IT disaster recovery, states, “Strategies should define the approaches to implement the required resilience so that the principles of incident prevention, detection, response, recovery and restoration are put in place.” Strategies define what you plan to do when responding to an incident, while plans describe how you will do it.

DR Objectives

  1. Reduce Overall Risk
  2. Maintain and Test Your Disaster Recovery Plan
  3. Alleviate Owner/Investor Concerns
  4. Restore Day-To-Day Operations
  5. Comply With Regulations
  6. Rapid Response

Priority Matrix

Priority Severity Impact
Priority 1 Highest High
Priority 2 Medium High Medium
Priority 3 Medium Medium
Priority 4 Medium low Low
Priority 5 Low Low

Who and What are involved in a Disaster Recovery

  1. People
  2. Physical facilities
  3. Technology
  4. Data (Structured & Unstructured)
  5. Third Party Vendor or Suppliers
  6. IT Governance(Policies & Procedures)

Producing a DR Document

A DR document consist of the following sections:

  1. Title
  2. Sub-Title
  3. Corporate Logo
  4. Document history.
  5. Corporate Copyright Info
  6. Table of Content
  7. Executive Summary
  8.  Introduction
  9. Terminology
  10. Roles and responsibilities.
  11. Third Party
  12. Technologies
  13. Site Diagram
  14. Incident response.
  15. Plan activation.
  16. Procedures.
  17. Appendixes.

In conclusion, once your DR plans have been completed, they are ready to be implemented. This process will determine whether business will recover and restore IT assets as planned. Remember, this is not about IT department, this is about business who wants to comply and understand importance of disaster recovery. You will only succeed if your business is willing to participate and invest CAPEX and OPEX on disaster recovery.