
Infrastructure Recovery

Infrastructure recovery addresses failures in physical computing resources: hypervisors that host virtual machines, storage arrays that hold data, and utilities (power, cooling, network connectivity) that sustain operations. These failures differ from cloud service outages in that they involve hardware you control, require physical access for diagnosis, and demand coordination with facilities management and hardware vendors.

The procedures in this playbook apply to on-premises data centres, server rooms, and co-location facilities. Field offices with minimal infrastructure follow abbreviated procedures in the final section. Cloud infrastructure failures are addressed in the Cloud Failover playbook.

Activation criteria

Invoke this playbook when any of the following conditions exist:

| Condition | Specific indicators | Threshold |
| --- | --- | --- |
| Hypervisor failure | Host unresponsive, VMs inaccessible, management console shows host offline | Single host affecting 5+ VMs, or any host running critical workloads |
| Storage array failure | Array offline, degraded RAID with multiple disk failures, controller failure | Any condition preventing data access or risking data loss |
| Power failure | UPS on battery, generator failed to start, utility power loss exceeding UPS capacity | Battery runtime below 30 minutes with no restoration in sight |
| Cooling failure | Data centre temperature exceeding 27°C, CRAC unit failure | Temperature rising and projected to exceed 35°C within 2 hours |
| Network connectivity loss | Upstream link failure, core switch failure, ISP outage | Complete loss of external connectivity or internal network segmentation |

Do not invoke this playbook for individual VM failures, single disk replacements in healthy RAID arrays, or brief power fluctuations handled automatically by UPS systems. These are standard operational incidents handled through Incident Management.

Roles

| Role | Responsibility | Typical assignee | Backup |
| --- | --- | --- | --- |
| Recovery commander | Overall coordination, escalation decisions, resource allocation | IT Manager or Infrastructure Lead | Senior Systems Administrator |
| Technical lead | Direct hands-on recovery, vendor coordination, technical decisions | Systems Administrator | Network Administrator |
| Communications lead | Stakeholder updates, user notification, leadership briefing | Service Desk Manager | IT Manager |
| Facilities coordinator | Physical access, power systems, cooling, building management liaison | Facilities Manager | Office Manager |

For organisations without dedicated facilities staff, the recovery commander assumes facilities coordination responsibilities. In single-person IT departments, the IT person serves as both recovery commander and technical lead, with a designated leadership contact (COO, Operations Director) handling communications.

Decision framework

Before beginning recovery procedures, determine which failure type you are addressing. Multiple simultaneous failures usually share a common cause, most often utility loss.

          +------------------+
          |  Infrastructure  |
          | Failure Detected |
          +---------+--------+
                    |
      +-------------+-------------+
      |             |             |
      v             v             v
+-----------+ +-----------+ +-----------+
| Hypervisor| |  Storage  | |  Utility  |
|  Failure  | |  Failure  | |  Failure  |
+-----+-----+ +-----+-----+ +-----+-----+
      |             |             |
      |             |       +-----+-----+
      |             |       |     |     |
      |             |       v     v     v
      |             |     +---+ +----+ +---+
      |             |     |Pwr| |Cool| |Net|
      |             |     +---+ +----+ +---+
      v             v             |
+-----------+ +-----------+       |
|  Phase 1  | |  Phase 1  |<------+
| Hypervisor| |  Storage  |  (after utility
|  Recovery | |  Recovery |   stabilised)
+-----------+ +-----------+

Utility failures take precedence. Restore power, cooling, and network connectivity before attempting hypervisor or storage recovery. Attempting to recover compute or storage systems during ongoing utility issues risks additional damage and wastes effort on systems that will fail again.

Phase 1: Immediate assessment

Objective: Determine failure scope, activate appropriate resources, prevent secondary damage.

Timeframe: 0-30 minutes from detection.

  1. Confirm the failure through multiple sources. Check monitoring dashboards, attempt direct console access, and verify physical indicators if safe to access the facility. A single monitoring alert without corroboration warrants investigation, not full playbook activation.

  2. Classify the failure type using the decision framework above. If multiple failure types are present, identify the root cause. Power failure causes both hypervisor and storage unavailability; address power first.

  3. Notify the recovery team using the emergency contact list. For after-hours incidents, use the on-call escalation path. State the failure type, affected systems, and current impact in the initial notification.

  4. Assess physical safety before entering server rooms or data centres. If fire suppression has activated, cooling has failed with temperatures exceeding 40°C, or electrical hazards exist, do not enter. Contact facilities management or emergency services as appropriate.

  5. Document the start time, initial symptoms, and any error messages visible in monitoring systems or console output. This information is critical for vendor support cases and post-incident review.

Checkpoint: You have confirmed the failure type, assembled the recovery team, and determined it is safe to proceed. If physical safety concerns exist, wait for facilities clearance before continuing.

Electrical safety

Never touch electrical equipment during active power events. If UPS units are beeping, generators are cycling, or you observe sparking or burning smells, evacuate the area and contact facilities or emergency services. Infrastructure can be rebuilt; personnel cannot.

Phase 2: Utility failure response

Skip this phase if the failure is limited to hypervisor or storage with utilities functioning normally.

Power failure response

Power failures cascade rapidly. UPS batteries provide 15-30 minutes of runtime under typical loads. This window exists to enable graceful shutdown, not extended operation.

  1. Check UPS status panels or management interfaces to determine remaining battery runtime. Record the current load percentage and estimated time remaining.

  2. Verify generator status if your facility has backup generation. Generators should auto-start within 30 seconds of power loss. If the generator has not started after 60 seconds, check the generator control panel for fault indicators.

  3. If generator start has failed and battery runtime is below 20 minutes, initiate graceful shutdown of non-critical systems. Prioritise keeping domain controllers, file servers, and database servers running longest.

  4. Contact your utility provider to report the outage and obtain an estimated restoration time. This information determines whether to attempt extended operation or proceed to full shutdown.

  5. If utility restoration will exceed UPS capacity and generator is unavailable, execute orderly shutdown:

    • Shut down application servers first
    • Shut down database servers (allowing transaction completion)
    • Shut down file servers
    • Shut down domain controllers last
    • Allow storage arrays to complete cache flush before power loss
  6. Once all systems are safely shut down, configure UPS units so that servers do not automatically restart when power returns. You want controlled startup, not uncoordinated boot storms.
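The shutdown decision in steps 3-5 can be sketched as a small function. This is an illustration only, not a substitute for operator judgment: the 20-minute threshold mirrors step 3, and the system names in the order list are placeholders for your own inventory.

```python
# Sketch of the power-failure shutdown decision: given remaining UPS
# battery runtime and generator state, decide whether to begin orderly
# shutdown and in what order. Thresholds and names are illustrative.

SHUTDOWN_ORDER = [
    "application servers",
    "database servers",      # allow transactions to complete first
    "file servers",
    "domain controllers",    # last, so authentication survives longest
    "storage arrays",        # cache flush after hosts are down
]

def shutdown_plan(battery_minutes: float, generator_running: bool) -> list[str]:
    """Return the ordered list of systems to shut down, or [] if none yet."""
    if generator_running:
        return []            # generator is carrying the load
    if battery_minutes >= 20:
        return []            # still within the decision window (step 3)
    return list(SHUTDOWN_ORDER)
```

The ordering encodes the dependency logic: anything that writes to a database shuts down before the database, and the storage arrays flush cache only after all hosts are down.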

Generator troubleshooting:

| Symptom | Likely cause | Immediate action |
| --- | --- | --- |
| No start attempt | Transfer switch failure, control fault | Manual transfer if trained; otherwise wait for technician |
| Cranks but won't start | Fuel issue, weak battery | Check fuel level; attempt manual start with boost if available |
| Starts then stops | Load transfer failure, overload | Reduce load by shutting down non-critical systems |
| Running but no power | Transfer switch stuck | Manual transfer switch operation (requires training) |

Cooling failure response

Data centre equipment generates substantial heat. Without cooling, temperatures rise approximately 1-2°C per minute in a dense server environment. You have 15-30 minutes before equipment begins thermal shutdown.

  1. Check current temperature readings from environmental sensors. Note the rate of temperature increase to estimate time before critical threshold (35°C for most equipment).

  2. Identify which cooling units have failed. Modern data centres have redundant CRAC or CRAH units; partial failure reduces capacity but allows continued operation at reduced load.

  3. If cooling is partially available, reduce heat load by shutting down non-critical systems. Each powered-off server reduces heat generation by 200-500W.

  4. Open data centre doors to adjacent spaces if those spaces have functioning HVAC. This is a temporary measure that violates physical security controls; document the decision and restore security when cooling is restored.

  5. For total cooling loss, initiate orderly shutdown following the same sequence as power failure. Equipment thermal protection will force shutdowns anyway; controlled shutdown prevents data corruption.

  6. Do not restart equipment until temperatures return below 24°C. Thermal stress from hot-starting equipment causes additional failures.
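The time-to-threshold estimate in step 1 is a linear extrapolation from the observed rate of rise. A minimal sketch of that arithmetic follows; real decisions should rely on environmental monitoring, since heat rise is rarely perfectly linear.

```python
# Rough time-to-threshold projection: given the current temperature and
# the observed rate of rise, estimate minutes until the 35 °C critical
# threshold named in step 1. Linear extrapolation only.

CRITICAL_C = 35.0

def minutes_to_critical(current_c: float, rise_c_per_min: float) -> float:
    """Minutes until CRITICAL_C at the observed rate (inf if not rising)."""
    if rise_c_per_min <= 0:
        return float("inf")
    remaining = CRITICAL_C - current_c
    return max(remaining, 0.0) / rise_c_per_min
```

At the 1-2°C per minute rates cited above, a room already at 27°C reaches the critical threshold in roughly four to eight minutes, which is why shutdown decisions cannot wait.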

Network connectivity loss

Network failures prevent remote management but do not directly threaten data or equipment. The urgency depends on business impact rather than infrastructure damage risk.

  1. Determine failure scope by testing connectivity at different network layers:

    • Can you ping the default gateway from a server? (LAN functional)
    • Can you ping external addresses by IP? (WAN functional, DNS may have failed)
    • Can you resolve external DNS names? (Full connectivity, application issue)
  2. Check physical layer indicators. Are switch port lights active? Are fibre transceivers showing link? Physical layer failures require hardware replacement or cable repair.

  3. For ISP failures affecting external connectivity, contact your ISP support line. Obtain a trouble ticket number and estimated restoration time.

  4. If you have redundant ISP connections, verify failover has occurred. Static routes or BGP peering issues can prevent automatic failover.

  5. For core switch failures, implement the switch replacement procedure from your hardware vendor. If a spare switch is available, restore configuration from backup or rebuild manually.

  6. Document the outage start time and symptoms for SLA claims with your ISP.
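The layered tests in step 1 amount to a small decision table: which probe fails first tells you which layer to investigate. A sketch of that interpretation logic follows; the probe results themselves would come from ping and DNS lookups, which are omitted here.

```python
# Interpreting the layered connectivity tests from step 1. Feed in which
# probes succeeded; get back the likely failure layer. Probe execution
# (ping, DNS lookup) is deliberately left out of this sketch.

def diagnose(gateway_ok: bool, external_ip_ok: bool, dns_ok: bool) -> str:
    if not gateway_ok:
        return "LAN failure: check local switching and cabling"
    if not external_ip_ok:
        return "WAN failure: check router, upstream link, ISP"
    if not dns_ok:
        return "DNS failure: connectivity is up, name resolution is not"
    return "Full connectivity: investigate the application layer"
```

The ordering matters: a dead gateway makes the later tests meaningless, so always test from the inside out.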

Checkpoint: Utilities are stable or you have determined that recovery cannot proceed until utilities are restored. If waiting for utility restoration, proceed to Phase 5 (Communications) to keep stakeholders informed.

Phase 3: Hypervisor recovery

This phase addresses recovery when virtualisation hosts fail while storage remains accessible. If storage has also failed, complete Phase 4 first.

Single host failure

A single hypervisor host failure affects only VMs running on that host. High availability clusters automatically restart VMs on surviving hosts if sufficient capacity exists.

  1. Check your hypervisor management console (vCenter, Proxmox, Hyper-V Manager) to confirm which host has failed and which VMs were running on it.

  2. Verify HA has attempted VM restart on other hosts. If VMs are shown as restarting or running on alternate hosts, HA is functioning. Allow 5-10 minutes for VMs to complete startup.

  3. If HA has not activated, check cluster configuration. Common causes of HA failure:

    • Insufficient resources on remaining hosts
    • HA disabled or misconfigured
    • Shared storage inaccessible (address storage first)
  4. For manual VM restart when HA is unavailable, use the management console to register and power on VMs on surviving hosts:

    • Identify VM configuration files on shared storage
    • Add VMs to inventory on a surviving host
    • Power on VMs in dependency order (infrastructure first, then applications)
  5. Investigate the failed host once VMs are recovered. Check management interface for hardware errors, review system event logs, and inspect physical indicators (amber lights, error LEDs).

  6. Open a support case with your hardware vendor if hardware failure is suspected. Provide system serial number, error codes, and diagnostic logs.

Host rebuild

When a hypervisor host requires complete rebuild due to boot drive failure or operating system corruption:

  1. Replace failed hardware components if the failure was hardware-related. Boot drive replacements typically require only the failed drive; do not replace functioning components speculatively.

  2. Install the hypervisor operating system using your standard build process. Match the version to other hosts in the cluster to ensure compatibility.

  3. Configure networking to match cluster requirements:

    • Management network connectivity
    • VM network connectivity (VLANs, virtual switches)
    • Storage network connectivity (if using iSCSI or NFS)
  4. Connect the host to shared storage. For fibre channel, zone the host to storage arrays. For iSCSI, configure initiators and connect to targets. For NFS, mount datastores.

  5. Add the host to the cluster through the management console. The host should automatically recognise shared storage containing VM files.

  6. Allow the cluster to rebalance VMs across hosts. If manual intervention is needed, migrate VMs from overloaded hosts to the rebuilt host.

Cluster recovery

Complete cluster failure (all hosts offline) requires methodical recovery in correct sequence.

+-------------------------------------------------------------------------+
|                        CLUSTER RECOVERY SEQUENCE                        |
+-------------------------------------------------------------------------+
|                                                                         |
|   1. Storage        2. Management      3. First          4. VM          |
|      Verification      Infrastructure     Host              Startup     |
|                                                                         |
|   +------------+    +------------+    +------------+    +----------+    |
|   | Verify     |    | DNS        |    | Boot first |    | Start    |    |
|   | storage    |--->| Domain     |--->| hypervisor |--->| VMs in   |    |
|   | accessible |    | controller |    | host       |    | order    |    |
|   +------------+    +------------+    +------------+    +----------+    |
|                                                                         |
|   Shared storage    These may run     First host        Priority:       |
|   must be online    on recovered      provides          1. Infra        |
|   before hosts      cluster; boot     management        2. Data         |
|   can access VMs    from backup if    interface         3. Apps         |
|                     needed                                              |
+-------------------------------------------------------------------------+

  1. Verify shared storage is accessible before attempting host recovery. Boot a host and confirm storage connectivity without starting VMs. If storage is inaccessible, address storage issues first (Phase 4).

  2. Boot the host containing your virtualisation management server (vCenter, Proxmox cluster manager) first. If the management server runs as a VM, boot its host and start only that VM initially.

  3. Once management is available, assess host status and storage connectivity across the cluster. Identify any hosts that cannot reconnect to storage.

  4. Start VMs in dependency order:

    • First: Domain controllers, DNS servers, DHCP servers
    • Second: Database servers, file servers
    • Third: Application servers
    • Fourth: Non-critical workloads
  5. Allow 2-3 minutes between starting each tier for services to initialise fully. Starting dependent services before their dependencies causes cascading failures.

  6. Verify application functionality after all VMs are running. Check authentication, database connectivity, and end-user access to applications.
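The tiered startup in steps 4-5 can be sketched as a loop that starts each tier, pauses for services to initialise, then moves on. The `start_vm` callable and the VM names are hypothetical placeholders for whatever your management API provides.

```python
# Dependency-ordered VM startup, per steps 4-5: start each tier, then
# pause (2-3 minutes per the playbook) before the next. VM names and
# the start_vm callable are illustrative placeholders.
import time

STARTUP_TIERS = [
    ["dc01", "dns01", "dhcp01"],   # first: infrastructure
    ["sql01", "files01"],          # second: data services
    ["app01", "app02"],            # third: application servers
    ["reports01"],                 # fourth: non-critical workloads
]

def start_in_order(start_vm, tiers=STARTUP_TIERS, settle_seconds=150):
    """Start VMs tier by tier, pausing between tiers; returns start order."""
    started = []
    for i, tier in enumerate(tiers):
        for vm in tier:
            start_vm(vm)
            started.append(vm)
        if i < len(tiers) - 1:
            time.sleep(settle_seconds)  # let the tier's services initialise
    return started
```

Keeping the order in data rather than in code makes it easy to review and update the dependency list during post-incident review.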

Checkpoint: VMs are running on available hosts. Proceed to Phase 5 if recovery is complete, or continue to Phase 4 if storage issues remain.

Phase 4: Storage recovery

Storage failures directly threaten data integrity. Every decision in this phase balances recovery speed against data preservation.

Array failover

Multi-controller arrays and replicated storage provide automatic failover. Your role is confirming failover succeeded and understanding the degraded state.

  1. Access the storage management interface. Identify which controller or array has failed and confirm the surviving controller has assumed ownership of volumes.

  2. Check volume status. Volumes should show as online with degraded redundancy rather than offline. Offline volumes indicate failover did not succeed.

  3. Verify host connectivity to storage. Multipath configurations should show reduced paths but maintained connectivity. Single-path configurations may require host-side reconfiguration.

  4. Assess performance impact. Running on a single controller typically reduces performance by 30-50%. Determine whether this is acceptable for current workloads or if load reduction is needed.

  5. Contact your storage vendor support with controller serial numbers and error codes. Arrange replacement hardware shipment and schedule installation.

  6. Do not attempt controller replacement while I/O is active. Schedule maintenance window for hardware swap.

Degraded RAID recovery

RAID arrays tolerate disk failures up to their redundancy limit. A RAID 6 array survives two disk failures; a third failure causes data loss.

| RAID level | Failure tolerance | Urgency at n failures |
| --- | --- | --- |
| RAID 1 | 1 disk | High: no redundancy remaining |
| RAID 5 | 1 disk | Critical: next failure causes data loss |
| RAID 6 | 2 disks | High at 1 failure, Critical at 2 |
| RAID 10 | 1 per mirror pair | Depends on which disks failed |
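The tolerance column above can be encoded as a simple lookup. This is a sketch only: RAID 10 is simplified to a single-disk tolerance, whereas in reality it depends on whether both halves of a mirror pair fail, as the table notes.

```python
# How many further disk failures each RAID level can absorb, based on
# the tolerance table. RAID 10 is deliberately simplified to the
# worst case (both disks of one mirror pair).

TOLERANCE = {"RAID 1": 1, "RAID 5": 1, "RAID 6": 2, "RAID 10": 1}

def failures_to_data_loss(level: str, failed: int) -> int:
    """Remaining failures the array can absorb (0 = next failure loses data)."""
    return max(TOLERANCE[level] - failed, 0)
```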
  1. Identify which disk or disks have failed using storage management interface. Note slot positions for physical replacement.

  2. Check if hot spare has automatically begun rebuilding. Modern arrays with hot spares initiate rebuild within minutes of failure detection.

  3. If no hot spare is available, obtain replacement disk with matching specifications. Using disks with different specifications risks compatibility issues.

  4. For physical disk replacement:

    • Identify the failed disk slot (use LED indicators)
    • Remove the failed disk (hot-swappable in most enterprise arrays)
    • Insert replacement disk in the same slot
    • Monitor rebuild initiation through management interface
  5. Monitor rebuild progress. Rebuild time depends on disk size and array load:

    • 1TB disk: 2-4 hours
    • 4TB disk: 8-16 hours
    • 10TB+ disk: 24-48 hours
  6. Reduce I/O load during rebuild if possible. Rebuild competes with production I/O, slowing both operations.
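The rebuild-time figures in step 5 roughly follow a 2-4 hours-per-terabyte rate, sketched below as a planning aid. Note that the table's 10TB+ estimate runs somewhat above this linear rate, because large disks rebuild more slowly per terabyte under load; treat the output as an order-of-magnitude estimate only.

```python
# Order-of-magnitude rebuild-time estimate from the step 5 figures,
# assuming a linear 2-4 hours per terabyte. Real rebuild time varies
# with array load, RAID level, and disk type.

def rebuild_hours(disk_tb: float,
                  hours_per_tb_low: float = 2.0,
                  hours_per_tb_high: float = 4.0) -> tuple[float, float]:
    """Return a (low, high) estimate in hours for rebuilding one disk."""
    return (disk_tb * hours_per_tb_low, disk_tb * hours_per_tb_high)
```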

Rebuild failure risk

Rebuild operations stress all remaining disks in the array. A second disk failing during rebuild causes data loss in RAID 5. For critical data on RAID 5 arrays, consider restoring from backup rather than trusting rebuild to complete successfully.

Data recovery from replicas

If primary storage is unrecoverable, restore from replicated storage or backup.

  1. Identify your most recent consistent replica. Synchronous replicas provide zero data loss; asynchronous replicas lag by their replication interval (typically 15-60 minutes).

  2. Promote the replica to primary if your storage system supports this operation. This makes the replica writable and typically breaks the replication relationship.

  3. Reconfigure hosts to access the promoted replica. Update iSCSI targets, FC zoning, or NFS mount points as appropriate for your storage protocol.

  4. Verify data consistency before resuming production. Check database integrity, file system health, and application data validation.

  5. Document data loss window if using asynchronous replication. Transactions committed after the last successful replication are lost. Notify affected application owners.

  6. Plan storage infrastructure rebuild. The replica is now your only copy; restore redundancy as soon as replacement hardware is available.
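The data-loss window documented in step 5 is simply the gap between the failure and the last successful replication. A minimal sketch of that calculation follows; in practice the last-replication timestamp comes from your array's management interface.

```python
# Data-loss window for asynchronous replication (step 5): transactions
# committed between the last successful replication and the failure
# are lost. Timestamps would come from the storage array in practice.
from datetime import datetime, timedelta

def data_loss_window(last_replication: datetime,
                     failure_time: datetime) -> timedelta:
    """Window of lost transactions (zero if replication was up to date)."""
    return max(failure_time - last_replication, timedelta(0))
```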

Checkpoint: Storage is accessible and data integrity is confirmed. If data loss occurred, document the extent and notify stakeholders.

Phase 5: Recovery validation

Before declaring recovery complete, verify all systems function correctly.

  1. Test authentication: Can users log in to domain-joined computers? Do SSO applications authenticate correctly?

  2. Test data access: Can users access file shares? Do database-dependent applications retrieve data?

  3. Test critical applications: Walk through key business processes with application owners. Confirm transaction processing, report generation, and external integrations function.

  4. Verify backup systems: Confirm backup agents are running and next scheduled backup will succeed. A recovery without functioning backup leaves you unprotected.

  5. Check monitoring systems: Confirm all recovered systems report to monitoring. Clear false alerts generated during the outage.

  6. Document recovery completion time and any data loss or residual issues.

Phase 6: Post-incident

  1. Notify stakeholders of recovery completion using the communication templates below.

  2. Complete hardware replacement for any failed components operating in degraded mode. Running indefinitely on degraded hardware invites repeat failures.

  3. Restore redundancy: rebuild RAID arrays, re-establish replication, restore HA cluster to full capacity.

  4. Schedule post-incident review within 5 business days. Include recovery team, affected application owners, and facilities management if utility failure was involved.

  5. Update documentation: Revise this playbook based on lessons learned. Update asset inventories if hardware was replaced.

  6. Submit insurance or warranty claims for replaced hardware.

Communications

Internal escalation

| Stakeholder | Timing | Channel | Owner |
| --- | --- | --- | --- |
| IT leadership | Immediate | Phone/SMS | Recovery commander |
| Executive leadership | Within 30 minutes | Phone | Communications lead |
| Affected department heads | Within 1 hour | Email + phone | Communications lead |
| All staff | Within 2 hours | Email | Communications lead |

Communication templates

Initial notification (within 30 minutes):

Subject: [INFRASTRUCTURE INCIDENT] [Type] - Systems unavailable
We are responding to [hypervisor failure / storage failure / power outage /
cooling failure / network outage] affecting [specific systems or all systems].
Current status: Assessment in progress
Estimated impact: [List affected services]
Current user impact: [What users cannot do]
Next update: Within [1 hour / 2 hours]
IT is actively working on recovery. Please do not attempt to restart
equipment or systems independently.
Contact: [Recovery commander name and phone]

Progress update (hourly during extended outages):

Subject: [INFRASTRUCTURE INCIDENT] Update [N] - [Status summary]
Current status: [Phase and progress description]
Services restored: [List any restored services]
Services still affected: [List remaining affected services]
Estimated restoration: [Time estimate if known, or "still assessing"]
Next update: [Time]
Thank you for your patience.

Resolution notification:

Subject: [RESOLVED] Infrastructure incident - All services restored
The infrastructure incident that began at [time] has been resolved.
All services are now operational.
Total duration: [X hours Y minutes]
Root cause: [Brief description]
Data impact: [None / Brief description of any data loss]
If you experience any issues accessing systems, please contact
the service desk at [contact details].
A detailed post-incident review will be conducted, and any process
improvements will be communicated to relevant teams.

Field office considerations

Field offices typically have minimal infrastructure: a router, a small switch, a few workstations, and possibly a local server or NAS device. Full data centre recovery procedures do not apply.

Power failure at field office:

  1. Confirm utility power status with building management or neighbours
  2. If UPS-protected equipment exists, check battery status
  3. Shut down servers and NAS devices gracefully before battery exhaustion
  4. For extended outages, staff can work on laptops with mobile data while awaiting restoration
  5. Verify equipment powers on correctly when utility is restored

Server or NAS failure at field office:

  1. Determine if locally stored data exists only at this location or is replicated/backed up centrally
  2. If data is replicated, provision replacement hardware and restore from central systems
  3. If data exists only locally and hardware has failed, evaluate data recovery options with IT leadership
  4. Consider whether local server is still needed or if cloud services could replace it

Network failure at field office:

  1. Test mobile data connectivity as alternative access path
  2. Contact local ISP for outage information
  3. If extended outage, consider temporary mobile hotspot for critical staff
  4. Document outage for SLA claims

See also