Disaster Recovery
Why This Matters
Your production web server will not boot. The GRUB bootloader is corrupted. You have customers waiting, your phone is ringing, and your boss is standing behind you. What do you do?
Or consider this: a filesystem has become corrupted after a power outage. The server comes back up, but /var will not mount. Your application logs, your database, your mail spool -- all on that filesystem. You need to get that data back.
Disaster recovery is not about preventing disasters -- that is what backups, RAID, and monitoring are for. DR is about what you do after something has gone catastrophically wrong. It is about having a plan, having the tools, and having the muscle memory to execute under pressure.
The time to practice DR is not during a disaster. The time to practice is now, in a lab, when nothing is on fire. This chapter walks you through the critical DR skills every Linux administrator needs: rescue media, GRUB recovery, filesystem repair, disk cloning, and restore procedures.
Try This Right Now
Prepare yourself for recovery by checking what you have available:
# Do you know your GRUB version?
$ grub-install --version 2>/dev/null || grub2-install --version 2>/dev/null
# Can you access a root shell from GRUB? (we will learn how)
# Check if you have rescue tools available
$ which fsck xfs_repair e2fsck 2>/dev/null
$ which ddrescue 2>/dev/null || echo "ddrescue not installed"
$ which clonezilla 2>/dev/null || echo "clonezilla not installed"
# Check your current boot setup
$ lsblk -o NAME,FSTYPE,MOUNTPOINT,SIZE
$ cat /etc/fstab
# Do you have a live USB ready? If not, make one today.
DR Planning Basics
Before any tools, you need a plan. Every organization should have a written DR plan that answers these questions:
┌──────────────────────────────────────────────────────────┐
│ DISASTER RECOVERY PLAN │
│ │
│ 1. WHAT could go wrong? │
│ - Disk failure │
│ - Filesystem corruption │
│ - Bootloader corruption │
│ - Accidental data deletion │
│ - Ransomware / security breach │
│ - Hardware failure (motherboard, PSU) │
│ - Natural disaster (fire, flood, earthquake) │
│ │
│ 2. WHAT is the impact of each? │
│ - Which services go down? │
│ - How many users are affected? │
│ - What is the financial cost per hour of downtime? │
│ │
│ 3. HOW do we recover from each? │
│ - Step-by-step procedures │
│ - Who is responsible? │
│ - What tools and media are needed? │
│ - Where are the backups? │
│ │
│ 4. HOW LONG can we afford to be down? │
│ → RTO (Recovery Time Objective) │
│ │
│ 5. HOW MUCH DATA can we afford to lose? │
│ → RPO (Recovery Point Objective) │
└──────────────────────────────────────────────────────────┘
RTO and RPO
These two metrics drive every DR decision:
RTO (Recovery Time Objective): The maximum acceptable time to restore service after a disaster. If your RTO is 4 hours, you need to be back online within 4 hours.
RPO (Recovery Point Objective): The maximum acceptable amount of data loss measured in time. If your RPO is 1 hour, you cannot lose more than 1 hour of data, which means you need backups at least hourly.
Timeline of a disaster:
Last backup           Disaster occurs            Service restored
     │                       │                          │
     ▼                       ▼                          ▼
─────●───────────────────────●──────────────────────────●─────────►
     │◄───────── RPO ───────►│                          │
     │      (data loss)      │◄────────── RTO ─────────►│
                             │        (downtime)        │
Example scenarios:
| System | RTO | RPO | Implication |
|---|---|---|---|
| Personal blog | 24 hours | 1 week | Daily backups, manual restore |
| E-commerce site | 1 hour | 15 minutes | Hot standby, frequent backups |
| Bank transaction system | Near zero | Zero | Active-active replication |
| Internal wiki | 8 hours | 24 hours | Daily backups, next-day restore |
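The RPO column translates directly into backup frequency. As a sketch, a cron schedule that would satisfy the e-commerce site's 15-minute RPO (the backup script name here is hypothetical):

```shell
# crontab entry (sketch): run an incremental backup every 15 minutes,
# so no more than 15 minutes of data can ever be lost
*/15 * * * * /usr/local/bin/backup-incremental.sh
```

The RTO, by contrast, is met by restore tooling and practice, not by schedule: it is bounded by how quickly you can execute the procedures in the rest of this chapter.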
Think About It: Your company's email server goes down. The boss says "get it back immediately" but has not approved budget for redundant servers. What questions should you ask to establish a realistic RTO and RPO?
Bootable Rescue Media
The first thing you need when a system will not boot is rescue media -- a bootable USB drive with tools to repair the system.
Creating a Bootable USB
# Download a rescue-focused distribution
# SystemRescue (formerly SystemRescueCd) is excellent for this:
# https://www.system-rescue.org/
# Write it to a USB drive
# FIRST: identify the USB device (be VERY careful here)
$ lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sda      8:0    0  500G  0 disk
├─sda1   8:1    0  512M  0 part /boot/efi
└─sda2   8:2    0  499G  0 part /
sdb      8:16   1   32G  0 disk              ← USB drive
└─sdb1   8:17   1   32G  0 part
WARNING: The `dd` command below will DESTROY all data on the target device. Triple-check the device name. Writing to the wrong device can wipe your system disk.
# Write the ISO to the USB drive
$ sudo dd if=systemrescue-11.00-amd64.iso of=/dev/sdb bs=4M status=progress conv=fsync
Alternatively, most Linux distributions' live ISOs work well as rescue media. Ubuntu, Fedora, and Debian live images all include fsck, chroot, and disk utilities.
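Whichever image you choose, verify the download before writing it: a corrupted ISO makes for useless rescue media. A sketch using `sha256sum` (a stand-in file is fabricated here so the commands run anywhere; substitute your real ISO and the `.sha256` file published on the download page):

```shell
# Sketch: verify an ISO against its published checksum before writing it
iso=/tmp/demo.iso
printf 'pretend ISO contents' > "$iso"
sha256sum "$iso" > "$iso.sha256"   # normally downloaded from the project site
sha256sum -c "$iso.sha256"         # reports OK when the hash matches
```

If the check fails, re-download the image rather than writing it and hoping.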
Booting Into Rescue Mode
When the system will not boot normally:
- Insert the rescue USB
- Enter BIOS/UEFI firmware settings (usually F2, F12, Del, or Esc at POST)
- Set USB as the first boot device
- Boot from USB
- You now have a working Linux environment with access to the broken system's disks
GRUB Recovery
GRUB (GRand Unified Bootloader) is the most common bootloader on Linux systems. When it breaks, the system will not boot at all.
Symptoms of GRUB Problems
- `grub rescue>` prompt (GRUB cannot find its configuration)
- `error: unknown filesystem` (GRUB cannot read the boot partition)
- `error: file not found` (kernel or initramfs missing)
- System boots to a black screen with a blinking cursor
Recovery from the GRUB Rescue Prompt
If you see grub rescue>, GRUB is loaded but cannot find its configuration:
# At the grub rescue prompt, find the boot partition
grub rescue> ls
(hd0) (hd0,msdos1) (hd0,msdos2)
grub rescue> ls (hd0,msdos1)/
./ ../ grub/ vmlinuz initrd.img
# Set the root and prefix
grub rescue> set root=(hd0,msdos1)
grub rescue> set prefix=(hd0,msdos1)/grub
# Load normal mode
grub rescue> insmod normal
grub rescue> normal
This should bring you to the normal GRUB menu. Once booted, fix it permanently.
Reinstalling GRUB from Live Media
Boot from rescue/live media, then:
# Identify your partitions
$ lsblk -f
NAME FSTYPE LABEL MOUNTPOINT
sda
├─sda1 vfat
├─sda2 ext4
└─sda3 ext4
# Mount the root filesystem
$ sudo mount /dev/sda2 /mnt
# If you have a separate /boot partition, mount it too
# (this example layout has none; substitute your own device)
# $ sudo mount /dev/sdaX /mnt/boot
# For UEFI systems, mount the EFI partition (the vfat partition, sda1 here)
$ sudo mount /dev/sda1 /mnt/boot/efi
# Mount essential virtual filesystems
$ sudo mount --bind /dev /mnt/dev
$ sudo mount --bind /dev/pts /mnt/dev/pts
$ sudo mount --bind /proc /mnt/proc
$ sudo mount --bind /sys /mnt/sys
# Chroot into the broken system
$ sudo chroot /mnt
# Now you are "inside" the broken system. Reinstall GRUB.
# For BIOS/MBR systems:
$ grub-install /dev/sda
$ update-grub
# For UEFI systems:
$ grub-install --target=x86_64-efi --efi-directory=/boot/efi
$ update-grub
# Exit chroot and reboot
$ exit
$ sudo umount -R /mnt
$ sudo reboot
Distro Note: On Fedora/RHEL, use `grub2-install` and `grub2-mkconfig -o /boot/grub2/grub.cfg` instead of `grub-install` and `update-grub`.
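If you are unsure which `grub-install` invocation applies, the firmware mode can be detected from the live system or inside the chroot; a sketch:

```shell
# /sys/firmware/efi exists only when the kernel was booted in UEFI mode
if [ -d /sys/firmware/efi ]; then
    echo "UEFI boot: use grub-install --target=x86_64-efi --efi-directory=/boot/efi"
else
    echo "BIOS boot: use grub-install /dev/sda"
fi
```

Note that this reports how the *rescue environment* was booted; if you booted the live USB in a different mode than the installed system uses, reboot the rescue media in the matching mode first.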
Filesystem Repair
Filesystem corruption can happen after power outages, kernel panics, or hardware failures. The repair tools depend on the filesystem type.
ext4 Repair with fsck
# IMPORTANT: Never run fsck on a mounted filesystem!
# Unmount first, or boot from rescue media.
# Check and repair an ext4 filesystem
$ sudo fsck.ext4 -f /dev/sda2
e2fsck 1.47.0 (5-Feb-2023)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/sda2: 45231/3276800 files (0.5% non-contiguous), 982145/13107200 blocks
# Automatically fix all problems (use with caution)
$ sudo fsck.ext4 -fy /dev/sda2
# Check without making changes (dry run)
$ sudo fsck.ext4 -n /dev/sda2
WARNING: Running `fsck` on a mounted filesystem can cause severe data corruption. Always unmount first or run from rescue media. The root filesystem can be checked by booting into single-user mode or from live media.
XFS Repair with xfs_repair
# XFS uses xfs_repair, not fsck
$ sudo xfs_repair /dev/sda3
# If xfs_repair fails, try clearing the log first
$ sudo xfs_repair -L /dev/sda3
# WARNING: -L destroys the log, which may lose recent data
# Check without modifying (dry run)
$ sudo xfs_repair -n /dev/sda3
Checking the Root Filesystem
Since you cannot unmount / while the system is running, force a check at next boot:
# Force fsck on next boot (ext4)
$ sudo touch /forcefsck
# Note: newer systemd releases ignore /forcefsck; on those systems, add
# fsck.mode=force to the kernel command line from the GRUB menu instead
# Or set the filesystem to require a check
$ sudo tune2fs -C 100 -c 1 /dev/sda2
# This sets the current mount count (100) above the maximum (1),
# so fsck runs at the next mount
Recovering Deleted Files
When a file is deleted, the data is not immediately erased -- only the directory entry and inode references are removed. The data blocks remain on disk until overwritten by new data.
Key Principles
- Stop writing to the filesystem immediately. Every write reduces the chance of recovery.
- Mount the filesystem read-only if possible.
- Work on a copy of the disk, not the original.
Tools for File Recovery
# extundelete - for ext3/ext4 filesystems
$ sudo apt install extundelete # Debian/Ubuntu
# Recover all recently deleted files
$ sudo extundelete /dev/sda2 --restore-all
# Recovered files appear in RECOVERED_FILES/
# Recover a specific file
$ sudo extundelete /dev/sda2 --restore-file home/user/important.txt
# testdisk - for multiple filesystem types
$ sudo apt install testdisk
# Run testdisk interactively
$ sudo testdisk /dev/sda
# Follow the menu: Analyze → Quick Search → list files
# photorec - recovers files by signature (works even on damaged filesystems)
$ sudo photorec /dev/sda
# Recovers files by type (photos, documents, etc.)
# Does NOT preserve filenames
Think About It: Why does "stop writing to the filesystem" matter for file recovery? What happens to the deleted file's data blocks when new files are written?
Disk Cloning
Disk cloning creates an exact, bit-for-bit copy of a disk. This is essential for DR, migration, and forensic analysis.
Cloning with dd
# Clone entire disk to another disk
$ sudo dd if=/dev/sda of=/dev/sdb bs=64K status=progress conv=noerror,sync
# if = input file (source)
# of = output file (destination)
# bs = block size (64K is a good balance)
# status=progress = show progress
# conv=noerror = continue past read errors
# conv=sync = pad read errors with zeros
# Clone disk to an image file
$ sudo dd if=/dev/sda of=/backup/server-image-$(date +%Y%m%d).img \
bs=64K status=progress
# Compress the image (disks have lots of empty space)
$ sudo dd if=/dev/sda bs=64K status=progress | gzip -c > /backup/server.img.gz
# Restore from compressed image
$ gunzip -c /backup/server.img.gz | sudo dd of=/dev/sda bs=64K status=progress
WARNING: `dd` does not ask for confirmation. Swapping `if` and `of` will overwrite your source disk with the contents of the destination. Triple-check your command before pressing Enter.
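A clone you have not verified is a hope, not a backup. The check below is sketched on a scratch file standing in for a real device, so it can be practiced safely; substitute the actual source device and image path:

```shell
# Create a scratch "device", clone it, then confirm the copy is byte-identical
src=/tmp/src.bin
img=/tmp/src.img
dd if=/dev/urandom of="$src" bs=1K count=64 2>/dev/null
dd if="$src" of="$img" bs=64K 2>/dev/null
cmp "$src" "$img" && echo "clone verified"
```

For compressed images, decompress to a pipe and compare checksums instead, since `cmp` needs both inputs in raw form.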
ddrescue: For Failing Disks
When a disk is failing with read errors, standard dd may hang or fail. ddrescue is designed for exactly this situation -- it copies what it can, skips bad sectors, and returns to retry them later.
# Install ddrescue
$ sudo apt install gddrescue # Debian/Ubuntu (note: gddrescue, not ddrescue)
$ sudo dnf install ddrescue # Fedora/RHEL
# Clone a failing disk (first pass: quick copy, skip errors)
$ sudo ddrescue -d -r0 /dev/sda /backup/rescue.img /backup/rescue.log
# -d = direct access (bypass kernel cache)
# -r0 = do not retry bad sectors yet
# The log file tracks which sectors were read
# Second pass: retry bad sectors
$ sudo ddrescue -d -r3 /dev/sda /backup/rescue.img /backup/rescue.log
# -r3 = retry bad sectors 3 times
# The log file ensures already-read sectors are skipped
The log file (newer ddrescue versions call it a map file) is critical -- it lets you stop and resume the rescue operation, and it ensures ddrescue does not re-read sectors it already copied successfully.
Clonezilla: Disk Imaging Made Easy
Clonezilla is a partition and disk imaging/cloning program -- the open-source counterpart of commercial tools such as Norton Ghost or Acronis True Image.
# Clonezilla is typically booted from a live USB
# Download from: https://clonezilla.org/
# Key Clonezilla features:
# - Clone disk to disk (device-to-device)
# - Clone disk to image (device-to-image file)
# - Multicasting (image one disk to many machines)
# - Supports ext4, XFS, Btrfs, NTFS, FAT32, and more
# - Only copies used blocks (much faster than dd)
Clonezilla works through a text-based menu system. Boot from the Clonezilla USB and follow the prompts. For scripted/automated cloning, Clonezilla provides a command-line interface as well.
Restoring from Backups
Having backups is only half the equation. You need to know how to restore from them under pressure.
Restoring from tar
# Full system restore from tar backup
# Boot from rescue media first, then:
# Mount the target filesystem
$ sudo mount /dev/sda2 /mnt
# Extract the backup
$ sudo tar -xzf /backup/full-system-20250118.tar.gz -C /mnt
# Restore GRUB
$ sudo mount --bind /dev /mnt/dev
$ sudo mount --bind /proc /mnt/proc
$ sudo mount --bind /sys /mnt/sys
$ sudo chroot /mnt
$ grub-install /dev/sda
$ update-grub
$ exit
# Unmount and reboot
$ sudo umount -R /mnt
$ sudo reboot
Restoring from borg
# Boot from rescue media with borg installed, then:
# Mount the target filesystem
$ sudo mount /dev/sda2 /mnt
# Mount the backup storage
$ sudo mount /dev/sdb1 /backup
# List available archives
$ borg list /backup/borg-repo
home-20250118-1430 Sat, 2025-01-18 14:30:00
home-20250119-1430 Sun, 2025-01-19 14:30:00
system-20250118-0200 Sat, 2025-01-18 02:00:00
# Restore the system archive
$ cd /mnt
$ borg extract /backup/borg-repo::system-20250118-0200
# Fix bootloader, fstab, etc.
$ sudo chroot /mnt
$ grub-install /dev/sda
$ update-grub
$ exit
Restoring from rsync
# rsync backups are just files, so restoration is straightforward
$ sudo mount /dev/sda2 /mnt
$ sudo rsync -avh /backup/system/ /mnt/
$ sudo chroot /mnt
$ grub-install /dev/sda
$ update-grub
$ exit
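After any full restore onto new disks, the UUIDs recorded in the restored /etc/fstab may no longer match the actual hardware, which produces exactly the "Failed to mount" boot failures covered later. A cross-check sketch, demonstrated against a fabricated fstab so it can be run anywhere (in a real recovery, point it at /mnt/etc/fstab from the rescue system):

```shell
# Check every UUID referenced in an fstab against what blkid can actually find
fstab=/tmp/fstab.demo
cat > "$fstab" <<'EOF'
UUID=0000-0000 /boot/efi vfat defaults 0 2
UUID=11111111-2222-3333-4444-555555555555 / ext4 defaults 0 1
EOF
grep -o 'UUID=[^ ]*' "$fstab" | cut -d= -f2- | while read -r u; do
    if blkid -U "$u" >/dev/null 2>&1; then
        echo "OK      $u"
    else
        echo "MISSING $u"   # this entry would fail to mount at boot
    fi
done
```

Fix any MISSING entries (with `blkid` output from the new disks) before rebooting, not after the boot fails.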
Documenting Recovery Procedures
A DR plan that exists only in someone's head is not a plan. Document everything.
What to Document
┌──────────────────────────────────────────────────────────┐
│ DR DOCUMENTATION CHECKLIST │
├──────────────────────────────────────────────────────────┤
│ │
│ For each critical server, document: │
│ │
│ □ Server name, IP, role, and owner │
│ □ Disk layout (lsblk output, /etc/fstab) │
│ □ LVM configuration (pvs, vgs, lvs output) │
│ □ RAID configuration (mdadm --detail output) │
│ □ Partition table (fdisk -l output) │
│ □ Installed packages list │
│ □ Custom configuration files locations │
│ □ Backup location, schedule, and retention │
│ □ Step-by-step restore procedure │
│ □ Service startup order and dependencies │
│ □ Contact information for escalation │
│ □ Estimated recovery time │
│ │
│ Store documentation: │
│ - In the backup itself │
│ - In a wiki/shared document │
│ - Printed copy in a secure location │
│ │
└──────────────────────────────────────────────────────────┘
Capturing System State for DR
Create a script that captures all the information you would need to rebuild a system:
#!/bin/bash
# dr-capture.sh - Capture system state for disaster recovery
DIR="/root/dr-docs"
mkdir -p "$DIR"
DATE=$(date +%Y%m%d)
echo "=== DR State Capture: $(hostname) - $(date) ===" > "$DIR/dr-info-${DATE}.txt"
# Disk layout
lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT >> "$DIR/dr-info-${DATE}.txt"
echo "---" >> "$DIR/dr-info-${DATE}.txt"
# Partition tables
fdisk -l >> "$DIR/dr-info-${DATE}.txt" 2>/dev/null
echo "---" >> "$DIR/dr-info-${DATE}.txt"
# LVM (if used)
pvs >> "$DIR/dr-info-${DATE}.txt" 2>/dev/null
vgs >> "$DIR/dr-info-${DATE}.txt" 2>/dev/null
lvs >> "$DIR/dr-info-${DATE}.txt" 2>/dev/null
echo "---" >> "$DIR/dr-info-${DATE}.txt"
# RAID (if used)
cat /proc/mdstat >> "$DIR/dr-info-${DATE}.txt" 2>/dev/null
echo "---" >> "$DIR/dr-info-${DATE}.txt"
# Fstab
cp /etc/fstab "$DIR/fstab-${DATE}"
# Network
ip addr > "$DIR/network-${DATE}.txt"
ip route >> "$DIR/network-${DATE}.txt"
cat /etc/resolv.conf >> "$DIR/network-${DATE}.txt"
# Installed packages (only run the package manager that exists; otherwise
# the second command would truncate the file written by the first)
if command -v dpkg >/dev/null 2>&1; then
    dpkg --get-selections > "$DIR/packages-${DATE}.txt"   # Debian/Ubuntu
elif command -v rpm >/dev/null 2>&1; then
    rpm -qa > "$DIR/packages-${DATE}.txt"                 # RHEL/Fedora
fi
echo "DR state captured to $DIR"
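Documentation that is six months stale is nearly as dangerous as none. A sketch of running the capture weekly from cron (the installed path is an assumption):

```shell
# /etc/cron.d/dr-capture (sketch): refresh DR documentation every Sunday 03:00
0 3 * * 0 root /usr/local/sbin/dr-capture.sh
```

Make sure the output directory is included in your backups; captured state that lives only on the dead server helps no one.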
DR Drills
Practice makes recovery possible. Schedule regular DR drills:
Drill 1: Boot Recovery
- Boot a VM from rescue media
- Deliberately break GRUB (rename /boot/grub/grub.cfg)
- Reboot and fix it using the GRUB rescue prompt or live media
- Time yourself. Can you do it in under 15 minutes?
Drill 2: Filesystem Repair
- Create a VM with test data
- Force a power-off (simulate a crash)
- Boot from rescue media
- Run `fsck` and repair the filesystem
- Verify data integrity
Drill 3: Full System Restore
- Back up a VM using borg or tar
- Delete the VM (or create a new empty VM)
- Restore from backup using rescue media
- Boot the restored system
- Verify all services are running
Drill 4: Partial Restore
- Back up a database directory
- Delete the database files
- Restore only the database from backup
- Start the database and verify data integrity
Think About It: How often should DR drills be performed? What is the cost of a drill versus the cost of discovering your DR plan does not work during an actual disaster?
Debug This
A server will not boot after a power outage. It drops to an emergency shell with:
[FAILED] Failed to mount /var.
[DEPEND] Dependency failed for Local File Systems.
You are in emergency mode. ...
Give root password for maintenance:
You enter the root password and get a shell. What do you do?
Step-by-step diagnosis and repair:
# Check what filesystem /var is
$ grep /var /etc/fstab
/dev/mapper/vg_sys-lv_var /var ext4 defaults 0 2
# Try to check and repair it
$ fsck.ext4 -f /dev/mapper/vg_sys-lv_var
# If it asks "Fix?" answer yes (or use -y flag)
# If fsck finds and fixes errors:
$ mount /var
$ exit # or Ctrl+D to continue normal boot
# If the filesystem is severely damaged:
# Mount read-only first, then copy critical data
$ mount -o ro /dev/mapper/vg_sys-lv_var /var
# If the LV itself is damaged, check LVM
$ vgscan
$ vgchange -ay
$ lvscan
┌──────────────────────────────────────────────────────────┐
│ What Just Happened? │
├──────────────────────────────────────────────────────────┤
│ │
│ DR is about recovery AFTER something breaks: │
│ │
│ Planning: │
│ - Define RTO (max downtime) and RPO (max data loss) │
│ - Document everything: disk layout, configs, procedures │
│ - Practice with DR drills │
│ │
│ Recovery tools: │
│ - Rescue USB: SystemRescue or any Linux live image │
│ - GRUB repair: chroot + grub-install + update-grub │
│ - Filesystem repair: fsck (ext4) / xfs_repair (XFS) │
│ - File recovery: extundelete, testdisk, photorec │
│ - Disk cloning: dd, ddrescue, Clonezilla │
│ │
│ Critical rules: │
│ - Never fsck a mounted filesystem │
│ - Stop writing when recovering deleted files │
│ - Triple-check dd device names │
│ - Test restores regularly │
│ - Document procedures BEFORE you need them │
│ │
└──────────────────────────────────────────────────────────┘
Try This
- Rescue media: Create a bootable USB with SystemRescue or your distribution's live ISO. Boot from it and explore the tools available.
- GRUB recovery: In a VM, rename /boot/grub/grub.cfg to simulate a GRUB failure. Reboot and recover from the GRUB rescue prompt. Then boot from live media and reinstall GRUB properly.
- Filesystem repair: In a VM, create an ext4 filesystem on a loop device, write some files, then corrupt it slightly with `dd if=/dev/urandom of=/dev/loopX bs=1 count=100 seek=1024`. Run `fsck` and observe the repair process.
- Disk cloning: Clone a small partition with `dd` to an image file. Mount the image file using a loop device and verify the contents are identical to the original.
- Bonus challenge: Perform a complete DR drill: back up a VM with borg, destroy the VM, create a new VM from scratch, restore from the borg backup, fix the bootloader, and boot the restored system. Document every step you take and time the entire process.