Chapter 35: Incident Response
"Everyone has a plan until they get punched in the mouth." --- Mike Tyson (and every incident response team at 2 AM on a Saturday)
The Call Nobody Wants to Get
Saturday, 2:17 AM. Your phone rings. It is the on-call engineer. "Hey, uh... the monitoring dashboard is showing something weird. All our file servers just started encrypting files. And the domain controller... I think it is down."
Your first instinct might be to pull the network cables and shut everything down. That instinct is exactly what gets organizations in trouble. Panic leads to evidence destruction. Unplugging the wrong system can cause more damage than the attacker. Rebooting a compromised server clears volatile memory that may contain encryption keys, the malware process, and the C2 connection details. Incident response is not about reacting. It is about following a plan you built before the crisis.
The NIST Incident Response Lifecycle
The National Institute of Standards and Technology Special Publication 800-61 defines the standard framework for incident response. Every security team should know this lifecycle cold.
stateDiagram-v2
[*] --> Preparation
Preparation --> Detection: Security event occurs
Detection --> Analysis: Alert triaged as potential incident
Analysis --> Containment: Incident confirmed
Containment --> Eradication: Threat isolated
Eradication --> Recovery: Threat removed
Recovery --> PostIncident: Systems restored
PostIncident --> Preparation: Lessons learned applied
state Preparation {
[*] --> Plans
Plans: IR plan, playbooks, communication plan
Plans --> Team
Team: Roles assigned, training complete
Team --> Tools
Tools: Forensic toolkit, jump bag ready
Tools --> Exercises
Exercises: Tabletop exercises, simulations
}
state "Detection & Analysis" as Detection {
[*] --> Monitor
Monitor: SIEM alerts, EDR, user reports
Monitor --> Triage
Triage: Is this real? What severity?
Triage --> Classify
Classify: Determine incident type and scope
}
state Containment {
[*] --> ShortTerm
ShortTerm: Isolate affected systems
ShortTerm --> Evidence
Evidence: Preserve volatile data
Evidence --> LongTerm
LongTerm: Implement temporary mitigations
}
state Eradication {
[*] --> Remove
Remove: Remove malware, close backdoors
Remove --> Patch
Patch: Fix root cause vulnerability
Patch --> Verify
Verify: Confirm threat fully removed
}
state Recovery {
[*] --> Restore
Restore: Rebuild from clean images
Restore --> Monitor2
Monitor2: Enhanced monitoring
Monitor2 --> Validate
Validate: Verify normal operations
}
state "Post-Incident" as PostIncident {
[*] --> Review
Review: Blameless retrospective
Review --> Document
Document: Update playbooks, detections
Document --> Improve
Improve: Apply lessons learned
}
The lifecycle is deliberately cyclical. Lessons learned from one incident feed directly into improved preparation for the next. The organizations that handle incidents well are the ones that invest heavily in the preparation phase --- before any incident occurs.
Phase 1: Preparation
Preparation is 80% of incident response. The time to build your IR plan, assemble your team, and practice your procedures is when nothing is on fire. Once the incident starts, you execute the plan --- you do not create it.
The Incident Response Plan
Every organization needs a written IR plan that covers:
- Scope and definitions: What constitutes an "incident" vs. a "security event"? Not every alert is an incident. Define your classification criteria
- Roles and responsibilities: Who does what? Who is authorized to make containment decisions? Who communicates with executive leadership? Who talks to the press?
- Communication channels: Out-of-band communication is essential --- if the attacker has compromised your email, you cannot use email to coordinate the response. Pre-establish a backup channel (Signal group, dedicated phone bridge, out-of-band Slack workspace)
- Escalation criteria: What triggers escalation from Tier 1 to Tier 2 to management to executive leadership to legal to external counsel?
- External contacts: Legal counsel, cyber insurance carrier, forensic firm, law enforcement contacts, regulatory notification contacts --- all pre-identified with current phone numbers
- Authority matrix: Who can authorize shutting down a production system? Who can authorize paying a ransom? Who can authorize public disclosure?
IR Team Roles
graph TD
IC["Incident Commander<br/>──────────────────<br/>Overall coordination<br/>Decision authority<br/>Resource allocation<br/>Executive communication"]
IC --> TL["Technical Lead<br/>──────────────────<br/>Investigation direction<br/>Technical decisions<br/>Evidence coordination<br/>Attack analysis"]
IC --> CL["Communications Lead<br/>──────────────────<br/>Internal comms (staff)<br/>External comms (PR)<br/>Customer notification<br/>Regulatory reporting"]
IC --> LL["Legal/Compliance Lead<br/>──────────────────<br/>Legal obligations<br/>Regulatory notification<br/>Evidence preservation<br/>Law enforcement liaison"]
TL --> FA["Forensic Analyst(s)<br/>──────────────────<br/>Host forensics<br/>Network forensics<br/>Malware analysis<br/>Timeline construction"]
TL --> SO["Systems/Network Ops<br/>──────────────────<br/>Containment execution<br/>System recovery<br/>Network changes<br/>Log collection"]
IC --> DOC["Scribe/Documenter<br/>──────────────────<br/>Timeline of all actions<br/>Decision log<br/>Evidence tracking<br/>Meeting notes"]
style IC fill:#e74c3c,color:#fff
style TL fill:#3498db,color:#fff
style CL fill:#f39c12,color:#fff
style LL fill:#9b59b6,color:#fff
style FA fill:#2ecc71,color:#fff
style SO fill:#2ecc71,color:#fff
style DOC fill:#95a5a6,color:#fff
The Incident Commander role is critical and often neglected. Without a clear IC, you get "too many cooks" --- multiple people making contradictory containment decisions, no one coordinating communication, and critical tasks falling through the cracks. The IC does not need to be the most technical person in the room. They need to be organized, calm under pressure, and decisive. Technical expertise belongs with the Technical Lead.
Phase 2: Detection and Analysis
Incident Severity Classification
Not every incident deserves the same response. A clear severity matrix ensures proportional resource allocation:
| Severity | Criteria | Response | Notification | Example |
|---|---|---|---|---|
| SEV-1 Critical | Active data breach, ransomware spreading, critical infrastructure down | All hands, 24/7 response, war room | CEO, Board, Legal, Insurance, potentially regulators | Ransomware encrypting production servers |
| SEV-2 High | Confirmed compromise, data exposure possible, significant business impact | IR team engaged, business hours + extended | CISO, VP Engineering, Legal | Compromised admin account with data access |
| SEV-3 Medium | Confirmed malware on single system, suspicious activity under investigation | IR team investigates, business hours | Security management, system owner | Malware detected and contained on workstation |
| SEV-4 Low | Policy violation, vulnerability exploitation attempt blocked | Triage and remediate, standard workflow | System owner, security team | Blocked exploit attempt, phishing email reported |
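The severity matrix lends itself to a simple triage helper. The sketch below encodes the table's criteria as a shell function; the flag names (active_breach, confirmed_compromise, contained) are illustrative assumptions, and real triage always involves analyst judgment.

```shell
#!/usr/bin/env bash
# Illustrative triage helper mirroring the SEV matrix above.
# Flag names are assumptions; this supplements, not replaces, human judgment.
classify_severity() {
  local active_breach="$1" confirmed_compromise="$2" contained="$3"
  if [ "$active_breach" = "yes" ]; then
    echo "SEV-1"   # active breach / ransomware spreading / critical infra down
  elif [ "$confirmed_compromise" = "yes" ] && [ "$contained" = "no" ]; then
    echo "SEV-2"   # confirmed compromise, not yet contained
  elif [ "$confirmed_compromise" = "yes" ]; then
    echo "SEV-3"   # confirmed but already contained (e.g., single workstation)
  else
    echo "SEV-4"   # blocked attempt or policy violation
  fi
}

classify_severity yes no no    # prints SEV-1
classify_severity no yes yes   # prints SEV-3
```

Encoding the matrix this way forces the team to agree on classification criteria before the 2 AM call, which is the entire point of the table above.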
Detection Sources
Incidents are detected through multiple channels, each with different reliability and speed:
flowchart LR
subgraph "Detection Sources"
A["SIEM Alerts<br/>Automated detection rules"]
B["EDR Alerts<br/>Endpoint behavioral detection"]
C["User Reports<br/>'I clicked something weird'"]
D["Threat Intel<br/>IOC match from feed"]
E["External Notification<br/>Law enforcement, researcher,<br/>customer complaint"]
F["Anomaly Detection<br/>Unusual traffic patterns,<br/>login anomalies"]
end
subgraph "Triage"
T["Is this real?<br/>False positive check<br/>Context enrichment<br/>Scope assessment"]
end
subgraph "Classification"
S1["SEV-1: Active breach"]
S2["SEV-2: Confirmed compromise"]
S3["SEV-3: Contained threat"]
S4["SEV-4: Attempted attack"]
end
A --> T
B --> T
C --> T
D --> T
E --> T
F --> T
T --> S1
T --> S2
T --> S3
T --> S4
style S1 fill:#e74c3c,color:#fff
style S2 fill:#e67e22,color:#fff
style S3 fill:#f1c40f,color:#333
style S4 fill:#27ae60,color:#fff
What is the most common way breaches are actually detected? Historically, the most common detection source was external notification --- someone else tells you that you have been breached. A law enforcement agency finds your data on a dark web marketplace. A security researcher discovers your database exposed on the internet. A customer reports fraudulent charges. The trend is improving with better detection tooling, but external notification still accounts for a significant percentage of initial detections, and those externally-detected breaches tend to have the longest dwell times and highest costs.
Phase 3: Containment
Containment is the most time-sensitive phase. The goal is to stop the bleeding --- prevent the attacker from expanding their access, exfiltrating more data, or causing further damage --- while preserving evidence for investigation.
Containment Strategies
flowchart TD
subgraph "Short-Term Containment (Minutes to Hours)"
SC1["Network Isolation<br/>Disconnect compromised systems<br/>from network (disable switch port,<br/>change VLAN, host firewall)"]
SC2["Credential Reset<br/>Reset passwords for compromised<br/>and potentially compromised accounts<br/>Revoke active sessions/tokens"]
SC3["DNS Sinkhole<br/>Redirect C2 domains to internal<br/>sinkhole to cut attacker comms<br/>without alerting them"]
SC4["Block IOCs<br/>Block attacker IPs at firewall<br/>Block malicious domains at DNS<br/>Block file hashes at EDR"]
end
subgraph "Long-Term Containment (Hours to Days)"
LC1["Network Segmentation<br/>Create isolated VLAN for<br/>forensic analysis<br/>Restrict lateral movement paths"]
LC2["Enhanced Monitoring<br/>Deploy additional capture points<br/>Increase logging verbosity<br/>Add detection rules for this threat"]
LC3["Temporary Patches<br/>Apply emergency patches<br/>Disable vulnerable services<br/>Implement compensating controls"]
LC4["Access Review<br/>Audit all privileged accounts<br/>Disable unnecessary access<br/>Enforce MFA on all admin accounts"]
end
SC1 --> LC1
SC2 --> LC4
SC3 --> LC2
SC4 --> LC3
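In practice, short-term host isolation is often a handful of host-firewall rules. The sketch below generates, but does not apply, iptables rules that drop everything except SSH to and from an IR workstation. The workstation address and the choice of SSH as the sole management channel are assumptions for illustration; review the output before applying it on the compromised host.

```shell
#!/usr/bin/env bash
# Emit isolation rules for review; nothing is applied automatically.
# The 10.0.99.x address and SSH-only management are illustrative assumptions.
generate_isolation_rules() {
  local ir_host="$1"   # IR/forensic workstation allowed to reach this host
  cat <<EOF
iptables -P INPUT DROP
iptables -P OUTPUT DROP
iptables -P FORWARD DROP
iptables -A INPUT  -s ${ir_host} -p tcp --dport 22 -j ACCEPT
iptables -A OUTPUT -d ${ir_host} -p tcp --sport 22 -j ACCEPT
EOF
}

generate_isolation_rules 10.0.99.5   # review, then apply on the compromised host
```

Note the default-drop policies: everything, including the attacker's C2 channel, is cut except the one path the forensic analyst needs.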
Critical Containment Decision: Isolate vs. Monitor
This is one of the hardest decisions in incident response. Do you immediately isolate the compromised system, which stops the attacker but also tips them off and may cause them to destroy evidence? Or do you monitor the system to understand the full scope of the compromise before containment, which gives you better intelligence but allows the attacker to continue operating?
The answer depends on the situation:
Isolate immediately when:
- Active data destruction (ransomware encrypting files)
- Active exfiltration of highly sensitive data (PII, financial data, classified information)
- Attacker has access to critical infrastructure (domain controllers, backup systems)
- Risk of spread is high and imminent
Monitor first when:
- You are unsure of the full scope (how many systems are compromised?)
- The attacker appears dormant or slow-moving
- You need to identify C2 infrastructure to block comprehensively
- Legal or law enforcement requests continued monitoring for attribution
- Early isolation would tip off the attacker, causing them to activate dormant implants on other systems
During a major incident, a team discovered a compromised server that was beaconing to a C2 server every 30 minutes. The initial impulse was to isolate it immediately, but the lead responder pushed back: "We know about this one system. How do we know there are not others? If we isolate this one, the attacker will know we are onto them and may activate other implants we have not found yet."
The team monitored for 72 hours while quietly deploying enhanced detection rules and network sensors. During that time, they identified four additional compromised systems --- including one on the backup network that would have given the attacker access to destroy all backups. When they finally executed containment, they isolated all five systems simultaneously with a coordinated action at 3 AM on Sunday. The attacker had no time to react.
If they had isolated the first system immediately, they would have missed the backup server compromise. The attacker would have known the investigation was underway and could have triggered ransomware across the remaining four systems. Patience saved them.
Preserving Volatile Evidence During Containment
Before isolating a system, capture volatile data that will be lost on shutdown:
# Capture running processes with full command lines
$ ps auxww > /evidence/$(hostname)_processes_$(date +%s).txt
# Capture network connections (listening and established)
$ netstat -antup > /evidence/$(hostname)_netstat_$(date +%s).txt
$ ss -antup > /evidence/$(hostname)_ss_$(date +%s).txt
# Capture routing table
$ ip route > /evidence/$(hostname)_routes_$(date +%s).txt
# Capture ARP cache (shows recent network neighbors)
$ arp -a > /evidence/$(hostname)_arp_$(date +%s).txt
# Capture logged-in users
$ who > /evidence/$(hostname)_who_$(date +%s).txt
$ w > /evidence/$(hostname)_w_$(date +%s).txt
# Capture loaded kernel modules
$ lsmod > /evidence/$(hostname)_modules_$(date +%s).txt
# Capture memory image (if forensic tools available)
# LiME for Linux:
$ sudo insmod /path/to/lime.ko "path=/evidence/$(hostname)_memory.lime format=lime"
# Capture system time and timezone (for timeline correlation)
$ date -u > /evidence/$(hostname)_time_$(date +%s).txt
$ timedatectl > /evidence/$(hostname)_timezone_$(date +%s).txt
# Hash all evidence files
$ sha256sum /evidence/$(hostname)_* > /evidence/$(hostname)_hashes.sha256
Order matters for volatile evidence collection. Memory is the most volatile (changes constantly), followed by running processes, network connections, and then disk contents. Collect in order from most volatile to least volatile. Every command you run on the compromised system changes its state (loads libraries, creates processes, allocates memory), so use minimal, pre-compiled static binaries when possible. Better yet, use a forensic toolkit USB drive with trusted tools.
Phase 4: Eradication
Once contained, the threat must be completely removed. Eradication means eliminating every artifact of the attack --- every backdoor, every persistence mechanism, every compromised credential.
Eradication Checklist
Malware Removal:
[ ] All identified malware binaries removed or systems reimaged
[ ] All persistence mechanisms removed:
  [ ] Scheduled tasks / cron jobs
  [ ] Startup scripts / registry Run keys
  [ ] Services / systemd units
  [ ] Web shells
  [ ] Modified system binaries
[ ] All C2 communication channels blocked at firewall and DNS
Credential Reset:
[ ] All compromised user passwords reset
[ ] All potentially compromised service account passwords reset
[ ] All API keys and tokens rotated
[ ] Kerberos KRBTGT password reset (TWICE, with replication between resets)
[ ] All SSH keys rotated on affected systems
[ ] MFA re-enrolled for affected accounts
Vulnerability Remediation:
[ ] Root cause vulnerability patched
[ ] Same vulnerability class checked across all systems
[ ] Configuration weaknesses that enabled spread fixed
[ ] Network segmentation gaps addressed
Verification:
[ ] Clean systems scanned with updated signatures
[ ] No remaining C2 communication observed
[ ] No new persistence mechanisms installed
[ ] IOC sweeps clean across all systems
[ ] Network traffic analysis shows no remaining anomalies
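The "IOC sweeps clean" item can be partly automated: hash files and compare against a known-bad list, where empty output means clean. The sketch below is a minimal illustration of that verification step; the function name and the throwaway sample are hypothetical, and a real sweep would use EDR tooling and cover far more artifact types than file hashes.

```shell
#!/usr/bin/env bash
# Minimal file-hash IOC sweep: empty output means no known-bad hashes found.
# A real sweep also covers processes, persistence locations, and network IOCs.
ioc_sweep() {
  local scan_dir="$1" ioc_file="$2"   # ioc_file: one SHA-256 hash per line
  find "$scan_dir" -type f -exec sha256sum {} + 2>/dev/null |
    grep -F -f "$ioc_file" || true
}

# Demo with a throwaway "malware" sample (hypothetical, for illustration):
tmp=$(mktemp -d)
echo "evil payload" > "$tmp/dropper.bin"
sha256sum "$tmp/dropper.bin" | awk '{print $1}' > "$tmp/iocs.txt"
ioc_sweep "$tmp" "$tmp/iocs.txt"   # flags dropper.bin
```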
Why reset the KRBTGT password twice? The KRBTGT account's key is used to sign Kerberos tickets in Active Directory. If an attacker has obtained the KRBTGT hash (for example via DCSync or by stealing NTDS.dit), they can forge Golden Tickets --- authentication tickets for any user, including domain admins. The password must be reset twice because Active Directory keeps both the current and the previous KRBTGT password. After the first reset, the old compromised hash is still valid as the "previous" password. Only after the second reset (with at least one replication cycle in between) is the compromised hash fully invalidated. Forgetting this step is how organizations get re-compromised weeks after an incident.
Phase 5: Recovery
Recovery is the process of bringing affected systems back to normal operations. This is not just "turn things back on" --- it requires careful verification that restored systems are clean and that the attacker cannot regain access.
Recovery Steps
1. Rebuild from clean images. Do not attempt to "clean" a compromised system in place. Reimage it from a known-good baseline. Rootkits and advanced malware can survive cleaning attempts.
2. Restore data from verified backups. Ensure backups predate the compromise. Restoring from a backup taken after the attacker had access means restoring the backdoor too.
3. Apply all patches. The rebuilt system must be fully patched, including the vulnerability that enabled the initial compromise.
4. Harden configurations. Apply security baselines (CIS benchmarks) during rebuild. The incident is an opportunity to fix configuration weaknesses.
5. Enhanced monitoring. Increase monitoring intensity on recovered systems for at least 30 days. Watch for any sign that the attacker maintained access through a mechanism you missed.
6. Gradual restoration. Do not restore all systems simultaneously. Start with the least critical, verify clean operation, then proceed to more critical systems.
Build an incident recovery checklist specific to your environment:
1. Document your rebuild process for each critical system type (web server, database, domain controller)
2. Verify your golden images are current and stored securely
3. Test a full rebuild-from-image process quarterly
4. Ensure your backup restoration process is documented and tested
5. Identify the maximum tolerable downtime for each critical system (RTO)
6. Identify the maximum acceptable data loss for each system (RPO)
7. Store this documentation outside your primary infrastructure (so it is accessible when the infrastructure is down)
Recovery Metrics and Validation
How do you know when recovery is actually complete? Recovery is not "complete" when systems are back online. It is complete when you have validated that every recovered system is clean, hardened, and monitored.
# Verify rebuilt system matches golden image hash
$ sha256sum /mnt/rebuilt/system.img
# Compare against known-good baseline stored offline
# Verify no unauthorized accounts exist
$ awk -F: '$3 >= 1000 && $1 != "nobody" {print $1}' /etc/passwd
# Cross-reference against HR directory --- every account must map to an active employee
# Verify no unexpected services are running
$ systemctl list-units --type=service --state=running | \
diff - /secure/baseline/expected_services.txt
# Any differences warrant immediate investigation
# Verify no unauthorized SSH keys
$ find /root /home -name "authorized_keys" -exec cat {} \; | \
diff - /secure/baseline/authorized_keys_baseline.txt
# Verify firewall rules match expected policy
$ iptables -L -n --line-numbers > /tmp/current_rules.txt
$ diff /secure/baseline/firewall_rules.txt /tmp/current_rules.txt
# Run vulnerability scan against rebuilt system
$ nmap -sV --script vulners -p- 10.0.1.50
Industry benchmarks from NIST and SANS provide context for recovery timelines:
| Severity | Target Detection Time | Target Containment | Target Recovery |
|----------|----------------------|-------------------|-----------------|
| SEV-1 (critical) | < 1 hour | < 4 hours | < 24 hours |
| SEV-2 (high) | < 4 hours | < 24 hours | < 72 hours |
| SEV-3 (medium) | < 24 hours | < 72 hours | < 1 week |
| SEV-4 (low) | < 1 week | < 2 weeks | Next maintenance window |
The 2024 IBM Cost of a Data Breach Report found that organizations with IR teams and tested IR plans identified breaches 54 days faster (mean of 204 days vs. 258 days) and contained them 68 days faster than those without. Each day of faster identification saved approximately $33,000 in breach costs. The total average cost difference between having and not having an IR plan was $2.66 million.
Most organizations significantly underestimate recovery time. A typical ransomware recovery --- from decision to rebuild through full restoration of services --- takes 2-4 weeks even with good backups. Active Directory recovery alone (rebuilding domain controllers, resetting all credentials including KRBTGT twice with a 12-hour interval, re-establishing trusts) commonly takes 3-5 days.
Forensic Evidence Preservation During Recovery
One of the biggest mistakes during recovery is destroying forensic evidence in the rush to restore services. Every system you rebuild without imaging first is evidence lost forever.
flowchart TD
A["System identified<br/>for recovery"] --> B{"Forensic image<br/>taken?"}
B -->|"No"| C["Create forensic image<br/>dd if=/dev/sda of=evidence.dd"]
B -->|"Yes"| D["Verify image hash<br/>matches original"]
C --> D
D --> E["Document in evidence log:<br/>timestamp, hash, analyst,<br/>storage location"]
E --> F["Rebuild from<br/>clean golden image"]
F --> G["Apply patches and<br/>hardened configuration"]
G --> H["Enhanced monitoring<br/>30 days minimum"]
H --> I{"Clean for<br/>30 days?"}
I -->|"Yes"| J["Move to standard<br/>monitoring"]
I -->|"No / anomaly detected"| K["Re-investigate:<br/>possible missed<br/>persistence mechanism"]
K --> C
style A fill:#e74c3c,color:#fff
style C fill:#f39c12,color:#fff
style F fill:#3498db,color:#fff
style J fill:#27ae60,color:#fff
style K fill:#e74c3c,color:#fff
Image every compromised machine before rebuilding, even if it slows down recovery. The forensic image takes 30-60 minutes per machine with dd or a forensic imager. That is a small delay compared to the alternative: you rebuild, the attacker gets back in through a persistence mechanism you missed, and now you have no evidence of how they maintained access because you overwrote the original disk. Organizations that skip the forensic step sometimes go through three recovery cycles before they finally capture the evidence they need.
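The image-then-verify discipline can be scripted so it is never skipped under pressure. The sketch below copies a source with dd, hashes both sides, and appends to an evidence log. The function name and paths are illustrative, and the demo images a regular file standing in for a block device; on a real system the source would be /dev/sdX and the image would land on external, write-once evidence storage.

```shell
#!/usr/bin/env bash
# Image, hash both sides, and log -- so the forensic step is never skipped.
# Paths are illustrative assumptions for this sketch.
image_and_verify() {
  local src="$1" img="$2" log="$3"
  dd if="$src" of="$img" bs=4M conv=fsync status=none
  local src_hash img_hash
  src_hash=$(sha256sum "$src" | awk '{print $1}')
  img_hash=$(sha256sum "$img" | awk '{print $1}')
  if [ "$src_hash" = "$img_hash" ]; then
    printf '%s IMAGE-OK %s %s\n' "$(date -u +%FT%TZ)" "$img" "$img_hash" >> "$log"
  else
    printf '%s IMAGE-MISMATCH %s\n' "$(date -u +%FT%TZ)" "$img" >> "$log"
    return 1
  fi
}

# Demo against a regular file standing in for a block device:
tmp=$(mktemp -d)
head -c 1048576 /dev/urandom > "$tmp/sdb"
image_and_verify "$tmp/sdb" "$tmp/sdb.dd" "$tmp/evidence.log"
```

The timestamped hash in the log is what lets you later prove the image has not been altered since capture, which matters if the evidence ever reaches court.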
Communication During Incidents
Who tells whom, when, and what they say can make or break an incident response. Poor communication amplifies damage; clear communication contains it.
Internal Communication
sequenceDiagram
participant SOC as SOC Analyst
participant IC as Incident Commander
participant CISO as CISO
participant CEO as CEO / Exec Team
participant Legal as Legal Counsel
participant PR as Communications/PR
participant Staff as All Staff
SOC->>IC: SEV-1 incident detected<br/>Ransomware on file servers
IC->>CISO: Briefing: scope, impact, containment status
IC->>Legal: Notification: potential data breach<br/>Assess regulatory obligations
CISO->>CEO: Executive briefing<br/>Business impact, ETA for resolution
Legal->>IC: Guidance: preserve evidence,<br/>72-hour GDPR clock starts NOW
Note over IC: Decision point: external notification needed?
IC->>PR: Draft customer notification<br/>Draft media holding statement
CEO->>Staff: Internal all-hands:<br/>"We are aware of an incident.<br/>IR team is responding.<br/>Do not discuss externally."
Note over Legal: GDPR: 72 hours from awareness<br/>to notify supervisory authority
Note over Legal: SEC: 4 business days for material<br/>cybersecurity incident (8-K)
Note over Legal: HIPAA: 60 days to notify HHS<br/>if >500 records affected
Note over Legal: State breach notification laws:<br/>vary by state, typically 30-60 days
External Communication Timeline
| When | Who | What |
|---|---|---|
| Immediately | Cyber insurance carrier | Notify of potential claim. They may provide IR resources, legal counsel, and forensic firms |
| Within hours | External legal counsel | Engage breach counsel to manage privilege and regulatory obligations |
| Within hours | Forensic firm (if needed) | Engage third-party IR firm for SEV-1 incidents |
| Within 72 hours (GDPR) | Supervisory authority | Data breach notification if personal data of EU residents affected |
| Within 4 business days (SEC) | SEC filing | 8-K filing for material cybersecurity incidents (public companies) |
| When scope is understood | Affected customers | Clear, honest notification with what happened, what data was affected, and what they should do |
| When ready | Media | Holding statement, then detailed statement. Never speculate publicly |
| When appropriate | Law enforcement | FBI IC3, Secret Service, or local law enforcement |
Everything written during an incident may be discoverable in litigation. Every email, Slack message, and internal document. This is why legal counsel should be engaged immediately for SEV-1 incidents. Communications made "at the direction of legal counsel" for the purpose of obtaining legal advice may be protected by attorney-client privilege. Without this protection, your internal assessment of "we messed up, here's how the attacker got in" becomes plaintiff's exhibit A.
GDPR 72-Hour Notification
Article 33 of GDPR requires notification to the supervisory authority within 72 hours of becoming aware of a personal data breach. The notification must include:
- Nature of the breach (what happened)
- Categories and approximate number of data subjects affected
- Categories and approximate number of personal data records affected
- Name and contact details of the DPO
- Likely consequences of the breach
- Measures taken or proposed to address the breach
The 72-hour clock starts when you become aware of the breach, not when you complete your investigation. You do not need to have all the answers in 72 hours --- GDPR allows phased notification where you provide information as it becomes available. But you must make the initial notification on time. Missing the 72-hour window can result in separate fines on top of any penalties for the breach itself.
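Computing the deadline the moment awareness is declared removes any ambiguity later. A minimal sketch using GNU date; the awareness timestamp is an illustrative example:

```shell
#!/usr/bin/env bash
# GDPR Article 33 deadline from the moment of awareness (GNU date assumed).
gdpr_deadline() {
  local aware_utc="$1"   # e.g. "2024-03-02 02:17:00" (UTC)
  date -u -d "${aware_utc} UTC + 72 hours" +"%Y-%m-%d %H:%M UTC"
}

gdpr_deadline "2024-03-02 02:17:00"   # prints: 2024-03-05 02:17 UTC
```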
Phase 6: Post-Incident Review
The post-incident review is the most valuable phase and the most commonly skipped. Teams are exhausted, management wants to move on, and nobody wants to revisit the worst week of their career. But this is where the real learning happens.
The Blameless Retrospective
The post-incident review must be blameless. The goal is to understand what happened and how to prevent it, not to assign blame. If people fear punishment, they will hide information, and the review will be useless.
Structure:
- Timeline reconstruction: Build a detailed, agreed-upon timeline of events. When was the initial compromise? When was it detected? When was it contained? Include response actions and their timing.
- Root cause analysis: What was the root cause of the incident? Not just "the attacker used phishing" --- go deeper. Why did the phishing email bypass email security? Why did the compromised account have admin access? Why was the vulnerability unpatched for 60 days?
- What went well: What parts of the response worked? What detection rules fired correctly? What processes saved time?
- What needs improvement: What detection gaps existed? What processes slowed the response? What communication broke down?
- Action items: Specific, assigned, time-bound actions. Not "improve monitoring" but "deploy Sysmon to all Windows servers by April 15 (owner: James)."
Google's approach to blameless postmortems is worth studying. Their SRE book describes postmortems that focus on systemic fixes rather than individual blame. The key insight: in complex systems, incidents are rarely caused by a single person's mistake. They result from systemic weaknesses --- inadequate monitoring, unclear procedures, missing safeguards, organizational pressure to move fast. Blaming the person who clicked the phishing email misses the systemic failures that made one click catastrophic: lack of MFA, excessive permissions, missing network segmentation, inadequate backup strategy. Fix the systems, not the people.
Tabletop Exercises
How do you practice incident response without waiting for a real incident? Tabletop exercises. They are the fire drills of cybersecurity.
Designing a Tabletop Exercise
A tabletop exercise is a structured discussion where the IR team walks through a hypothetical incident scenario. No actual systems are touched --- it is purely a discussion exercise. But it reveals gaps in plans, unclear responsibilities, and untested assumptions.
Exercise structure:
1. Scenario introduction: Present a realistic scenario (e.g., "Your SOC receives an alert that ransomware has been detected on three file servers in the finance department at 11 PM on Friday")
2. Injects: At intervals, introduce new information that changes the situation:
   - Inject 1: "The ransomware is spreading to other departments via SMB"
   - Inject 2: "A reporter calls asking about a 'data breach' at your company"
   - Inject 3: "The attacker contacts you and demands $2 million in Bitcoin"
   - Inject 4: "Legal informs you that EU customer data may be affected (GDPR clock starts)"
   - Inject 5: "Your CEO asks if you should pay the ransom"
3. Discussion: For each inject, the team discusses: What do we do? Who is responsible? What information do we need? What are the trade-offs?
4. After-action review: What did we learn? What gaps did we identify? What needs to change in our IR plan?
Sample scenarios to rotate through:
- Ransomware affecting production systems
- Business email compromise resulting in wire fraud
- Insider threat (employee exfiltrating data before departure)
- Supply chain compromise (a vendor's software update contains a backdoor)
- Third-party breach affecting your customers' data
- Zero-day exploitation of a critical public-facing application
- Physical security incident (stolen laptop with unencrypted data)
Run tabletop exercises quarterly. Include executives at least annually --- they need to practice their role in communication and decision-making. The exercise is successful when it reveals something you did not know was a problem.
Ransomware-Specific Response
Ransomware deserves its own response playbook because it presents unique challenges: time pressure, encryption of evidence, potential destruction of backups, and the ransom payment decision.
The Ransom Payment Decision
The question "should we pay?" is not primarily a technical decision --- it is a business, legal, and ethical decision. Here are the considerations:
Arguments for paying:
- Business survival may depend on data recovery
- Insurance may cover the payment
- Some ransomware groups have reliable decryption tools (paradoxically, reliable service encourages future payments)
- Cost of downtime may far exceed ransom amount
Arguments against paying:
- No guarantee of decryption (some groups provide broken decryptors)
- Funds criminal enterprises and incentivizes future attacks
- May violate OFAC sanctions if the group is linked to a sanctioned entity (this can result in civil penalties regardless of your intentions)
- You may be targeted again because you are known to pay
- Payment does not remove the attacker's presence --- they may still have access
The strong recommendation is to never plan on paying. Invest in backups, segmentation, and detection so that payment is never necessary. But some organizations face existential risk if they cannot recover their data, and a blanket "never pay" policy ignores reality. The best defense against the payment decision is making sure you never have to make it.
Ransomware Response Checklist
Immediate (0-1 hours):
[ ] Activate IR plan, assign Incident Commander
[ ] Do NOT reboot or shut down encrypted systems (preserve memory)
[ ] Isolate affected systems from network (disable switch ports)
[ ] Determine if encryption is still spreading
[ ] Preserve at least one encrypted system for forensic analysis
[ ] Check backup integrity: are backups affected?
[ ] Notify cyber insurance carrier
[ ] Engage legal counsel
First 24 hours:
[ ] Identify ransomware variant (ransom note, file extensions, IOCs)
[ ] Check nomoreransom.org for free decryptors
[ ] Determine initial access vector (how did it get in?)
[ ] Assess scope: how many systems, what data affected?
[ ] Begin backup recovery if backups are clean
[ ] Assess regulatory notification requirements
[ ] Executive briefing on scope, impact, recovery timeline
48-72 hours:
[ ] Continue recovery from backups
[ ] Patch the initial access vulnerability
[ ] Reset all potentially compromised credentials
[ ] Deploy enhanced monitoring for attacker persistence
[ ] Begin regulatory notifications if required
[ ] Customer communication if data was exfiltrated
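The "identify ransomware variant" step usually starts with the obvious markers: ransom note filenames and encrypted-file extensions, which are exactly what sites like nomoreransom.org key on. A minimal survey sketch; the ".locked" extension and note filename are illustrative assumptions, since real variants use many different markers:

```shell
#!/usr/bin/env bash
# Quick survey of encryption scope: count encrypted files, locate ransom notes.
# The extension and note name passed in are illustrative assumptions.
survey_encryption() {
  local root="$1" ext="$2" note="$3"
  local count
  count=$(find "$root" -type f -name "*.${ext}" | wc -l | tr -d ' ')
  echo "encrypted_files=${count}"
  find "$root" -type f -iname "$note" | sed 's/^/ransom_note=/'
}

# Demo on a throwaway directory:
tmp=$(mktemp -d)
touch "$tmp/report.docx.locked" "$tmp/q2.xlsx.locked"
echo "pay us" > "$tmp/README_RESTORE.txt"
survey_encryption "$tmp" locked "README_RESTORE.txt"
```

Run it read-only from a trusted toolkit; do not install tools on the affected systems themselves.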
Building Your IR Capability
Where do you start if you have no IR capability today? You start small and build.
Month 1-2: Foundation
- Write a basic IR plan (even a 5-page document is better than nothing)
- Identify your IR team members and alternates
- Set up an out-of-band communication channel
- Identify external resources: legal counsel, forensic firm, insurance
Month 3-4: Detection
- Deploy essential logging (authentication, DNS, process creation)
- Configure 5-10 high-fidelity detection rules in your SIEM
- Set up a phishing report button for employees
- Create runbooks for your most common alert types
Month 5-6: Practice
- Conduct your first tabletop exercise
- Test your backup restoration process
- Validate your containment procedures on a test system
- Review and update your IR plan based on exercise findings
Ongoing:
- Quarterly tabletop exercises with rotating scenarios
- Annual exercises including executive leadership
- Continuous improvement of detection rules and playbooks
- Regular review of external contacts and contracts
What You've Learned
In this chapter, you explored the complete incident response lifecycle:
- NIST SP 800-61 lifecycle: Preparation, Detection and Analysis, Containment, Eradication, Recovery, and Post-Incident Activity form a cyclical process where lessons from each incident improve preparation for the next.
- Preparation is paramount: IR plans, team roles, communication channels, and tabletop exercises must be in place before an incident occurs. The IC, Technical Lead, Communications Lead, and Legal Lead each have distinct responsibilities.
- Incident severity classification: SEV-1 through SEV-4 ensures proportional response. Not every alert warrants an all-hands response; not every incident requires regulatory notification.
- Containment strategy: The isolate-vs-monitor decision is one of the hardest in IR. Immediate isolation stops the bleeding but may alert the attacker and cause them to destroy evidence or activate dormant access. Monitoring first reveals scope but allows continued damage.
- Evidence preservation: Volatile data (memory, processes, connections) must be captured before containment actions. Order matters: most volatile first. Use trusted tools, not tools from the compromised system.
- Eradication completeness: Removing malware is not enough. Credential resets (including KRBTGT twice), persistence mechanism removal, and vulnerability remediation must all be verified.
- Communication plan: Internal communication, regulatory notification (GDPR 72 hours, SEC 4 business days), customer notification, and media communication each have specific timelines and requirements. Legal counsel should be engaged immediately for SEV-1 incidents.
- Blameless retrospectives: Post-incident reviews that focus on systemic improvements rather than individual blame produce better outcomes. Fix the systems, not the people.
- Tabletop exercises: Quarterly exercises with realistic scenarios and injects reveal gaps that cannot be found any other way. Include executives at least annually.
The organizations that handle incidents well are not the ones that never get attacked. They are the ones that prepared, practiced, and built the muscle memory to respond calmly and effectively when the attack inevitably comes. Start building your IR capability today --- even a basic plan is infinitely better than no plan at all. And when you design your first tabletop exercise, make the scenario uncomfortable. If the exercise is easy, it is not realistic enough.