Chapter 35: Incident Response
"Everyone has a plan until they get punched in the mouth." --- Mike Tyson (and every incident response team at 2 AM on a Saturday)
The Call Nobody Wants to Get
Saturday, 2:17 AM. Your phone rings. It is the on-call engineer. "Hey, uh... the monitoring dashboard is showing something weird. All our file servers just started encrypting files. And the domain controller... I think it is down."
Your first instinct might be to pull the network cables and shut everything down. That instinct is exactly what gets organizations in trouble. Panic leads to evidence destruction. Unplugging the wrong system can cause more damage than the attacker. Rebooting a compromised server clears volatile memory that may contain encryption keys, the malware process, and the C2 connection details. Incident response is not about reacting. It is about following a plan you built before the crisis.
The NIST Incident Response Lifecycle
The National Institute of Standards and Technology Special Publication 800-61 defines the standard framework for incident response. Every security team should know this lifecycle cold.
stateDiagram-v2
[*] --> Preparation
Preparation --> Detection: Security event occurs
Detection --> Analysis: Alert triaged as potential incident
Analysis --> Containment: Incident confirmed
Containment --> Eradication: Threat isolated
Eradication --> Recovery: Threat removed
Recovery --> PostIncident: Systems restored
PostIncident --> Preparation: Lessons learned applied
state Preparation {
[*] --> Plans
Plans: IR plan, playbooks, communication plan
Plans --> Team
Team: Roles assigned, training complete
Team --> Tools
Tools: Forensic toolkit, jump bag ready
Tools --> Exercises
Exercises: Tabletop exercises, simulations
}
state "Detection & Analysis" as Detection {
[*] --> Monitor
Monitor: SIEM alerts, EDR, user reports
Monitor --> Triage
Triage: Is this real? What severity?
Triage --> Classify
Classify: Determine incident type and scope
}
state Containment {
[*] --> ShortTerm
ShortTerm: Isolate affected systems
ShortTerm --> Evidence
Evidence: Preserve volatile data
Evidence --> LongTerm
LongTerm: Implement temporary mitigations
}
state Eradication {
[*] --> Remove
Remove: Remove malware, close backdoors
Remove --> Patch
Patch: Fix root cause vulnerability
Patch --> Verify
Verify: Confirm threat fully removed
}
state Recovery {
[*] --> Restore
Restore: Rebuild from clean images
Restore --> Monitor2
Monitor2: Enhanced monitoring
Monitor2 --> Validate
Validate: Verify normal operations
}
state "Post-Incident" as PostIncident {
[*] --> Review
Review: Blameless retrospective
Review --> Document
Document: Update playbooks, detections
Document --> Improve
Improve: Apply lessons learned
}
The lifecycle is deliberately cyclical. Lessons learned from one incident feed directly into improved preparation for the next. The organizations that handle incidents well are the ones that invest heavily in the preparation phase --- before any incident occurs.
Phase 1: Preparation
Preparation is 80% of incident response. The time to build your IR plan, assemble your team, and practice your procedures is when nothing is on fire. Once the incident starts, you execute the plan --- you do not create it.
The Incident Response Plan
Every organization needs a written IR plan that covers:
- Scope and definitions: What constitutes an "incident" vs. a "security event"? Not every alert is an incident. Define your classification criteria
- Roles and responsibilities: Who does what? Who is authorized to make containment decisions? Who communicates with executive leadership? Who talks to the press?
- Communication channels: Out-of-band communication is essential --- if the attacker has compromised your email, you cannot use email to coordinate the response. Pre-establish a backup channel (Signal group, dedicated phone bridge, out-of-band Slack workspace)
- Escalation criteria: What triggers escalation from Tier 1 to Tier 2 to management to executive leadership to legal to external counsel?
- External contacts: Legal counsel, cyber insurance carrier, forensic firm, law enforcement contacts, regulatory notification contacts --- all pre-identified with current phone numbers
- Authority matrix: Who can authorize shutting down a production system? Who can authorize paying a ransom? Who can authorize public disclosure?
IR Team Roles
graph TD
IC["Incident Commander<br/>──────────────────<br/>Overall coordination<br/>Decision authority<br/>Resource allocation<br/>Executive communication"]
IC --> TL["Technical Lead<br/>──────────────────<br/>Investigation direction<br/>Technical decisions<br/>Evidence coordination<br/>Attack analysis"]
IC --> CL["Communications Lead<br/>──────────────────<br/>Internal comms (staff)<br/>External comms (PR)<br/>Customer notification<br/>Regulatory reporting"]
IC --> LL["Legal/Compliance Lead<br/>──────────────────<br/>Legal obligations<br/>Regulatory notification<br/>Evidence preservation<br/>Law enforcement liaison"]
TL --> FA["Forensic Analyst(s)<br/>──────────────────<br/>Host forensics<br/>Network forensics<br/>Malware analysis<br/>Timeline construction"]
TL --> SO["Systems/Network Ops<br/>──────────────────<br/>Containment execution<br/>System recovery<br/>Network changes<br/>Log collection"]
IC --> DOC["Scribe/Documenter<br/>──────────────────<br/>Timeline of all actions<br/>Decision log<br/>Evidence tracking<br/>Meeting notes"]
style IC fill:#e74c3c,color:#fff
style TL fill:#3498db,color:#fff
style CL fill:#f39c12,color:#fff
style LL fill:#9b59b6,color:#fff
style FA fill:#2ecc71,color:#fff
style SO fill:#2ecc71,color:#fff
style DOC fill:#95a5a6,color:#fff
The Incident Commander role is critical and often neglected. Without a clear IC, you get "too many cooks" --- multiple people making contradictory containment decisions, no one coordinating communication, and critical tasks falling through the cracks. The IC does not need to be the most technical person in the room. They need to be organized, calm under pressure, and decisive. Technical expertise belongs with the Technical Lead.
Phase 2: Detection and Analysis
Incident Severity Classification
Not every incident deserves the same response. A clear severity matrix ensures proportional resource allocation:
| Severity | Criteria | Response | Notification | Example |
|---|---|---|---|---|
| SEV-1 Critical | Active data breach, ransomware spreading, critical infrastructure down | All hands, 24/7 response, war room | CEO, Board, Legal, Insurance, potentially regulators | Ransomware encrypting production servers |
| SEV-2 High | Confirmed compromise, data exposure possible, significant business impact | IR team engaged, business hours + extended | CISO, VP Engineering, Legal | Compromised admin account with data access |
| SEV-3 Medium | Confirmed malware on single system, suspicious activity under investigation | IR team investigates, business hours | Security management, system owner | Malware detected and contained on workstation |
| SEV-4 Low | Policy violation, vulnerability exploitation attempt blocked | Triage and remediate, standard workflow | System owner, security team | Blocked exploit attempt, phishing email reported |
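The severity matrix lends itself to a simple triage helper. The sketch below encodes the table's criteria as a shell function; the flag names (active_breach, confirmed_compromise, contained) are illustrative assumptions, and real triage always involves analyst judgment.

```shell
#!/usr/bin/env bash
# Illustrative triage helper mirroring the SEV matrix above.
# Flag names are assumptions; this supplements, not replaces, human judgment.
classify_severity() {
  local active_breach="$1" confirmed_compromise="$2" contained="$3"
  if [ "$active_breach" = "yes" ]; then
    echo "SEV-1"   # active breach / ransomware spreading / critical infra down
  elif [ "$confirmed_compromise" = "yes" ] && [ "$contained" = "no" ]; then
    echo "SEV-2"   # confirmed compromise, not yet contained
  elif [ "$confirmed_compromise" = "yes" ]; then
    echo "SEV-3"   # confirmed but already contained (e.g., single workstation)
  else
    echo "SEV-4"   # blocked attempt or policy violation
  fi
}

classify_severity yes no no    # prints SEV-1
classify_severity no yes yes   # prints SEV-3
```

Encoding the matrix this way forces the team to agree on classification criteria before the 2 AM call, which is the entire point of the table above.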
Detection Sources
Incidents are detected through multiple channels, each with different reliability and speed:
flowchart LR
subgraph "Detection Sources"
A["SIEM Alerts<br/>Automated detection rules"]
B["EDR Alerts<br/>Endpoint behavioral detection"]
C["User Reports<br/>'I clicked something weird'"]
D["Threat Intel<br/>IOC match from feed"]
E["External Notification<br/>Law enforcement, researcher,<br/>customer complaint"]
F["Anomaly Detection<br/>Unusual traffic patterns,<br/>login anomalies"]
end
subgraph "Triage"
T["Is this real?<br/>False positive check<br/>Context enrichment<br/>Scope assessment"]
end
subgraph "Classification"
S1["SEV-1: Active breach"]
S2["SEV-2: Confirmed compromise"]
S3["SEV-3: Contained threat"]
S4["SEV-4: Attempted attack"]
end
A --> T
B --> T
C --> T
D --> T
E --> T
F --> T
T --> S1
T --> S2
T --> S3
T --> S4
style S1 fill:#e74c3c,color:#fff
style S2 fill:#e67e22,color:#fff
style S3 fill:#f1c40f,color:#333
style S4 fill:#27ae60,color:#fff
What is the most common way breaches are actually detected? Historically, the most common detection source was external notification --- someone else tells you that you have been breached. A law enforcement agency finds your data on a dark web marketplace. A security researcher discovers your database exposed on the internet. A customer reports fraudulent charges. The trend is improving with better detection tooling, but external notification still accounts for a significant percentage of initial detections, and those externally-detected breaches tend to have the longest dwell times and highest costs.
Phase 3: Containment
Containment is the most time-sensitive phase. The goal is to stop the bleeding --- prevent the attacker from expanding their access, exfiltrating more data, or causing further damage --- while preserving evidence for investigation.
Containment Strategies
flowchart TD
subgraph "Short-Term Containment (Minutes to Hours)"
SC1["Network Isolation<br/>Disconnect compromised systems<br/>from network (disable switch port,<br/>change VLAN, host firewall)"]
SC2["Credential Reset<br/>Reset passwords for compromised<br/>and potentially compromised accounts<br/>Revoke active sessions/tokens"]
SC3["DNS Sinkhole<br/>Redirect C2 domains to internal<br/>sinkhole to cut attacker comms<br/>without alerting them"]
SC4["Block IOCs<br/>Block attacker IPs at firewall<br/>Block malicious domains at DNS<br/>Block file hashes at EDR"]
end
subgraph "Long-Term Containment (Hours to Days)"
LC1["Network Segmentation<br/>Create isolated VLAN for<br/>forensic analysis<br/>Restrict lateral movement paths"]
LC2["Enhanced Monitoring<br/>Deploy additional capture points<br/>Increase logging verbosity<br/>Add detection rules for this threat"]
LC3["Temporary Patches<br/>Apply emergency patches<br/>Disable vulnerable services<br/>Implement compensating controls"]
LC4["Access Review<br/>Audit all privileged accounts<br/>Disable unnecessary access<br/>Enforce MFA on all admin accounts"]
end
SC1 --> LC1
SC2 --> LC4
SC3 --> LC2
SC4 --> LC3
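In practice, short-term host isolation is often a handful of host-firewall rules. The sketch below generates, but does not apply, iptables rules that drop everything except SSH to and from an IR workstation. The workstation address and the choice of SSH as the sole management channel are assumptions for illustration; review the output before applying it on the compromised host.

```shell
#!/usr/bin/env bash
# Emit isolation rules for review; nothing is applied automatically.
# The 10.0.99.x address and SSH-only management are illustrative assumptions.
generate_isolation_rules() {
  local ir_host="$1"   # IR/forensic workstation allowed to reach this host
  cat <<EOF
iptables -P INPUT DROP
iptables -P OUTPUT DROP
iptables -P FORWARD DROP
iptables -A INPUT  -s ${ir_host} -p tcp --dport 22 -j ACCEPT
iptables -A OUTPUT -d ${ir_host} -p tcp --sport 22 -j ACCEPT
EOF
}

generate_isolation_rules 10.0.99.5   # review, then apply on the compromised host
```

Note the default-drop policies: everything, including the attacker's C2 channel, is cut except the one path the forensic analyst needs.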
Critical Containment Decision: Isolate vs. Monitor
This is one of the hardest decisions in incident response. Do you immediately isolate the compromised system, which stops the attacker but also tips them off and may cause them to destroy evidence? Or do you monitor the system to understand the full scope of the compromise before containment, which gives you better intelligence but allows the attacker to continue operating?
The answer depends on the situation:
Isolate immediately when:
- Active data destruction (ransomware encrypting files)
- Active exfiltration of highly sensitive data (PII, financial data, classified information)
- Attacker has access to critical infrastructure (domain controllers, backup systems)
- Risk of spread is high and imminent
Monitor first when:
- You are unsure of the full scope (how many systems are compromised?)
- The attacker appears dormant or slow-moving
- You need to identify C2 infrastructure to block comprehensively
- Legal or law enforcement requests continued monitoring for attribution
- Early isolation would tip off the attacker, causing them to activate dormant implants on other systems
During a major incident, a team discovered a compromised server that was beaconing to a C2 server every 30 minutes. The initial impulse was to isolate it immediately, but the lead responder pushed back: "We know about this one system. How do we know there are not others? If we isolate this one, the attacker will know we are onto them and may activate other implants we have not found yet."
The team monitored for 72 hours while quietly deploying enhanced detection rules and network sensors. During that time, they identified four additional compromised systems --- including one on the backup network that would have given the attacker access to destroy all backups. When they finally executed containment, they isolated all five systems simultaneously with a coordinated action at 3 AM on Sunday. The attacker had no time to react.
If they had isolated the first system immediately, they would have missed the backup server compromise. The attacker would have known the investigation was underway and could have triggered ransomware across the remaining four systems. Patience saved them.
Preserving Volatile Evidence During Containment
Before isolating a system, capture volatile data that will be lost on shutdown:
# Capture running processes with full command lines
$ ps auxww > /evidence/$(hostname)_processes_$(date +%s).txt
# Capture network connections (listening and established)
$ netstat -antup > /evidence/$(hostname)_netstat_$(date +%s).txt
$ ss -antup > /evidence/$(hostname)_ss_$(date +%s).txt
# Capture routing table
$ ip route > /evidence/$(hostname)_routes_$(date +%s).txt
# Capture ARP cache (shows recent network neighbors)
$ arp -a > /evidence/$(hostname)_arp_$(date +%s).txt
# Capture logged-in users
$ who > /evidence/$(hostname)_who_$(date +%s).txt
$ w > /evidence/$(hostname)_w_$(date +%s).txt
# Capture loaded kernel modules
$ lsmod > /evidence/$(hostname)_modules_$(date +%s).txt
# Capture memory image (if forensic tools available)
# LiME for Linux:
$ sudo insmod /path/to/lime.ko "path=/evidence/$(hostname)_memory.lime format=lime"
# Capture system time and timezone (for timeline correlation)
$ date -u > /evidence/$(hostname)_time_$(date +%s).txt
$ timedatectl > /evidence/$(hostname)_timezone_$(date +%s).txt
# Hash all evidence files
$ sha256sum /evidence/$(hostname)_* > /evidence/$(hostname)_hashes.sha256
Order matters for volatile evidence collection. Memory is the most volatile (changes constantly), followed by running processes, network connections, and then disk contents. Collect in order from most volatile to least volatile. Every command you run on the compromised system changes its state (loads libraries, creates processes, allocates memory), so use minimal, pre-compiled static binaries when possible. Better yet, use a forensic toolkit USB drive with trusted tools.
Phase 4: Eradication
Once contained, the threat must be completely removed. Eradication means eliminating every artifact of the attack --- every backdoor, every persistence mechanism, every compromised credential.
Eradication Checklist
Malware Removal:
[ ] All identified malware binaries removed or systems reimaged
[ ] All persistence mechanisms removed:
  [ ] Scheduled tasks / cron jobs
  [ ] Startup scripts / registry Run keys
  [ ] Services / systemd units
  [ ] Web shells
  [ ] Modified system binaries
[ ] All C2 communication channels blocked at firewall and DNS
Credential Reset:
[ ] All compromised user passwords reset
[ ] All potentially compromised service account passwords reset
[ ] All API keys and tokens rotated
[ ] Kerberos KRBTGT password reset (TWICE, with replication between resets)
[ ] All SSH keys rotated on affected systems
[ ] MFA re-enrolled for affected accounts
Vulnerability Remediation:
[ ] Root cause vulnerability patched
[ ] Same vulnerability class checked across all systems
[ ] Configuration weaknesses that enabled spread fixed
[ ] Network segmentation gaps addressed
Verification:
[ ] Clean systems scanned with updated signatures
[ ] No remaining C2 communication observed
[ ] No new persistence mechanisms installed
[ ] IOC sweeps clean across all systems
[ ] Network traffic analysis shows no remaining anomalies
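The "IOC sweeps clean" item can be partly automated: hash files and compare against a known-bad list, where empty output means clean. The sketch below is a minimal illustration of that verification step; the function name and the throwaway sample are hypothetical, and a real sweep would use EDR tooling and cover far more artifact types than file hashes.

```shell
#!/usr/bin/env bash
# Minimal file-hash IOC sweep: empty output means no known-bad hashes found.
# A real sweep also covers processes, persistence locations, and network IOCs.
ioc_sweep() {
  local scan_dir="$1" ioc_file="$2"   # ioc_file: one SHA-256 hash per line
  find "$scan_dir" -type f -exec sha256sum {} + 2>/dev/null |
    grep -F -f "$ioc_file" || true
}

# Demo with a throwaway "malware" sample (hypothetical, for illustration):
tmp=$(mktemp -d)
echo "evil payload" > "$tmp/dropper.bin"
sha256sum "$tmp/dropper.bin" | awk '{print $1}' > "$tmp/iocs.txt"
ioc_sweep "$tmp" "$tmp/iocs.txt"   # flags dropper.bin
```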
Why reset the KRBTGT password twice? The KRBTGT account's key is used to sign Kerberos tickets in Active Directory. If an attacker has obtained the KRBTGT hash (for example via DCSync or by stealing NTDS.dit), they can forge Golden Tickets --- authentication tickets for any user, including domain admins. The password must be reset twice because Active Directory keeps both the current and the previous KRBTGT password. After the first reset, the old compromised hash is still valid as the "previous" password. Only after the second reset (with at least one replication cycle in between) is the compromised hash fully invalidated. Forgetting this step is how organizations get re-compromised weeks after an incident.
Phase 5: Recovery
Recovery is the process of bringing affected systems back to normal operations. This is not just "turn things back on" --- it requires careful verification that restored systems are clean and that the attacker cannot regain access.
Recovery Steps
1. Rebuild from clean images. Do not attempt to "clean" a compromised system in place. Reimage it from a known-good baseline. Rootkits and advanced malware can survive cleaning attempts.
2. Restore data from verified backups. Ensure backups predate the compromise. Restoring from a backup taken after the attacker had access means restoring the backdoor too.
3. Apply all patches. The rebuilt system must be fully patched, including the vulnerability that enabled the initial compromise.
4. Harden configurations. Apply security baselines (CIS benchmarks) during rebuild. The incident is an opportunity to fix configuration weaknesses.
5. Enhanced monitoring. Increase monitoring intensity on recovered systems for at least 30 days. Watch for any sign that the attacker maintained access through a mechanism you missed.
6. Gradual restoration. Do not restore all systems simultaneously. Start with the least critical, verify clean operation, then proceed to more critical systems.
Build an incident recovery checklist specific to your environment:
1. Document your rebuild process for each critical system type (web server, database, domain controller)
2. Verify your golden images are current and stored securely
3. Test a full rebuild-from-image process quarterly
4. Ensure your backup restoration process is documented and tested
5. Identify the maximum tolerable downtime for each critical system (RTO)
6. Identify the maximum acceptable data loss for each system (RPO)
7. Store this documentation outside your primary infrastructure (so it is accessible when the infrastructure is down)
Recovery Metrics and Validation
How do you know when recovery is actually complete? Recovery is not "complete" when systems are back online. It is complete when you have validated that every recovered system is clean, hardened, and monitored.
# Verify rebuilt system matches golden image hash
$ sha256sum /mnt/rebuilt/system.img
# Compare against known-good baseline stored offline
# Verify no unauthorized accounts exist
$ awk -F: '$3 >= 1000 && $1 != "nobody" {print $1}' /etc/passwd
# Cross-reference against HR directory --- every account must map to an active employee
# Verify no unexpected services are running
$ systemctl list-units --type=service --state=running | \
diff - /secure/baseline/expected_services.txt
# Any differences warrant immediate investigation
# Verify no unauthorized SSH keys
$ find /root /home -name "authorized_keys" -exec cat {} \; | \
diff - /secure/baseline/authorized_keys_baseline.txt
# Verify firewall rules match expected policy
$ iptables -L -n --line-numbers > /tmp/current_rules.txt
$ diff /secure/baseline/firewall_rules.txt /tmp/current_rules.txt
# Run vulnerability scan against rebuilt system
$ nmap -sV --script vulners -p- 10.0.1.50
Industry benchmarks from NIST and SANS provide context for recovery timelines:
| Severity | Target Detection Time | Target Containment | Target Recovery |
|----------|----------------------|-------------------|-----------------|
| SEV-1 (critical) | < 1 hour | < 4 hours | < 24 hours |
| SEV-2 (high) | < 4 hours | < 24 hours | < 72 hours |
| SEV-3 (medium) | < 24 hours | < 72 hours | < 1 week |
| SEV-4 (low) | < 1 week | < 2 weeks | Next maintenance window |
The 2024 IBM Cost of a Data Breach Report found that organizations with IR teams and tested IR plans identified breaches 54 days faster (mean of 204 days vs. 258 days) and contained them 68 days faster than those without. Each day of faster identification saved approximately $33,000 in breach costs. The total average cost difference between having and not having an IR plan was $2.66 million.
Most organizations significantly underestimate recovery time. A typical ransomware recovery --- from decision to rebuild through full restoration of services --- takes 2-4 weeks even with good backups. Active Directory recovery alone (rebuilding domain controllers, resetting all credentials including KRBTGT twice with a 12-hour interval, re-establishing trusts) commonly takes 3-5 days.
Forensic Evidence Preservation During Recovery
One of the biggest mistakes during recovery is destroying forensic evidence in the rush to restore services. Every system you rebuild without imaging first is evidence lost forever.
flowchart TD
A["System identified<br/>for recovery"] --> B{"Forensic image<br/>taken?"}
B -->|"No"| C["Create forensic image<br/>dd if=/dev/sda of=evidence.dd"]
B -->|"Yes"| D["Verify image hash<br/>matches original"]
C --> D
D --> E["Document in evidence log:<br/>timestamp, hash, analyst,<br/>storage location"]
E --> F["Rebuild from<br/>clean golden image"]
F --> G["Apply patches and<br/>hardened configuration"]
G --> H["Enhanced monitoring<br/>30 days minimum"]
H --> I{"Clean for<br/>30 days?"}
I -->|"Yes"| J["Move to standard<br/>monitoring"]
I -->|"No / anomaly detected"| K["Re-investigate:<br/>possible missed<br/>persistence mechanism"]
K --> C
style A fill:#e74c3c,color:#fff
style C fill:#f39c12,color:#fff
style F fill:#3498db,color:#fff
style J fill:#27ae60,color:#fff
style K fill:#e74c3c,color:#fff
Image every compromised machine before rebuilding, even if it slows down recovery. The forensic image takes 30-60 minutes per machine with dd or a forensic imager. That is a small delay compared to the alternative: you rebuild, the attacker gets back in through a persistence mechanism you missed, and now you have no evidence of how they maintained access because you overwrote the original disk. Organizations that skip the forensic step sometimes go through three recovery cycles before they finally capture the evidence they need.
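The image-then-verify discipline can be scripted so it is never skipped under pressure. The sketch below copies a source with dd, hashes both sides, and appends to an evidence log. The function name and paths are illustrative, and the demo images a regular file standing in for a block device; on a real system the source would be /dev/sdX and the image would land on external, write-once evidence storage.

```shell
#!/usr/bin/env bash
# Image, hash both sides, and log -- so the forensic step is never skipped.
# Paths are illustrative assumptions for this sketch.
image_and_verify() {
  local src="$1" img="$2" log="$3"
  dd if="$src" of="$img" bs=4M conv=fsync status=none
  local src_hash img_hash
  src_hash=$(sha256sum "$src" | awk '{print $1}')
  img_hash=$(sha256sum "$img" | awk '{print $1}')
  if [ "$src_hash" = "$img_hash" ]; then
    printf '%s IMAGE-OK %s %s\n' "$(date -u +%FT%TZ)" "$img" "$img_hash" >> "$log"
  else
    printf '%s IMAGE-MISMATCH %s\n' "$(date -u +%FT%TZ)" "$img" >> "$log"
    return 1
  fi
}

# Demo against a regular file standing in for a block device:
tmp=$(mktemp -d)
head -c 1048576 /dev/urandom > "$tmp/sdb"
image_and_verify "$tmp/sdb" "$tmp/sdb.dd" "$tmp/evidence.log"
```

The timestamped hash in the log is what lets you later prove the image has not been altered since capture, which matters if the evidence ever reaches court.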
Communication During Incidents
Who tells whom, when, and what they say can make or break an incident response. Poor communication amplifies damage; clear communication contains it.
Internal Communication
sequenceDiagram
participant SOC as SOC Analyst
participant IC as Incident Commander
participant CISO as CISO
participant CEO as CEO / Exec Team
participant Legal as Legal Counsel
participant PR as Communications/PR
participant Staff as All Staff
SOC->>IC: SEV-1 incident detected<br/>Ransomware on file servers
IC->>CISO: Briefing: scope, impact, containment status
IC->>Legal: Notification: potential data breach<br/>Assess regulatory obligations
CISO->>CEO: Executive briefing<br/>Business impact, ETA for resolution
Legal->>IC: Guidance: preserve evidence,<br/>72-hour GDPR clock starts NOW
Note over IC: Decision point: external notification needed?
IC->>PR: Draft customer notification<br/>Draft media holding statement
CEO->>Staff: Internal all-hands:<br/>"We are aware of an incident.<br/>IR team is responding.<br/>Do not discuss externally."
Note over Legal: GDPR: 72 hours from awareness<br/>to notify supervisory authority
Note over Legal: SEC: 4 business days for material<br/>cybersecurity incident (8-K)
Note over Legal: HIPAA: 60 days to notify HHS<br/>if >500 records affected
Note over Legal: State breach notification laws:<br/>vary by state, typically 30-60 days
External Communication Timeline
| When | Who | What |
|---|---|---|
| Immediately | Cyber insurance carrier | Notify of potential claim. They may provide IR resources, legal counsel, and forensic firms |
| Within hours | External legal counsel | Engage breach counsel to manage privilege and regulatory obligations |
| Within hours | Forensic firm (if needed) | Engage third-party IR firm for SEV-1 incidents |
| Within 72 hours (GDPR) | Supervisory authority | Data breach notification if personal data of EU residents affected |
| Within 4 business days (SEC) | SEC filing | 8-K filing for material cybersecurity incidents (public companies) |
| When scope is understood | Affected customers | Clear, honest notification with what happened, what data was affected, and what they should do |
| When ready | Media | Holding statement, then detailed statement. Never speculate publicly |
| When appropriate | Law enforcement | FBI IC3, Secret Service, or local law enforcement |
Everything written during an incident may be discoverable in litigation. Every email, Slack message, and internal document. This is why legal counsel should be engaged immediately for SEV-1 incidents. Communications made "at the direction of legal counsel" for the purpose of obtaining legal advice may be protected by attorney-client privilege. Without this protection, your internal assessment of "we messed up, here's how the attacker got in" becomes plaintiff's exhibit A.
GDPR 72-Hour Notification
Article 33 of GDPR requires notification to the supervisory authority within 72 hours of becoming aware of a personal data breach. The notification must include:
- Nature of the breach (what happened)
- Categories and approximate number of data subjects affected
- Categories and approximate number of personal data records affected
- Name and contact details of the DPO
- Likely consequences of the breach
- Measures taken or proposed to address the breach
The 72-hour clock starts when you become aware of the breach, not when you complete your investigation. You do not need to have all the answers in 72 hours --- GDPR allows phased notification where you provide information as it becomes available. But you must make the initial notification on time. Missing the 72-hour window can result in separate fines on top of any penalties for the breach itself.
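Computing the deadline the moment awareness is declared removes any ambiguity later. A minimal sketch using GNU date; the awareness timestamp is an illustrative example:

```shell
#!/usr/bin/env bash
# GDPR Article 33 deadline from the moment of awareness (GNU date assumed).
gdpr_deadline() {
  local aware_utc="$1"   # e.g. "2024-03-02 02:17:00" (UTC)
  date -u -d "${aware_utc} UTC + 72 hours" +"%Y-%m-%d %H:%M UTC"
}

gdpr_deadline "2024-03-02 02:17:00"   # prints: 2024-03-05 02:17 UTC
```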
Phase 6: Post-Incident Review
The post-incident review is the most valuable phase and the most commonly skipped. Teams are exhausted, management wants to move on, and nobody wants to revisit the worst week of their career. But this is where the real learning happens.
The Blameless Retrospective
The post-incident review must be blameless. The goal is to understand what happened and how to prevent it, not to assign blame. If people fear punishment, they will hide information, and the review will be useless.
Structure:
- Timeline reconstruction: Build a detailed, agreed-upon timeline of events. When was the initial compromise? When was it detected? When was it contained? Include response actions and their timing.
- Root cause analysis: What was the root cause of the incident? Not just "the attacker used phishing" --- go deeper. Why did the phishing email bypass email security? Why did the compromised account have admin access? Why was the vulnerability unpatched for 60 days?
- What went well: What parts of the response worked? What detection rules fired correctly? What processes saved time?
- What needs improvement: What detection gaps existed? What processes slowed the response? What communication broke down?
- Action items: Specific, assigned, time-bound actions. Not "improve monitoring" but "deploy Sysmon to all Windows servers by April 15 (owner: James)."
Google's approach to blameless postmortems is worth studying. Their SRE book describes postmortems that focus on systemic fixes rather than individual blame. The key insight: in complex systems, incidents are rarely caused by a single person's mistake. They result from systemic weaknesses --- inadequate monitoring, unclear procedures, missing safeguards, organizational pressure to move fast. Blaming the person who clicked the phishing email misses the systemic failures that made one click catastrophic: lack of MFA, excessive permissions, missing network segmentation, inadequate backup strategy. Fix the systems, not the people.
Tabletop Exercises
How do you practice incident response without waiting for a real incident? Tabletop exercises. They are the fire drills of cybersecurity.
Designing a Tabletop Exercise
A tabletop exercise is a structured discussion where the IR team walks through a hypothetical incident scenario. No actual systems are touched --- it is purely a discussion exercise. But it reveals gaps in plans, unclear responsibilities, and untested assumptions.
Exercise structure:
1. Scenario introduction: Present a realistic scenario (e.g., "Your SOC receives an alert that ransomware has been detected on three file servers in the finance department at 11 PM on Friday")
2. Injects: At intervals, introduce new information that changes the situation:
   - Inject 1: "The ransomware is spreading to other departments via SMB"
   - Inject 2: "A reporter calls asking about a 'data breach' at your company"
   - Inject 3: "The attacker contacts you and demands $2 million in Bitcoin"
   - Inject 4: "Legal informs you that EU customer data may be affected (GDPR clock starts)"
   - Inject 5: "Your CEO asks if you should pay the ransom"
3. Discussion: For each inject, the team discusses: What do we do? Who is responsible? What information do we need? What are the trade-offs?
4. After-action review: What did we learn? What gaps did we identify? What needs to change in our IR plan?
Sample scenarios to rotate through:
- Ransomware affecting production systems
- Business email compromise resulting in wire fraud
- Insider threat (employee exfiltrating data before departure)
- Supply chain compromise (a vendor's software update contains a backdoor)
- Third-party breach affecting your customers' data
- Zero-day exploitation of a critical public-facing application
- Physical security incident (stolen laptop with unencrypted data)
Run tabletop exercises quarterly. Include executives at least annually --- they need to practice their role in communication and decision-making. The exercise is successful when it reveals something you did not know was a problem.
Ransomware-Specific Response
Ransomware deserves its own response playbook because it presents unique challenges: time pressure, encryption of evidence, potential destruction of backups, and the ransom payment decision.
The Ransom Payment Decision
The question "should we pay?" is not primarily a technical decision --- it is a business, legal, and ethical decision. Here are the considerations:
Arguments for paying:
- Business survival may depend on data recovery
- Insurance may cover the payment
- Some ransomware groups have reliable decryption tools (paradoxically, reliable service encourages future payments)
- Cost of downtime may far exceed ransom amount
Arguments against paying:
- No guarantee of decryption (some groups provide broken decryptors)
- Funds criminal enterprises and incentivizes future attacks
- May violate OFAC sanctions if the group is linked to a sanctioned entity (this can result in civil penalties regardless of your intentions)
- You may be targeted again because you are known to pay
- Payment does not remove the attacker's presence --- they may still have access
The strong recommendation is to never plan on paying. Invest in backups, segmentation, and detection so that payment is never necessary. But some organizations face existential risk if they cannot recover their data, and a blanket "never pay" policy ignores reality. The best defense against the payment decision is making sure you never have to make it.
Ransomware Response Checklist
Immediate (0-1 hours):
[ ] Activate IR plan, assign Incident Commander
[ ] Do NOT reboot or shut down encrypted systems (preserve memory)
[ ] Isolate affected systems from network (disable switch ports)
[ ] Determine if encryption is still spreading
[ ] Preserve at least one encrypted system for forensic analysis
[ ] Check backup integrity: are backups affected?
[ ] Notify cyber insurance carrier
[ ] Engage legal counsel
First 24 hours:
[ ] Identify ransomware variant (ransom note, file extensions, IOCs)
[ ] Check nomoreransom.org for free decryptors
[ ] Determine initial access vector (how did it get in?)
[ ] Assess scope: how many systems, what data affected?
[ ] Begin backup recovery if backups are clean
[ ] Assess regulatory notification requirements
[ ] Executive briefing on scope, impact, recovery timeline
48-72 hours:
[ ] Continue recovery from backups
[ ] Patch the initial access vulnerability
[ ] Reset all potentially compromised credentials
[ ] Deploy enhanced monitoring for attacker persistence
[ ] Begin regulatory notifications if required
[ ] Customer communication if data was exfiltrated
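The "identify ransomware variant" step usually starts with the obvious markers: ransom note filenames and encrypted-file extensions, which are exactly what sites like nomoreransom.org key on. A minimal survey sketch; the ".locked" extension and note filename are illustrative assumptions, since real variants use many different markers:

```shell
#!/usr/bin/env bash
# Quick survey of encryption scope: count encrypted files, locate ransom notes.
# The extension and note name passed in are illustrative assumptions.
survey_encryption() {
  local root="$1" ext="$2" note="$3"
  local count
  count=$(find "$root" -type f -name "*.${ext}" | wc -l | tr -d ' ')
  echo "encrypted_files=${count}"
  find "$root" -type f -iname "$note" | sed 's/^/ransom_note=/'
}

# Demo on a throwaway directory:
tmp=$(mktemp -d)
touch "$tmp/report.docx.locked" "$tmp/q2.xlsx.locked"
echo "pay us" > "$tmp/README_RESTORE.txt"
survey_encryption "$tmp" locked "README_RESTORE.txt"
```

Run it read-only from a trusted toolkit; do not install tools on the affected systems themselves.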
Building Your IR Capability
Where do you start if you have no IR capability today? You start small and build.
Month 1-2: Foundation
- Write a basic IR plan (even a 5-page document is better than nothing)
- Identify your IR team members and alternates
- Set up an out-of-band communication channel
- Identify external resources: legal counsel, forensic firm, insurance
Month 3-4: Detection
- Deploy essential logging (authentication, DNS, process creation)
- Configure 5-10 high-fidelity detection rules in your SIEM
- Set up a phishing report button for employees
- Create runbooks for your most common alert types
Month 5-6: Practice
- Conduct your first tabletop exercise
- Test your backup restoration process
- Validate your containment procedures on a test system
- Review and update your IR plan based on exercise findings
Ongoing:
- Quarterly tabletop exercises with rotating scenarios
- Annual exercises including executive leadership
- Continuous improvement of detection rules and playbooks
- Regular review of external contacts and contracts
What You've Learned
In this chapter, you explored the complete incident response lifecycle:
- NIST SP 800-61 lifecycle: Preparation, Detection and Analysis, Containment, Eradication, Recovery, and Post-Incident Activity form a cyclical process where lessons from each incident improve preparation for the next.
- Preparation is paramount: IR plans, team roles, communication channels, and tabletop exercises must be in place before an incident occurs. The IC, Technical Lead, Communications Lead, and Legal Lead each have distinct responsibilities.
- Incident severity classification: SEV-1 through SEV-4 ensures proportional response. Not every alert warrants an all-hands response; not every incident requires regulatory notification.
- Containment strategy: The isolate-vs-monitor decision is one of the hardest in IR. Immediate isolation stops the bleeding but may alert the attacker and cause them to destroy evidence or activate dormant access. Monitoring first reveals scope but allows continued damage.
- Evidence preservation: Volatile data (memory, processes, connections) must be captured before containment actions. Order matters: most volatile first. Use trusted tools, not tools from the compromised system.
- Eradication completeness: Removing malware is not enough. Credential resets (including KRBTGT twice), persistence mechanism removal, and vulnerability remediation must all be verified.
- Communication plan: Internal communication, regulatory notification (GDPR 72 hours, SEC 4 business days), customer notification, and media communication each have specific timelines and requirements. Legal counsel should be engaged immediately for SEV-1 incidents.
- Blameless retrospectives: Post-incident reviews that focus on systemic improvements rather than individual blame produce better outcomes. Fix the systems, not the people.
- Tabletop exercises: Quarterly exercises with realistic scenarios and injects reveal gaps that cannot be found any other way. Include executives at least annually.
The organizations that handle incidents well are not the ones that never get attacked. They are the ones that prepared, practiced, and built the muscle memory to respond calmly and effectively when the attack inevitably comes. Start building your IR capability today --- even a basic plan is infinitely better than no plan at all. And when you design your first tabletop exercise, make the scenario uncomfortable. If the exercise is easy, it is not realistic enough.