Troubleshooting Methodology
Why This Matters
It is 2am. The monitoring system is screaming. The website is down. Customers are tweeting about it. Your manager is on Slack asking for updates every three minutes. And you have no idea what is wrong.
This is the moment that separates an experienced Linux administrator from a beginner. Not because the experienced admin magically knows the answer, but because they have a systematic approach to finding it. They do not panic. They do not randomly restart services hoping something sticks. They follow a methodology.
Every chapter in this book has taught you specific skills: networking, storage, processes, services, security. This chapter teaches you how to combine those skills under pressure into a systematic troubleshooting process. This is arguably the most important chapter in the book, because real-world Linux work is primarily troubleshooting.
Try This Right Now
The next time something goes wrong on your system, resist the urge to immediately start fixing it. Instead, spend 60 seconds gathering information:
# What is the system's overall health?
$ uptime
$ free -m
$ df -h
$ dmesg | tail -20
# What changed recently?
$ last -10
$ journalctl --since "1 hour ago" -p err
$ rpm -qa --last | head -10 # RHEL-family
$ tail -20 /var/log/apt/history.log # Debian-family (newest entries at the end)
# What is happening right now?
$ top -bn1 | head -20
$ ss -tlnp
$ systemctl --failed
Those commands take under a minute to run and will tell you more than ten minutes of guessing.
The Systematic Troubleshooting Process
┌──────────────────────────────────────────────────────────────┐
│ SYSTEMATIC TROUBLESHOOTING │
│ │
│ 1. DEFINE THE PROBLEM │
│ What exactly is broken? What should be happening? │
│ │ │
│ ▼ │
│ 2. GATHER INFORMATION │
│ Logs, metrics, error messages, recent changes │
│ │ │
│ ▼ │
│ 3. FORM A HYPOTHESIS │
│ Based on evidence, what is the most likely cause? │
│ │ │
│ ▼ │
│ 4. TEST THE HYPOTHESIS │
│ Design a test that proves or disproves your theory │
│ │ │
│ ▼ │
│ 5. IMPLEMENT THE FIX │
│ Apply the solution │
│ │ │
│ ▼ │
│ 6. VERIFY │
│ Confirm the problem is actually resolved │
│ │ │
│ ▼ │
│ 7. DOCUMENT │
│ Record what happened, what caused it, how it was fixed │
│ │
│ If your hypothesis is wrong at step 4, go back to step 3. │
│ Do NOT skip step 6 -- "it seems to work" is not enough. │
│ Do NOT skip step 7 -- future you will thank present you. │
│ │
└──────────────────────────────────────────────────────────────┘
Step 1: Define the Problem
Before you can fix something, you must understand what is broken. Vague problem statements lead to wasted effort.
Bad: "The server is down." Better: "Users cannot load the website. The server responds to ping but port 443 returns connection refused."
Bad: "It's slow." Better: "Page load times increased from 200ms to 8 seconds starting at 14:00 today."
Ask:
- What is the expected behavior?
- What is the actual behavior?
- When did it start?
- Who is affected?
- What changed before it started?
Step 2: Gather Information
This is where your Linux toolbox comes in. Gather facts, not opinions.
# System overview
$ uptime # Load and uptime
$ free -m # Memory usage
$ df -h # Disk usage
$ top -bn1 | head -30 # Top processes
# Recent events
$ journalctl --since "30 min ago" -p err --no-pager
$ dmesg | tail -30 # Kernel messages
$ last -5 # Recent logins
# Service status
$ systemctl status <service> # Specific service
$ systemctl --failed # All failed services
# Network
$ ss -tlnp # Listening ports
$ ip addr # IP addresses
$ ping -c3 <gateway> # Basic connectivity
$ dig <hostname> # DNS resolution
# Recent changes
$ journalctl -u <service> --since "1 hour ago"
$ stat /etc/<config-file> # When was config last changed?
Step 3: Form a Hypothesis
Based on the evidence, propose the most likely cause. Start with the simplest explanation -- is the service running? Is the disk full? Is the network up?
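Those simplest explanations can each be checked in seconds before you reach for anything fancier. A minimal sketch of such a sweep, using nginx as a placeholder for whichever service is implicated:

```shell
# Is the service process even running? (service name is a placeholder)
pgrep -x nginx >/dev/null && echo "service process: running" || echo "service process: NOT running"

# Is the disk full?
df -h / | awk 'NR==2 {print "root filesystem: " $5 " used"}'

# Is there any route to the outside world?
ip route get 1.1.1.1 >/dev/null 2>&1 && echo "network: route exists" || echo "network: no route"
```

If any of these three comes back bad, you already have your first hypothesis.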
Step 4: Test the Hypothesis
Design a test that will either confirm or eliminate your hypothesis. Do not change multiple things at once -- that makes it impossible to know what fixed the problem.
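For example, to test the hypothesis "nothing is listening on port 443", a single command settles it without changing anything else (a sketch; the port is a placeholder):

```shell
# One focused test -- proves or disproves the hypothesis, changes nothing
if ss -tln 2>/dev/null | grep -q ':443 '; then
    echo "hypothesis disproved: a listener exists on 443 -- look elsewhere"
else
    echo "hypothesis supported: nothing listening on 443 -- check the service next"
fi
```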
Step 5: Implement the Fix
Apply the minimum change needed to resolve the issue. Document what you change before you change it.
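"Document what you change before you change it" starts with keeping a copy. A sketch of the habit, demonstrated on a throwaway file (a real config under /etc would need sudo):

```shell
# Stand-in for a real config such as /etc/nginx/nginx.conf
cfg=/tmp/demo.conf
echo "worker_processes auto;" > "$cfg"

# Timestamped backup BEFORE editing -- cheap insurance and a change record
cp -a "$cfg" "$cfg.bak.$(date +%F-%H%M)"
ls "$cfg".bak.*
```

If the fix makes things worse, restoring the backup is your instant rollback.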
Step 6: Verify
Confirm the problem is fully resolved, not just partially. Check from the user's perspective, not just from the server.
Step 7: Document
Write down:
- What the symptoms were
- What caused the problem
- What you did to fix it
- How to prevent it in the future
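The documentation does not have to be elaborate. Even a one-line entry in a shared log file (the path and format here are illustrative) captures all four points well enough for future you:

```shell
# date | symptom | cause | fix -- one line per incident
echo "$(date +%F) | website 500s | nginx OOM-killed (leak in v2.3.1) | rolled back to v2.3.0" \
    >> /tmp/incident-log.txt
tail -1 /tmp/incident-log.txt
```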
The 5 Whys Technique
When you find the immediate cause, keep asking "Why?" to find the root cause.
Problem: The website is down.
Why? → Nginx is not running.
Why? → It crashed due to out-of-memory.
Why? → A memory leak in the PHP application.
Why? → A new deployment introduced a bug in session handling.
Why? → The code change was not reviewed and had no tests.
ROOT CAUSE: Missing code review and test coverage.
FIX: Fix the memory leak AND add code review + tests.
If you only fix "Nginx is not running" by restarting it, the problem will return. The 5 Whys drives you to the real root cause.
Think About It: A server's disk filled up because log files grew too large. "Delete the logs" fixes the immediate problem. What are the 5 Whys, and what is the real fix?
Reading Error Messages and Logs
The most underrated troubleshooting skill is actually reading the error message. Most errors tell you exactly what is wrong if you read carefully.
Common Error Patterns
┌──────────────────────────────────────────────────────────────┐
│ COMMON ERROR MESSAGES AND WHAT THEY MEAN │
│ │
│ "Permission denied" │
│ → File permissions, SELinux/AppArmor, or capability issue │
│ Check: ls -la, getenforce, journalctl for AVC denials │
│ │
│ "No such file or directory" │
│ → Path is wrong, file was deleted, or filesystem not mounted│
│ Check: ls, mount, findmnt │
│ │
│ "Connection refused" │
│ → Service is not running or not listening on that port │
│ Check: systemctl status, ss -tlnp │
│ │
│ "Connection timed out" │
│ → Firewall blocking, network unreachable, or service hung │
│ Check: iptables, ping, traceroute │
│ │
│ "No space left on device" │
│ → Disk full OR inodes exhausted │
│ Check: df -h, df -i │
│ │
│ "Address already in use" │
│ → Another process is using that port │
│ Check: ss -tlnp | grep <port> │
│ │
│ "Name or service not known" │
│ → DNS resolution failure │
│ Check: dig, cat /etc/resolv.conf, resolvectl status │
│ │
│ "Out of memory: Killed process" │
│ → OOM killer terminated a process │
│ Check: dmesg | grep -i oom, journalctl -k │
│ │
│ "Segmentation fault" │
│ → Application bug (accessing invalid memory) │
│ Check: coredump, application logs │
│ │
└──────────────────────────────────────────────────────────────┘
Where to Find Logs
# Systemd journal (most services)
$ journalctl -u <service-name> --since "1 hour ago"
# System-wide errors
$ journalctl -p err --since today
# Kernel messages
$ dmesg | tail -50
# Traditional log files
$ ls /var/log/
$ tail -50 /var/log/syslog # Debian/Ubuntu
$ tail -50 /var/log/messages # RHEL-family
# Application-specific logs
$ tail -50 /var/log/nginx/error.log
$ tail -50 /var/log/postgresql/postgresql-15-main.log
$ tail -50 /var/log/mysql/error.log
Troubleshooting Scenarios
Let us walk through the most common real-world problems with a systematic approach.
Scenario 1: Cannot SSH into a Server
# Step 1: Define the problem
# "SSH connection to 10.0.0.5 hangs and eventually times out"
# Step 2: Gather information
# From another machine that CAN reach the server (or console access):
# Is the machine up?
$ ping -c3 10.0.0.5
# Is SSH listening?
$ ss -tlnp | grep :22
# Is sshd running?
$ systemctl status sshd
# Check firewall
$ sudo iptables -L -n | grep 22
$ sudo firewall-cmd --list-all
# Check SSH config
$ sudo sshd -T | grep -i "listen\|permit\|allow\|deny"
# Check for failed login attempts
$ journalctl -u sshd --since "1 hour ago" | tail -20
# Check if fail2ban blocked the IP
$ sudo fail2ban-client status sshd
Common causes and fixes:
- sshd not running: sudo systemctl start sshd
- Firewall blocking: sudo firewall-cmd --add-service=ssh --permanent && sudo firewall-cmd --reload
- fail2ban banned the IP: sudo fail2ban-client set sshd unbanip <IP>
- /etc/hosts.deny blocking: check and edit /etc/hosts.deny
- Wrong port: check the Port directive in /etc/ssh/sshd_config
- Key authentication failed: check ~/.ssh/authorized_keys permissions (must be 600)
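When you only have the client side, ssh's verbose mode narrows down which of these causes you are facing: the last line printed before the failure shows whether it died at TCP connect (firewall or service down), key exchange, or authentication. A sketch with a placeholder host:

```shell
# BatchMode prevents hanging on a password prompt; ConnectTimeout bounds the wait.
ssh -v -o BatchMode=yes -o ConnectTimeout=5 admin@10.0.0.5 true \
    || echo "connection failed -- read the last -v line to see how far it got"
```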
Scenario 2: Website Down (HTTP Error)
# Step 1: What does the user see?
$ curl -I http://example.com
# Connection refused? 500 error? Timeout?
# Step 2: Check the web server
$ systemctl status nginx
# or
$ systemctl status apache2
# Check error logs
$ tail -30 /var/log/nginx/error.log
# Is it listening?
$ ss -tlnp | grep -E ':80|:443'
# Check config syntax
$ nginx -t
# Check backend application
$ systemctl status myapp
# Check disk space (full disk = can't write logs = crash)
$ df -h
# Check memory (OOM = killed process)
$ free -m
$ dmesg | grep -i oom
Common causes:
- Web server not running (restart it)
- Config syntax error (fix config, run nginx -t)
- Backend application crashed (check app logs)
- Disk full (clean up, check log rotation)
- Permissions changed on document root
- SSL certificate expired (check with openssl s_client)
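For the certificate-expiry case, openssl can check the live certificate directly; the notAfter line is the expiry date (hostname is a placeholder, and the guard handles an unreachable host):

```shell
# Fetch the served certificate and print its validity window
echo | timeout 10 openssl s_client -connect example.com:443 -servername example.com 2>/dev/null \
    | openssl x509 -noout -dates 2>/dev/null \
    || echo "could not retrieve certificate -- check connectivity first"
```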
Scenario 3: Disk Full
# Step 1: Confirm and identify
$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 50G 50G 0 100% /
# Step 2: Find what is using the space
$ sudo du -sh /* 2>/dev/null | sort -rh | head -10
28G /var
12G /home
5G /usr
3G /opt
$ sudo du -sh /var/* | sort -rh | head -5
25G /var/log
2G /var/lib
$ sudo du -sh /var/log/* | sort -rh | head -5
22G /var/log/myapp
2G /var/log/syslog
# Step 3: Identify the specific culprit
$ sudo ls -lhS /var/log/myapp/ | head -5
-rw-r--r-- 1 myapp myapp 20G Jun 15 14:32 application.log
# Step 4: Fix
# Immediate: truncate the file (not delete -- deleting a file held open does not free space)
$ sudo truncate -s 0 /var/log/myapp/application.log
# Long-term: set up log rotation
$ sudo tee /etc/logrotate.d/myapp > /dev/null << 'EOF'
/var/log/myapp/*.log {
daily
rotate 7
compress
missingok
notifempty
copytruncate
maxsize 500M
}
EOF
Safety Warning: If you delete a file that is still open by a process, the space is not freed until the process releases the file handle. Use truncate -s 0 instead, or restart the process after deleting. Check with lsof +L1 to find deleted-but-open files.
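The trap is easy to demonstrate safely (Linux-specific, relying on /proc):

```shell
# Hold a file open, delete it, and observe that the kernel still tracks it.
tmpf=$(mktemp)
exec 3<"$tmpf"         # keep the file open on descriptor 3
rm "$tmpf"             # gone from the directory listing...
ls -l /proc/$$/fd/3    # ...but still visible here, marked "(deleted)"
exec 3<&-              # closing the descriptor finally releases the space
```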
Scenario 4: High Load Average
# Step 1: Check load average
$ uptime
14:32:07 up 30 days, 2 users, load average: 24.5, 22.3, 18.7
# On a 4-CPU system, load > 4 means processes are waiting
# Step 2: Identify what is causing the load
$ top -bn1 | head -20
# Look at CPU% and state columns
# Is it CPU-bound or I/O-bound?
$ vmstat 1 5
# High 'wa' (wait) = I/O bound
# High 'us' (user) or 'sy' (system) = CPU bound
# If I/O bound, check disk I/O
$ iostat -x 1 5
# Look for high %util, high await
# If CPU bound, find the hungry processes
$ ps aux --sort=-%cpu | head -10
# Check for process storms
$ ps aux | wc -l
# If unusually high, something might be forking excessively
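If the count is unusually high, grouping processes by command name shows what is multiplying (a sketch using standard ps output):

```shell
# Count processes per command name; a fork storm shows up as one huge count
ps -eo comm= | sort | uniq -c | sort -rn | head -5
```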
Scenario 5: Service Will Not Start
# Step 1: Check the status
$ systemctl status myservice
# Read the error message -- it usually tells you exactly what is wrong
# Step 2: Check the journal
$ journalctl -u myservice --since "5 min ago" --no-pager
# Step 3: Common causes checklist
# Config syntax error?
$ myservice --check-config # Many services support this
# Missing dependency?
$ systemctl list-dependencies myservice
# Port already in use?
$ ss -tlnp | grep <port>
# Permission issue?
$ ls -la /etc/myservice/
$ ls -la /var/run/myservice/
$ namei -l /var/run/myservice/myservice.sock # Check path permissions
# SELinux blocking?
$ sudo ausearch -m avc --start recent
$ sudo sealert -a /var/log/audit/audit.log
Scenario 6: Network Unreachable
# Step 1: Check local network config
$ ip addr
$ ip route
# Step 2: Can you reach the gateway?
$ ping -c3 $(ip route | grep default | awk '{print $3}')
# Step 3: Can you reach external IPs?
$ ping -c3 1.1.1.1
# Step 4: Is DNS working?
$ dig google.com
# or
$ nslookup google.com
# Step 5: Check for firewall issues
$ sudo iptables -L -n
$ sudo nft list ruleset
# Step 6: Check physical layer
$ ip link show
$ ethtool eth0 | grep -i "link detected"
# Decision tree:
# Can't reach gateway → local network/cable/interface issue
# Can reach gateway but not internet → routing or upstream issue
# Can reach IPs but not names → DNS issue
# Can reach some hosts but not others → firewall or routing issue
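The decision tree above can be sketched as a script that prints the first failing layer (a minimal version; gateway detection assumes a standard default route):

```shell
#!/bin/bash
# Walk the layers bottom-up and report the first one that fails.
gw=$(ip route 2>/dev/null | awk '/^default/ {print $3; exit}')
if [ -z "$gw" ]; then
    echo "no default route -- local config issue"
elif ! ping -c1 -W2 "$gw" >/dev/null 2>&1; then
    echo "cannot reach gateway -- local network/cable/interface issue"
elif ! ping -c1 -W2 1.1.1.1 >/dev/null 2>&1; then
    echo "gateway OK, internet unreachable -- routing or upstream issue"
elif ! getent hosts google.com >/dev/null 2>&1; then
    echo "IP connectivity OK, names fail -- DNS issue"
else
    echo "basic connectivity OK"
fi
```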
Scenario 7: DNS Not Resolving
# Step 1: Confirm DNS is the issue
$ ping 1.1.1.1 # Works? Then network is fine.
$ ping google.com # Fails? DNS is the problem.
# Step 2: Check DNS configuration
$ cat /etc/resolv.conf
$ resolvectl status # systemd-resolved systems
# Step 3: Test DNS directly
$ dig @1.1.1.1 google.com # Use a known-good DNS server
$ dig @$(grep nameserver /etc/resolv.conf | head -1 | awk '{print $2}') google.com
# Step 4: Check if systemd-resolved is running
$ systemctl status systemd-resolved
# Step 5: Common fixes
# Add a working nameserver (note: on systemd-resolved systems,
# /etc/resolv.conf may be a managed symlink -- treat this as temporary)
$ echo "nameserver 1.1.1.1" | sudo tee /etc/resolv.conf
# Restart systemd-resolved
$ sudo systemctl restart systemd-resolved
Scenario 8: Permission Denied
# Step 1: Check standard Unix permissions
$ ls -la /path/to/file
$ id # Who am I?
# Step 2: Check the entire path
$ namei -l /path/to/file
# Every directory in the path needs execute permission
# Step 3: Check ACLs
$ getfacl /path/to/file
# Step 4: Check SELinux/AppArmor
$ ls -Z /path/to/file # SELinux context
$ sudo ausearch -m avc --start recent # Recent SELinux denials
$ sudo aa-status # AppArmor status
# Step 5: Check if running as correct user
$ ps aux | grep <process>
# Is the process running as the user that has permission?
# Step 6: Check capabilities (for privileged operations)
$ getcap /path/to/binary
Think About It: A web server returns "Permission denied" when trying to read files in /var/www/html, but ls -la shows the files are readable by everyone. What else could be blocking access? (Hint: think beyond standard permissions.)
Building a Troubleshooting Toolkit
Keep a cheat sheet of the most useful commands for each category:
┌──────────────────────────────────────────────────────────────┐
│ TROUBLESHOOTING TOOLKIT │
│ │
│ SYSTEM OVERVIEW │
│ • uptime, free -m, df -h, top, vmstat │
│ │
│ PROCESSES │
│ • ps aux, top, htop, pidof, pgrep, kill, strace │
│ │
│ LOGS │
│ • journalctl, dmesg, tail /var/log/* │
│ │
│ NETWORK │
│ • ip addr, ss -tlnp, ping, traceroute, dig, curl │
│ • tcpdump, nmap (when needed) │
│ │
│ DISK │
│ • df -h, df -i, du -sh, lsblk, iostat, lsof │
│ │
│ SERVICES │
│ • systemctl status/start/stop/restart/enable │
│ • systemctl --failed, systemctl list-units │
│ │
│ PERMISSIONS │
│ • ls -la, namei -l, getfacl, ls -Z (SELinux) │
│ │
│ PERFORMANCE │
│ • top, htop, vmstat, iostat, sar, perf │
│ │
│ HISTORY │
│ • last, lastlog, history, journalctl --since │
│ • rpm -qa --last, /var/log/apt/history.log │
│ │
└──────────────────────────────────────────────────────────────┘
Incident Response Basics
When a serious outage occurs, troubleshooting alone is not enough. You need a structured incident response.
The Incident Timeline
┌──────────────────────────────────────────────────────────────┐
│ INCIDENT RESPONSE │
│ │
│ 1. DETECT │
│ Monitoring alert fires, user reports, or you notice │
│ │ │
│ ▼ │
│ 2. TRIAGE │
│ How severe is this? Who is affected? Is it getting worse?│
│ Assign severity: SEV1 (critical), SEV2 (major), │
│ SEV3 (minor), SEV4 (cosmetic) │
│ │ │
│ ▼ │
│ 3. COMMUNICATE │
│ Notify stakeholders. Start an incident channel/thread. │
│ Post regular updates (every 15-30 min for SEV1). │
│ │ │
│ ▼ │
│ 4. MITIGATE │
│ Stop the bleeding. This might mean a rollback, │
│ failover, or temporary workaround -- not a full fix. │
│ │ │
│ ▼ │
│ 5. RESOLVE │
│ Fix the root cause properly. │
│ │ │
│ ▼ │
│ 6. POST-MORTEM │
│ Blameless review of what happened, why, and how to │
│ prevent it from happening again. │
│ │
└──────────────────────────────────────────────────────────────┘
The Golden Rule of Incidents
Mitigate first, investigate later. If rolling back a deployment fixes the problem, do that now. You can figure out why the deployment broke things tomorrow when the pressure is off.
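What "rolling back" looks like depends on how the application is deployed, but the best mitigations are a single reversible action. A sketch using the common current-symlink deployment pattern (paths and version numbers are illustrative):

```shell
# Releases live side by side; "current" points at the live one.
mkdir -p /tmp/releases/v2.3.0 /tmp/releases/v2.3.1
ln -sfn /tmp/releases/v2.3.1 /tmp/current   # the deploy that broke things

# Mitigation = one atomic symlink flip back, then verify:
ln -sfn /tmp/releases/v2.3.0 /tmp/current
readlink /tmp/current                        # → /tmp/releases/v2.3.0
```

Because the flip is atomic and reversible, you can restore service now and debug v2.3.1 when the pressure is off.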
Post-Mortems
After every significant incident, write a post-mortem (also called a "retrospective" or "incident review"). The goal is not to blame anyone -- it is to prevent the same problem from happening again.
Post-Mortem Template
INCIDENT POST-MORTEM
====================
Date: 2025-06-15
Duration: 2 hours 15 minutes
Severity: SEV2
Author: [Your name]
SUMMARY
-------
[One paragraph describing what happened]
TIMELINE
--------
14:00 - Monitoring alert: HTTP 500 error rate > 5%
14:05 - On-call engineer begins investigation
14:12 - Identified: application server OOM killed
14:15 - Attempted restart; server OOM killed again within 2 minutes
14:25 - Identified memory leak in recent deployment (v2.3.1)
14:30 - Rolled back to v2.3.0
14:35 - Service restored, error rate returning to normal
16:15 - Root cause fix deployed (v2.3.2) after code review
ROOT CAUSE
----------
[What actually caused the problem]
A memory leak in the session handler introduced in v2.3.1.
Each user request allocated 2MB of memory that was never freed.
IMPACT
------
[Who and what was affected]
- 45 minutes of degraded service for all users
- 30 minutes of complete outage for checkout flow
- Estimated 200 failed transactions
WHAT WENT WELL
--------------
- Monitoring detected the issue within 5 minutes
- Rollback was quick and effective
- Team communicated clearly throughout
WHAT COULD BE IMPROVED
-----------------------
- Memory leak was not caught in staging because load testing
was skipped for this release
- No automated canary deployment to catch issues early
ACTION ITEMS
------------
[ ] Add memory usage alerts (threshold: 80% for warning)
[ ] Require load testing for all releases
[ ] Implement canary deployment strategy
[ ] Add memory leak detection to CI pipeline
Hands-On: Troubleshooting Practice
Let us simulate a problem and walk through the methodology.
Simulate the problem:
# Create a service that will "break"
$ sudo mkdir -p /opt/scripts
$ sudo tee /opt/scripts/fake-webapp.sh << 'SCRIPT'
#!/bin/bash
# Simulate a web application that creates a log file
while true; do
echo "$(date) - Request processed" >> /tmp/fake-webapp.log
sleep 0.1
done
SCRIPT
$ sudo chmod +x /opt/scripts/fake-webapp.sh
$ sudo tee /etc/systemd/system/fake-webapp.service << 'UNIT'
[Unit]
Description=Fake Web Application
After=network.target
[Service]
ExecStart=/opt/scripts/fake-webapp.sh
Restart=always
User=nobody
[Install]
WantedBy=multi-user.target
UNIT
$ sudo systemctl daemon-reload
$ sudo systemctl start fake-webapp
Now the service is running and writing to /tmp/fake-webapp.log.
Simulate the symptom: "The service is writing too much to disk."
# Step 1: Define the problem
# "fake-webapp is writing to disk continuously"
# Step 2: Gather information
$ systemctl status fake-webapp
$ ls -lh /tmp/fake-webapp.log
# Watch the file grow
$ watch -n1 'ls -lh /tmp/fake-webapp.log'
# Step 3: Hypothesis
# "The application is logging every request with no rotation"
# Step 4: Test
$ tail -5 /tmp/fake-webapp.log
# Confirms: one line every 0.1 seconds = 864,000 lines/day
# Step 5: Fix
# Immediate: stop the bleeding
$ sudo truncate -s 0 /tmp/fake-webapp.log
# Long-term: implement log rotation or reduce log verbosity
# (note: "hourly" takes effect only if logrotate itself runs hourly,
#  e.g. from cron.hourly or a systemd timer)
$ sudo tee /etc/logrotate.d/fake-webapp << 'EOF'
/tmp/fake-webapp.log {
hourly
rotate 4
compress
copytruncate
maxsize 10M
}
EOF
# Step 6: Verify
$ ls -lh /tmp/fake-webapp.log # Should be small again
$ sleep 10 && ls -lh /tmp/fake-webapp.log # Growing slowly
# Step 7: Document
# "fake-webapp writes ~10 lines/second to its log.
# Added logrotate config to cap at 10MB per file, keep 4 rotations.
# TODO: reduce log verbosity to only log errors, not every request."
# Cleanup
$ sudo systemctl stop fake-webapp
$ sudo systemctl disable fake-webapp
$ sudo rm /etc/systemd/system/fake-webapp.service
$ sudo systemctl daemon-reload
Debug This: Multi-Symptom Scenario
A developer reports multiple issues on a production server:
- "The app is slow"
- "Some pages show 500 errors"
- "Cron jobs are failing"
All three symptoms started at approximately the same time. How do you approach this?
Methodology: Look for a single root cause that explains all symptoms.
# Check disk space first (explains slow + errors + cron failures)
$ df -h
/dev/sda1 50G 50G 0 100% /
# Bingo! A full disk explains everything:
# - App is slow: can't write to disk (temp files, sessions)
# - 500 errors: can't write logs or temporary data
# - Cron failures: can't create lock files or write output
# Find the culprit
$ sudo du -sh /var/* | sort -rh | head -5
# Fix it
$ sudo truncate -s 0 /var/log/huge-log-file.log
# Verify all three symptoms are resolved
$ curl -s -o /dev/null -w "%{http_code}" http://localhost/
200
$ sudo -u cronuser crontab -l | head -1
# Run a test cron job manually
# Prevent recurrence
# Set up disk space monitoring and log rotation
Lesson: When multiple seemingly unrelated things break simultaneously, look for a single common cause. Disk full, memory exhaustion, and network outages are the most common culprits that produce cascading failures.
What Just Happened?
┌──────────────────────────────────────────────────────────────┐
│ CHAPTER 76 RECAP │
│──────────────────────────────────────────────────────────────│
│ │
│ Systematic troubleshooting methodology: │
│ 1. Define the problem (be specific) │
│ 2. Gather information (logs, metrics, status) │
│ 3. Form a hypothesis (start simple) │
│ 4. Test the hypothesis (change one thing at a time) │
│ 5. Implement the fix │
│ 6. Verify (from the user's perspective) │
│ 7. Document (for future you and your team) │
│ │
│ Key principles: │
│ • Read the error message -- it usually tells you what │
│ • Use the 5 Whys to find root causes │
│ • Mitigate first, investigate later during outages │
│ • Multiple symptoms often share a single root cause │
│ • Write blameless post-mortems after incidents │
│ │
│ Essential toolkit: │
│ • System: uptime, free, df, top, vmstat │
│ • Logs: journalctl, dmesg, /var/log/* │
│ • Network: ip, ss, ping, dig, curl, traceroute │
│ • Services: systemctl, systemctl --failed │
│ • Disk: du, lsblk, lsof, iostat │
│ • Permissions: ls -la, namei, getfacl, ausearch │
│ │
│ The best troubleshooters are systematic, not lucky. │
│ │
└──────────────────────────────────────────────────────────────┘
Try This
Exercise 1: Error Message Drill
For each error message below, write down three things you would check immediately:
- nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address already in use)
- FATAL: password authentication failed for user "webapp"
- bash: /opt/app/run.sh: Permission denied
- kernel: Out of memory: Killed process 1234 (java)
- sshd: Connection closed by 10.0.0.5 port 22 [preauth]
Exercise 2: Simulate and Fix
Create a problem on a test system and fix it using the systematic methodology:
- Fill a filesystem to 100% and observe what breaks
- Create a service with a broken config file and troubleshoot why it will not start
- Block a port with iptables and diagnose the resulting connection failures
Exercise 3: Write a Post-Mortem
Think of a real incident you have experienced (even a minor one like a laptop crashing or a personal project failing). Write a post-mortem using the template from this chapter. Focus on action items that would prevent recurrence.
Exercise 4: Build a Runbook
Create a troubleshooting runbook for a service you manage. Include:
- Common failure modes and their symptoms
- Step-by-step diagnostic commands for each failure mode
- Known fixes and workarounds
- Escalation path (who to contact if you cannot fix it)
Bonus Challenge
Set up a "game day" on a test system. Have a colleague (or write a script that) introduces a problem -- full disk, killed service, wrong DNS config, firewall rule, permissions change -- and practice diagnosing it under a timer. Record your time-to-diagnosis and track improvement over multiple rounds. This is how the best ops teams train.