Troubleshooting Methodology

Why This Matters

It is 2am. The monitoring system is screaming. The website is down. Customers are tweeting about it. Your manager is on Slack asking for updates every three minutes. And you have no idea what is wrong.

This is the moment that separates an experienced Linux administrator from a beginner. Not because the experienced admin magically knows the answer, but because they have a systematic approach to finding it. They do not panic. They do not randomly restart services hoping something sticks. They follow a methodology.

Every chapter in this book has taught you specific skills: networking, storage, processes, services, security. This chapter teaches you how to combine those skills under pressure into a systematic troubleshooting process. This is arguably the most important chapter in the book, because real-world Linux work is primarily troubleshooting.


Try This Right Now

The next time something goes wrong on your system, resist the urge to immediately start fixing it. Instead, spend 60 seconds gathering information:

# What is the system's overall health?
$ uptime
$ free -m
$ df -h
$ dmesg | tail -20

# What changed recently?
$ last -10
$ journalctl --since "1 hour ago" -p err
$ rpm -qa --last | head -10     # RHEL-family
$ tail -20 /var/log/apt/history.log   # Debian-family (zcat the rotated .gz files for older entries)

# What is happening right now?
$ top -bn1 | head -20
$ ss -tlnp
$ systemctl --failed

Those commands take under a minute to run and will tell you more than ten minutes of guessing.
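If you find yourself typing these commands often, they can live in one script you keep on every host. A minimal sketch -- the section labels are illustrative, and commands that may be missing or need root are guarded rather than assumed:

```shell
#!/bin/bash
# triage.sh -- one-shot health snapshot, safe to run repeatedly.
section() { printf '\n===== %s =====\n' "$1"; }

section "LOAD / UPTIME"; uptime 2>/dev/null
section "MEMORY";        free -m 2>/dev/null
section "DISK";          df -h
section "FAILED UNITS";  command -v systemctl >/dev/null && systemctl --failed --no-pager
section "RECENT ERRORS"; command -v journalctl >/dev/null && \
    journalctl -p err --since "1 hour ago" --no-pager | tail -20
section "KERNEL";        dmesg 2>/dev/null | tail -10
```

Run it first, read it second; the point is to have the facts in front of you before you form an opinion.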


The Systematic Troubleshooting Process

┌──────────────────────────────────────────────────────────────┐
│           SYSTEMATIC TROUBLESHOOTING                         │
│                                                              │
│  1. DEFINE THE PROBLEM                                       │
│     What exactly is broken? What should be happening?        │
│                    │                                         │
│                    ▼                                         │
│  2. GATHER INFORMATION                                       │
│     Logs, metrics, error messages, recent changes            │
│                    │                                         │
│                    ▼                                         │
│  3. FORM A HYPOTHESIS                                        │
│     Based on evidence, what is the most likely cause?        │
│                    │                                         │
│                    ▼                                         │
│  4. TEST THE HYPOTHESIS                                      │
│     Design a test that proves or disproves your theory       │
│                    │                                         │
│                    ▼                                         │
│  5. IMPLEMENT THE FIX                                        │
│     Apply the solution                                       │
│                    │                                         │
│                    ▼                                         │
│  6. VERIFY                                                   │
│     Confirm the problem is actually resolved                 │
│                    │                                         │
│                    ▼                                         │
│  7. DOCUMENT                                                 │
│     Record what happened, what caused it, how it was fixed   │
│                                                              │
│  If your hypothesis is wrong at step 4, go back to step 3.  │
│  Do NOT skip step 6 -- "it seems to work" is not enough.    │
│  Do NOT skip step 7 -- future you will thank present you.   │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Step 1: Define the Problem

Before you can fix something, you must understand what is broken. Vague problem statements lead to wasted effort.

Bad: "The server is down." Better: "Users cannot load the website. The server responds to ping but port 443 returns connection refused."

Bad: "It's slow." Better: "Page load times increased from 200ms to 8 seconds starting at 14:00 today."

Ask:

  • What is the expected behavior?
  • What is the actual behavior?
  • When did it start?
  • Who is affected?
  • What changed before it started?

Step 2: Gather Information

This is where your Linux toolbox comes in. Gather facts, not opinions.

# System overview
$ uptime                           # Load and uptime
$ free -m                          # Memory usage
$ df -h                            # Disk usage
$ top -bn1 | head -30              # Top processes

# Recent events
$ journalctl --since "30 min ago" -p err --no-pager
$ dmesg | tail -30                 # Kernel messages
$ last -5                          # Recent logins

# Service status
$ systemctl status <service>       # Specific service
$ systemctl --failed               # All failed services

# Network
$ ss -tlnp                         # Listening ports
$ ip addr                          # IP addresses
$ ping -c3 <gateway>               # Basic connectivity
$ dig <hostname>                   # DNS resolution

# Recent changes
$ journalctl -u <service> --since "1 hour ago"
$ stat /etc/<config-file>          # When was config last changed?

Step 3: Form a Hypothesis

Based on the evidence, propose the most likely cause. Start with the simplest explanation -- is the service running? Is the disk full? Is the network up?
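Those three questions can themselves be scripted as a first pass. A sketch -- the service name is a placeholder and the 95% threshold is an arbitrary cutoff, not a standard value:

```shell
#!/bin/bash
# First-pass checks for the three cheapest hypotheses.
quick_checks() {
    svc=$1
    # Hypothesis 1: disk (nearly) full?
    df -h --output=pcent,target 2>/dev/null |
        awk 'NR>1 && $1+0 >= 95 {print "disk nearly full on " $2}'
    # Hypothesis 2: service not running?
    if command -v systemctl >/dev/null; then
        systemctl is-active --quiet "$svc" 2>/dev/null || echo "$svc is not running"
    fi
    # Hypothesis 3: no network route?
    ip route 2>/dev/null | grep -q '^default' || echo "no default route"
}
```

Run it as, say, quick_checks nginx; silence means none of the easy explanations apply and you need a better hypothesis.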

Step 4: Test the Hypothesis

Design a test that will either confirm or eliminate your hypothesis. Do not change multiple things at once -- that makes it impossible to know what fixed the problem.

Step 5: Implement the Fix

Apply the minimum change needed to resolve the issue. Document what you change before you change it.
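One habit that makes "document before you change" painless is a tiny helper that snapshots a file before you touch it (the name backup is just a suggestion):

```shell
#!/bin/sh
# Copy a file aside with a timestamp before editing it, preserving
# permissions and ownership (-a), so any change can be reverted.
backup() {
    cp -a "$1" "$1.bak.$(date +%Y%m%d-%H%M%S)"
}
```

For example, backup /etc/nginx/nginx.conf before opening the file in an editor; if the fix makes things worse, you can diff against the snapshot and roll back in seconds.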

Step 6: Verify

Confirm the problem is fully resolved, not just partially. Check from the user's perspective, not just from the server.
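For a web-facing service, "the user's perspective" usually means an end-to-end HTTP check rather than systemctl status. A sketch using curl -- the URL and timeout are placeholders for your own endpoint:

```shell
#!/bin/bash
# Pass/fail from the outside: fetch the page the user would fetch and
# check the status code, not just whether the process exists.
verify_http() {
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1")
    if [ "$code" = "200" ]; then
        echo "OK   $1"
    else
        echo "FAIL $1 (got ${code:-no response})"
        return 1
    fi
}
```

A process can be alive while the site is still broken (bad upstream, wrong vhost, expired certificate); this check catches those cases.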

Step 7: Document

Write down:

  • What the symptoms were
  • What caused the problem
  • What you did to fix it
  • How to prevent it in the future

The 5 Whys Technique

When you find the immediate cause, keep asking "Why?" to find the root cause.

Problem: The website is down.
  Why? → Nginx is not running.
    Why? → It crashed due to out-of-memory.
      Why? → A memory leak in the PHP application.
        Why? → A new deployment introduced a bug in session handling.
          Why? → The code change was not reviewed and had no tests.

ROOT CAUSE: Missing code review and test coverage.
FIX: Fix the memory leak AND add code review + tests.

If you only fix "Nginx is not running" by restarting it, the problem will return. The 5 Whys drives you to the real root cause.

Think About It: A server's disk filled up because log files grew too large. "Delete the logs" fixes the immediate problem. What are the 5 Whys, and what is the real fix?


Reading Error Messages and Logs

The most underrated troubleshooting skill is actually reading the error message. Most errors tell you exactly what is wrong if you read carefully.

Common Error Patterns

┌──────────────────────────────────────────────────────────────┐
│              COMMON ERROR MESSAGES AND WHAT THEY MEAN         │
│                                                              │
│  "Permission denied"                                         │
│  → File permissions, SELinux/AppArmor, or capability issue   │
│    Check: ls -la, getenforce, journalctl for AVC denials     │
│                                                              │
│  "No such file or directory"                                 │
│  → Path is wrong, file was deleted, or filesystem not mounted│
│    Check: ls, mount, findmnt                                 │
│                                                              │
│  "Connection refused"                                        │
│  → Service is not running or not listening on that port      │
│    Check: systemctl status, ss -tlnp                         │
│                                                              │
│  "Connection timed out"                                      │
│  → Firewall blocking, network unreachable, or service hung   │
│    Check: iptables, ping, traceroute                         │
│                                                              │
│  "No space left on device"                                   │
│  → Disk full OR inodes exhausted                             │
│    Check: df -h, df -i                                       │
│                                                              │
│  "Address already in use"                                    │
│  → Another process is using that port                        │
│    Check: ss -tlnp | grep <port>                             │
│                                                              │
│  "Name or service not known"                                 │
│  → DNS resolution failure                                    │
│    Check: dig, cat /etc/resolv.conf, systemd-resolve --status│
│                                                              │
│  "Out of memory: Killed process"                             │
│  → OOM killer terminated a process                           │
│    Check: dmesg | grep -i oom, journalctl -k                 │
│                                                              │
│  "Segmentation fault"                                        │
│  → Application bug (accessing invalid memory)                │
│    Check: coredump, application logs                         │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Where to Find Logs

# Systemd journal (most services)
$ journalctl -u <service-name> --since "1 hour ago"

# System-wide errors
$ journalctl -p err --since today

# Kernel messages
$ dmesg | tail -50

# Traditional log files
$ ls /var/log/
$ tail -50 /var/log/syslog           # Debian/Ubuntu
$ tail -50 /var/log/messages          # RHEL-family

# Application-specific logs
$ tail -50 /var/log/nginx/error.log
$ tail -50 /var/log/postgresql/postgresql-15-main.log
$ tail -50 /var/log/mysql/error.log

Troubleshooting Scenarios

Let us walk through the most common real-world problems with a systematic approach.

Scenario 1: Cannot SSH into a Server

# Step 1: Define the problem
# "SSH connection to 10.0.0.5 hangs and eventually times out"

# Step 2: Gather information
# From another machine that CAN reach the server (or console access):

# Is the machine up?
$ ping -c3 10.0.0.5

# Is SSH listening?
$ ss -tlnp | grep :22

# Is sshd running?
$ systemctl status sshd

# Check firewall
$ sudo iptables -L -n | grep 22
$ sudo firewall-cmd --list-all

# Check SSH config
$ sudo sshd -T | grep -i "listen\|permit\|allow\|deny"

# Check for failed login attempts
$ journalctl -u sshd --since "1 hour ago" | tail -20

# Check if fail2ban blocked the IP
$ sudo fail2ban-client status sshd

Common causes and fixes:

  • sshd not running: sudo systemctl start sshd
  • Firewall blocking: sudo firewall-cmd --add-service=ssh --permanent && sudo firewall-cmd --reload
  • fail2ban banned the IP: sudo fail2ban-client set sshd unbanip <IP>
  • /etc/hosts.deny blocking: Check and edit
  • Wrong port: Check Port directive in /etc/ssh/sshd_config
  • Key authentication failed: Check permissions -- ~/.ssh must be 700 and ~/.ssh/authorized_keys 600
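Because sshd silently ignores keys when permissions are too loose, the last item is worth a small helper. A sketch that assumes you pass in the user's home directory:

```shell
#!/bin/sh
# Reset the permissions sshd requires before it will trust a key:
# ~/.ssh must be 700 and authorized_keys 600 (the home directory
# itself must also not be group/world writable).
fix_ssh_perms() {
    d="$1/.ssh"
    chmod 700 "$d" && chmod 600 "$d/authorized_keys"
}
```

For example, fix_ssh_perms /home/alice, run as root or as that user.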

Scenario 2: Website Down (HTTP Error)

# Step 1: What does the user see?
$ curl -I http://example.com
# Connection refused? 500 error? Timeout?

# Step 2: Check the web server
$ systemctl status nginx
# or
$ systemctl status apache2

# Check error logs
$ tail -30 /var/log/nginx/error.log

# Is it listening?
$ ss -tlnp | grep -E ':80|:443'

# Check config syntax
$ nginx -t

# Check backend application
$ systemctl status myapp

# Check disk space (full disk = can't write logs = crash)
$ df -h

# Check memory (OOM = killed process)
$ free -m
$ dmesg | grep -i oom

Common causes:

  • Web server not running (restart it)
  • Config syntax error (fix config, run nginx -t)
  • Backend application crashed (check app logs)
  • Disk full (clean up, check log rotation)
  • Permissions changed on document root
  • SSL certificate expired (check with openssl s_client)
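The expired-certificate case can be checked without a browser. A sketch that reports days until a PEM certificate file expires (GNU date assumed):

```shell
#!/bin/bash
# Print days until the certificate in the given PEM file expires.
cert_days_left() {
    end=$(openssl x509 -enddate -noout -in "$1" | cut -d= -f2)
    echo $(( ($(date -d "$end" +%s) - $(date +%s)) / 86400 ))
}
```

For a live site, the classic one-liner fetches the certificate first: echo | openssl s_client -connect example.com:443 -servername example.com 2>/dev/null | openssl x509 -noout -enddate.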

Scenario 3: Disk Full

# Step 1: Confirm and identify
$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G   50G     0 100% /

# Step 2: Find what is using the space
$ sudo du -sh /* 2>/dev/null | sort -rh | head -10
28G     /var
12G     /home
5G      /usr
3G      /opt

$ sudo du -sh /var/* | sort -rh | head -5
25G     /var/log
2G      /var/lib

$ sudo du -sh /var/log/* | sort -rh | head -5
22G     /var/log/myapp
2G      /var/log/syslog

# Step 3: Identify the specific culprit
$ sudo ls -lhS /var/log/myapp/ | head -5
-rw-r--r-- 1 myapp myapp 20G Jun 15 14:32 application.log

# Step 4: Fix
# Immediate: truncate the file (not delete -- deleting a file held open does not free space)
$ sudo truncate -s 0 /var/log/myapp/application.log

# Long-term: set up log rotation
$ sudo tee /etc/logrotate.d/myapp > /dev/null << 'EOF'
/var/log/myapp/*.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    copytruncate
    maxsize 500M
}
EOF

Safety Warning: If you delete a file that is still open by a process, the space is not freed until the process releases the file handle. Use truncate -s 0 instead, or restart the process after deleting. Check with lsof +L1 to find deleted-but-open files.
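If lsof is not installed, the same deleted-but-open files can be found straight from /proc, where the kernel marks an unlinked target with "(deleted)" in the fd symlink:

```shell
#!/bin/sh
# List file descriptors whose target has been deleted but is still
# open (and therefore still consuming disk space).
deleted_open() {
    for fd in /proc/[0-9]*/fd/*; do
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            *'(deleted)') printf '%s -> %s\n' "$fd" "$target" ;;
        esac
    done
}
```

The path tells you which process (the PID in /proc/PID/fd) to restart in order to release the space.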

Scenario 4: High Load Average

# Step 1: Check load average
$ uptime
 14:32:07 up 30 days,  2 users,  load average: 24.5, 22.3, 18.7
# On a 4-CPU system, load > 4 means processes are waiting

# Step 2: Identify what is causing the load
$ top -bn1 | head -20
# Look at CPU% and state columns

# Is it CPU-bound or I/O-bound?
$ vmstat 1 5
# High 'wa' (wait) = I/O bound
# High 'us' (user) or 'sy' (system) = CPU bound

# If I/O bound, check disk I/O
$ iostat -x 1 5
# Look for high %util, high await

# If CPU bound, find the hungry processes
$ ps aux --sort=-%cpu | head -10

# Check for process storms
$ ps aux | wc -l
# If unusually high, something might be forking excessively
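The numbers vmstat reports come from /proc/stat, so the CPU-vs-I/O question can also be answered by a short script. A sketch -- the 30% and 70% thresholds are illustrative cutoffs, not standard values:

```shell
#!/bin/bash
# CPU-bound or I/O-bound? Compute the same us/sy/wa split vmstat
# shows, from two samples of the first line of /proc/stat.
cpu_vs_io() {
    read -r _ u1 n1 s1 i1 w1 _ < /proc/stat
    sleep "${1:-1}"
    read -r _ u2 n2 s2 i2 w2 _ < /proc/stat
    busy=$(( (u2 - u1) + (n2 - n1) + (s2 - s1) ))   # user+nice+system
    iow=$((  w2 - w1 ))                              # iowait
    total=$(( busy + iow + (i2 - i1) ))              # + idle
    [ "$total" -gt 0 ] || total=1
    if   [ $(( iow  * 100 / total )) -ge 30 ]; then echo "io-bound"
    elif [ $(( busy * 100 / total )) -ge 70 ]; then echo "cpu-bound"
    else echo "mostly idle"
    fi
}
```

The answer determines your next tool: iostat and lsof for I/O-bound, ps --sort=-%cpu for CPU-bound.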

Scenario 5: Service Will Not Start

# Step 1: Check the status
$ systemctl status myservice
# Read the error message -- it usually tells you exactly what is wrong

# Step 2: Check the journal
$ journalctl -u myservice --since "5 min ago" --no-pager

# Step 3: Common causes checklist
# Config syntax error? Most daemons have a config-check mode:
$ nginx -t                    # nginx
$ sshd -t                     # OpenSSH
$ apachectl configtest        # Apache

# Missing dependency?
$ systemctl list-dependencies myservice

# Port already in use?
$ ss -tlnp | grep <port>

# Permission issue?
$ ls -la /etc/myservice/
$ ls -la /var/run/myservice/
$ namei -l /var/run/myservice/myservice.sock   # Check path permissions

# SELinux blocking?
$ sudo ausearch -m avc --start recent
$ sudo sealert -a /var/log/audit/audit.log

Scenario 6: Network Unreachable

# Step 1: Check local network config
$ ip addr
$ ip route

# Step 2: Can you reach the gateway?
$ ping -c3 $(ip route | grep default | awk '{print $3}')

# Step 3: Can you reach external IPs?
$ ping -c3 1.1.1.1

# Step 4: Is DNS working?
$ dig google.com
# or
$ nslookup google.com

# Step 5: Check for firewall issues
$ sudo iptables -L -n
$ sudo nft list ruleset

# Step 6: Check physical layer
$ ip link show
$ ethtool eth0 | grep -i "link detected"

# Decision tree:
# Can't reach gateway → local network/cable/interface issue
# Can reach gateway but not internet → routing or upstream issue
# Can reach IPs but not names → DNS issue
# Can reach some hosts but not others → firewall or routing issue
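The decision tree above can be mechanized. A sketch -- the targets 1.1.1.1 and example.com are illustrative, and getent stands in for dig so only glibc is needed:

```shell
#!/bin/bash
# Walk the decision tree: gateway first, then a raw IP, then DNS.
diagnose_network() {
    gw=$(ip route 2>/dev/null | awk '/^default/ {print $3; exit}')
    if [ -n "$gw" ] && ! ping -c1 -W2 "$gw" >/dev/null 2>&1; then
        echo "local"      # gateway unreachable: interface/cable/local net
    elif ! ping -c1 -W2 1.1.1.1 >/dev/null 2>&1; then
        echo "upstream"   # gateway OK, internet IP unreachable: routing
    elif ! getent hosts example.com >/dev/null 2>&1; then
        echo "dns"        # IPs reachable, names not: resolver problem
    else
        echo "ok"
    fi
}
```

Each output maps to a different next step, which is the whole point of a decision tree: it tells you which layer to investigate instead of guessing.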

Scenario 7: DNS Not Resolving

# Step 1: Confirm DNS is the issue
$ ping 1.1.1.1           # Works? Then network is fine.
$ ping google.com         # Fails? DNS is the problem.

# Step 2: Check DNS configuration
$ cat /etc/resolv.conf
$ resolvectl status       # systemd-resolved systems

# Step 3: Test DNS directly
$ dig @1.1.1.1 google.com    # Use a known-good DNS server
$ dig @$(grep nameserver /etc/resolv.conf | head -1 | awk '{print $2}') google.com

# Step 4: Check if systemd-resolved is running
$ systemctl status systemd-resolved

# Step 5: Common fixes
# Add a working nameserver (note: on systemd-resolved systems,
# /etc/resolv.conf is a managed symlink and this edit may be overwritten)
$ echo "nameserver 1.1.1.1" | sudo tee /etc/resolv.conf

# Restart systemd-resolved
$ sudo systemctl restart systemd-resolved
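Steps 1 and 3 can be rolled into a single test that separates "resolver broken" from "network broken". A sketch -- getent uses the system resolver via nsswitch, and 1.1.1.1 is just a convenient known-good server:

```shell
#!/bin/sh
# Is it DNS, or is it the network?
dns_check() {
    if getent hosts "$1" >/dev/null 2>&1; then
        echo "system resolver OK for $1"
    elif command -v dig >/dev/null && dig +short @1.1.1.1 "$1" | grep -q .; then
        echo "system resolver broken, but 1.1.1.1 answers for $1"
    else
        echo "no resolution at all -- check the network before DNS config"
    fi
}
```

The second branch is the interesting one: if a direct query to a known-good server works while the system resolver fails, the problem is in /etc/resolv.conf or systemd-resolved, not upstream.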

Scenario 8: Permission Denied

# Step 1: Check standard Unix permissions
$ ls -la /path/to/file
$ id                    # Who am I?

# Step 2: Check the entire path
$ namei -l /path/to/file
# Every directory in the path needs execute permission

# Step 3: Check ACLs
$ getfacl /path/to/file

# Step 4: Check SELinux/AppArmor
$ ls -Z /path/to/file                    # SELinux context
$ sudo ausearch -m avc --start recent    # Recent SELinux denials
$ sudo aa-status                          # AppArmor status

# Step 5: Check if running as correct user
$ ps aux | grep <process>
# Is the process running as the user that has permission?

# Step 6: Check capabilities (for privileged operations)
$ getcap /path/to/binary

Think About It: A web server returns "Permission denied" when trying to read files in /var/www/html, but ls -la shows the files are readable by everyone. What else could be blocking access? (Hint: think beyond standard permissions.)


Building a Troubleshooting Toolkit

Keep a cheat sheet of the most useful commands for each category:

┌──────────────────────────────────────────────────────────────┐
│              TROUBLESHOOTING TOOLKIT                          │
│                                                              │
│  SYSTEM OVERVIEW                                             │
│  • uptime, free -m, df -h, top, vmstat                       │
│                                                              │
│  PROCESSES                                                   │
│  • ps aux, top, htop, pidof, pgrep, kill, strace             │
│                                                              │
│  LOGS                                                        │
│  • journalctl, dmesg, tail /var/log/*                        │
│                                                              │
│  NETWORK                                                     │
│  • ip addr, ss -tlnp, ping, traceroute, dig, curl            │
│  • tcpdump, nmap (when needed)                               │
│                                                              │
│  DISK                                                        │
│  • df -h, df -i, du -sh, lsblk, iostat, lsof                │
│                                                              │
│  SERVICES                                                    │
│  • systemctl status/start/stop/restart/enable                │
│  • systemctl --failed, systemctl list-units                  │
│                                                              │
│  PERMISSIONS                                                 │
│  • ls -la, namei -l, getfacl, ls -Z (SELinux)               │
│                                                              │
│  PERFORMANCE                                                 │
│  • top, htop, vmstat, iostat, sar, perf                      │
│                                                              │
│  HISTORY                                                     │
│  • last, lastlog, history, journalctl --since                │
│  • rpm -qa --last, /var/log/apt/history.log                  │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Incident Response Basics

When a serious outage occurs, troubleshooting alone is not enough. You need a structured incident response.

The Incident Timeline

┌──────────────────────────────────────────────────────────────┐
│                  INCIDENT RESPONSE                           │
│                                                              │
│  1. DETECT                                                   │
│     Monitoring alert fires, user reports, or you notice      │
│                    │                                         │
│                    ▼                                         │
│  2. TRIAGE                                                   │
│     How severe is this? Who is affected? Is it getting worse?│
│     Assign severity: SEV1 (critical), SEV2 (major),         │
│                      SEV3 (minor), SEV4 (cosmetic)          │
│                    │                                         │
│                    ▼                                         │
│  3. COMMUNICATE                                              │
│     Notify stakeholders. Start an incident channel/thread.   │
│     Post regular updates (every 15-30 min for SEV1).        │
│                    │                                         │
│                    ▼                                         │
│  4. MITIGATE                                                 │
│     Stop the bleeding. This might mean a rollback,          │
│     failover, or temporary workaround -- not a full fix.    │
│                    │                                         │
│                    ▼                                         │
│  5. RESOLVE                                                  │
│     Fix the root cause properly.                             │
│                    │                                         │
│                    ▼                                         │
│  6. POST-MORTEM                                              │
│     Blameless review of what happened, why, and how to      │
│     prevent it from happening again.                        │
│                                                              │
└──────────────────────────────────────────────────────────────┘

The Golden Rule of Incidents

Mitigate first, investigate later. If rolling back a deployment fixes the problem, do that now. You can figure out why the deployment broke things tomorrow when the pressure is off.


Post-Mortems

After every significant incident, write a post-mortem (also called a "retrospective" or "incident review"). The goal is not to blame anyone -- it is to prevent the same problem from happening again.

Post-Mortem Template

INCIDENT POST-MORTEM
====================
Date: 2025-06-15
Duration: 2 hours 15 minutes
Severity: SEV2
Author: [Your name]

SUMMARY
-------
[One paragraph describing what happened]

TIMELINE
--------
14:00 - Monitoring alert: HTTP 500 error rate > 5%
14:05 - On-call engineer begins investigation
14:12 - Identified: application server OOM killed
14:15 - Attempted restart; server OOM killed again within 2 minutes
14:25 - Identified memory leak in recent deployment (v2.3.1)
14:30 - Rolled back to v2.3.0
14:35 - Service restored, error rate returning to normal
16:15 - Root cause fix deployed (v2.3.2) after code review

ROOT CAUSE
----------
[What actually caused the problem]
A memory leak in the session handler introduced in v2.3.1.
Each user request allocated 2MB of memory that was never freed.

IMPACT
------
[Who and what was affected]
- 45 minutes of degraded service for all users
- 30 minutes of complete outage for checkout flow
- Estimated 200 failed transactions

WHAT WENT WELL
--------------
- Monitoring detected the issue within 5 minutes
- Rollback was quick and effective
- Team communicated clearly throughout

WHAT COULD BE IMPROVED
-----------------------
- Memory leak was not caught in staging because load testing
  was skipped for this release
- No automated canary deployment to catch issues early

ACTION ITEMS
------------
[ ] Add memory usage alerts (threshold: 80% for warning)
[ ] Require load testing for all releases
[ ] Implement canary deployment strategy
[ ] Add memory leak detection to CI pipeline

Hands-On: Troubleshooting Practice

Let us simulate a problem and walk through the methodology.

Simulate the problem:

# Create a service that will "break"
$ sudo mkdir -p /opt/scripts
$ sudo tee /opt/scripts/fake-webapp.sh << 'SCRIPT'
#!/bin/bash
# Simulate a web application that creates a log file
while true; do
    echo "$(date) - Request processed" >> /tmp/fake-webapp.log
    sleep 0.1
done
SCRIPT
$ sudo chmod +x /opt/scripts/fake-webapp.sh

$ sudo tee /etc/systemd/system/fake-webapp.service << 'UNIT'
[Unit]
Description=Fake Web Application
After=network.target

[Service]
ExecStart=/opt/scripts/fake-webapp.sh
Restart=always
User=nobody

[Install]
WantedBy=multi-user.target
UNIT

$ sudo systemctl daemon-reload
$ sudo systemctl start fake-webapp

Now the service is running and writing to /tmp/fake-webapp.log.

Simulate the symptom: "The service is writing too much to disk."

# Step 1: Define the problem
# "fake-webapp is writing to disk continuously"

# Step 2: Gather information
$ systemctl status fake-webapp
$ ls -lh /tmp/fake-webapp.log
# Watch the file grow
$ watch -n1 'ls -lh /tmp/fake-webapp.log'

# Step 3: Hypothesis
# "The application is logging every request with no rotation"

# Step 4: Test
$ tail -5 /tmp/fake-webapp.log
# Confirms: one line every 0.1 seconds = 864,000 lines/day

# Step 5: Fix
# Immediate: stop the bleeding
$ sudo truncate -s 0 /tmp/fake-webapp.log

# Long-term: implement log rotation or reduce log verbosity
# (note: "hourly" only takes effect if logrotate itself runs hourly;
#  the stock cron job/timer runs daily)
$ sudo tee /etc/logrotate.d/fake-webapp << 'EOF'
/tmp/fake-webapp.log {
    hourly
    rotate 4
    compress
    copytruncate
    maxsize 10M
}
EOF

# Step 6: Verify
$ ls -lh /tmp/fake-webapp.log   # Should be small again
$ sleep 10 && ls -lh /tmp/fake-webapp.log  # Growing slowly

# Step 7: Document
# "fake-webapp writes ~10 lines/second to its log.
#  Added logrotate config to cap at 10MB per file, keep 4 rotations.
#  TODO: reduce log verbosity to only log errors, not every request."

# Cleanup
$ sudo systemctl stop fake-webapp
$ sudo systemctl disable fake-webapp
$ sudo rm /etc/systemd/system/fake-webapp.service
$ sudo systemctl daemon-reload

Debug This: Multi-Symptom Scenario

A developer reports multiple issues on a production server:

  1. "The app is slow"
  2. "Some pages show 500 errors"
  3. "Cron jobs are failing"

All three symptoms started at approximately the same time. How do you approach this?

Methodology: Look for a single root cause that explains all symptoms.

# Check disk space first (explains slow + errors + cron failures)
$ df -h
/dev/sda1        50G   50G     0 100% /

# Bingo! A full disk explains everything:
# - App is slow: can't write to disk (temp files, sessions)
# - 500 errors: can't write logs or temporary data
# - Cron failures: can't create lock files or write output

# Find the culprit
$ sudo du -sh /var/* | sort -rh | head -5

# Fix it
$ sudo truncate -s 0 /var/log/huge-log-file.log

# Verify all three symptoms are resolved
$ curl -s -o /dev/null -w "%{http_code}" http://localhost/
200

$ sudo -u cronuser crontab -l | head -1
# Run a test cron job manually

# Prevent recurrence
# Set up disk space monitoring and log rotation

Lesson: When multiple seemingly unrelated things break simultaneously, look for a single common cause. Disk full, memory exhaustion, and network outages are the most common culprits that produce cascading failures.


What Just Happened?

┌──────────────────────────────────────────────────────────────┐
│                    CHAPTER 76 RECAP                           │
│──────────────────────────────────────────────────────────────│
│                                                              │
│  Systematic troubleshooting methodology:                     │
│  1. Define the problem (be specific)                         │
│  2. Gather information (logs, metrics, status)               │
│  3. Form a hypothesis (start simple)                         │
│  4. Test the hypothesis (change one thing at a time)         │
│  5. Implement the fix                                        │
│  6. Verify (from the user's perspective)                     │
│  7. Document (for future you and your team)                  │
│                                                              │
│  Key principles:                                             │
│  • Read the error message -- it usually tells you what       │
│  • Use the 5 Whys to find root causes                        │
│  • Mitigate first, investigate later during outages          │
│  • Multiple symptoms often share a single root cause         │
│  • Write blameless post-mortems after incidents              │
│                                                              │
│  Essential toolkit:                                          │
│  • System: uptime, free, df, top, vmstat                     │
│  • Logs: journalctl, dmesg, /var/log/*                       │
│  • Network: ip, ss, ping, dig, curl, traceroute              │
│  • Services: systemctl, systemctl --failed                   │
│  • Disk: du, lsblk, lsof, iostat                             │
│  • Permissions: ls -la, namei, getfacl, ausearch             │
│                                                              │
│  The best troubleshooters are systematic, not lucky.         │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Try This

Exercise 1: Error Message Drill

For each error message below, write down three things you would check immediately:

  1. nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address already in use)
  2. FATAL: password authentication failed for user "webapp"
  3. bash: /opt/app/run.sh: Permission denied
  4. kernel: Out of memory: Killed process 1234 (java)
  5. sshd: Connection closed by 10.0.0.5 port 22 [preauth]

Exercise 2: Simulate and Fix

Create a problem on a test system and fix it using the systematic methodology:

  1. Fill a filesystem to 100% and observe what breaks
  2. Create a service with a broken config file and troubleshoot why it will not start
  3. Block a port with iptables and diagnose the resulting connection failures

Exercise 3: Write a Post-Mortem

Think of a real incident you have experienced (even a minor one like a laptop crashing or a personal project failing). Write a post-mortem using the template from this chapter. Focus on action items that would prevent recurrence.

Exercise 4: Build a Runbook

Create a troubleshooting runbook for a service you manage. Include:

  • Common failure modes and their symptoms
  • Step-by-step diagnostic commands for each failure mode
  • Known fixes and workarounds
  • Escalation path (who to contact if you cannot fix it)

Bonus Challenge

Set up a "game day" on a test system. Have a colleague (or write a script that) introduces a problem -- full disk, killed service, wrong DNS config, firewall rule, permissions change -- and practice diagnosing it under a timer. Record your time-to-diagnosis and track improvement over multiple rounds. This is how the best ops teams train.