System Monitoring Tools
Why This Matters
Your web application is slow. Users are complaining. Is the CPU maxed out? Is the server swapping to disk? Is one process consuming all the memory? Is the disk I/O saturated?
You cannot fix what you cannot see. System monitoring tools are how you see inside a running Linux system. They reveal CPU usage, memory consumption, disk activity, process states, and system load -- the vital signs of your server.
Every experienced Linux administrator has these tools at their fingertips. When a system is misbehaving, the first thing they do is open top or htop and start reading the numbers. Within 30 seconds, they know whether the problem is CPU, memory, disk, or something else entirely. This chapter teaches you to read those numbers and know what they mean.
Try This Right Now
Open a terminal and run these commands. Look at what your system is doing right now:
# How long has the system been running? What is the load?
$ uptime
14:23:15 up 12 days, 3:45, 2 users, load average: 0.52, 0.38, 0.41
# Quick process overview
$ top -bn1 | head -20
# Memory at a glance
$ free -h
# Disk usage
$ df -h
# What is using the CPU right now?
$ ps aux --sort=-%cpu | head -10
Understanding Load Average
Before diving into tools, you need to understand load average -- the single most common metric you will encounter.
$ uptime
14:23:15 up 12 days, 3:45, 2 users, load average: 0.52, 0.38, 0.41
                                                  ^^^^  ^^^^  ^^^^
                                                  1min  5min  15min
Load average represents the average number of processes that are either running on a CPU or waiting for a CPU (in a runnable state), plus processes in uninterruptible sleep (usually waiting for disk I/O).
What Do the Numbers Mean?
Think of CPU cores as checkout lanes in a supermarket:
Single-core system (1 checkout lane):
Load 0.5: Lane is 50% utilized. No waiting.
Load 1.0: Lane is 100% utilized. Exactly saturated.
Load 2.0: Lane is full + 1 person waiting. Overloaded.
Quad-core system (4 checkout lanes):
Load 2.0: 2 of 4 lanes busy. 50% utilized. Fine.
Load 4.0: All 4 lanes busy. 100% saturated.
Load 8.0: All lanes full + 4 people waiting. Overloaded.
Rule of thumb:
Load < number of cores → System is coping fine
Load = number of cores → System is at capacity
Load > number of cores → System is overloaded
# How many CPU cores do you have?
$ nproc
4
# Or from /proc
$ grep -c ^processor /proc/cpuinfo
4
# So for this system:
# Load < 4.0 → OK
# Load = 4.0 → at capacity
# Load > 4.0 → overloaded
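The rule of thumb can be wrapped in a small script. A minimal sketch (the comparison is done in awk because shell arithmetic is integer-only; the OK/AT CAPACITY/OVERLOADED wording is just illustrative):

```shell
#!/bin/sh
# Compare the 1-minute load average against the number of CPU cores.
cores=$(nproc)
load1=$(cut -d' ' -f1 /proc/loadavg)

# awk handles the floating-point comparison.
status=$(awk -v load="$load1" -v cores="$cores" 'BEGIN {
    if (load < cores)      print "OK: load " load " on " cores " cores"
    else if (load > cores) print "OVERLOADED: load " load " on " cores " cores"
    else                   print "AT CAPACITY: load " load " on " cores " cores"
}')
echo "$status"
```

Dropped into a cron job, a script like this becomes a crude load alarm.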
Reading the Three Load Average Numbers
The three numbers (1-minute, 5-minute, 15-minute) tell a story:
| Pattern | 1min | 5min | 15min | Interpretation |
|---|---|---|---|---|
| Stable | 2.0 | 2.0 | 2.0 | Consistent moderate load |
| Spike | 8.0 | 2.0 | 1.5 | Recent sudden load (investigate now) |
| Recovering | 2.0 | 4.0 | 6.0 | Load is decreasing (getting better) |
| Worsening | 6.0 | 4.0 | 2.0 | Load is increasing (getting worse) |
The last two rows can look backwards at first glance. Remember: the 15-minute average is the oldest data. If the 15-minute value is highest and the 1-minute value is lowest, the system was heavily loaded 15 minutes ago and has since calmed down.
# Detailed load information
$ cat /proc/loadavg
0.52 0.38 0.41 2/345 28547
#              ^^^^^ ^^^^^
#              |     last PID assigned
#              running/total processes
Think About It: A server has 2 CPU cores and a load average of 4.0, 3.5, 1.0. Is this getting better or worse? What might have happened recently?
top: The Classic Process Monitor
top is installed on virtually every Linux system. It provides a real-time view of processes, CPU, and memory.
$ top
top - 14:30:00 up 12 days, 3:52, 2 users, load average: 0.52, 0.38, 0.41
Tasks: 245 total, 1 running, 244 sleeping, 0 stopped, 0 zombie
%Cpu(s): 5.2 us, 2.1 sy, 0.0 ni, 91.8 id, 0.5 wa, 0.0 hi, 0.4 si, 0.0 st
MiB Mem : 16384.0 total, 8234.5 free, 4521.2 used, 3628.3 buff/cache
MiB Swap: 4096.0 total, 4096.0 free, 0.0 used. 11458.4 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1234 mysql 20 0 2.3g 512m 32m S 8.5 3.1 125:34.56 mysqld
5678 www-data 20 0 856m 234m 12m S 3.2 1.4 45:12.89 apache2
901 root 20 0 123m 45m 8m S 1.1 0.3 12:45.67 systemd-journald
Understanding the top Header
CPU line breakdown:
| Field | Meaning |
|---|---|
| `us` (user) | Time running user processes |
| `sy` (system) | Time running kernel code |
| `ni` (nice) | Time running niced (lower-priority) processes |
| `id` (idle) | Time doing nothing |
| `wa` (I/O wait) | Time waiting for disk I/O |
| `hi` (hardware interrupt) | Time handling hardware interrupts |
| `si` (software interrupt) | Time handling software interrupts |
| `st` (steal) | Time stolen by the hypervisor (VMs only) |
Key indicators of problems:
- High `wa` → disk I/O bottleneck
- High `us` + low `id` → CPU-bound process
- High `sy` → lots of system calls (possible I/O or context switching)
- High `st` → VM is not getting enough CPU from the host
top Interactive Commands
While top is running:
| Key | Action |
|---|---|
| `1` | Toggle individual CPU core display |
| `M` | Sort by memory usage |
| `P` | Sort by CPU usage |
| `T` | Sort by cumulative time |
| `k` | Kill a process (enter PID) |
| `r` | Renice a process (change priority) |
| `f` | Choose which fields to display |
| `c` | Toggle full command-line display |
| `H` | Toggle thread display |
| `q` | Quit |
Batch Mode for Scripts
# Run top once and capture output (for scripts/logs)
$ top -bn1 | head -30
# Run top 5 times with 2-second intervals
$ top -bn5 -d2 > /tmp/top-capture.txt
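Batch mode makes periodic logging easy. A minimal sketch that appends timestamped snapshots to a log file (the path, interval, and iteration count are illustrative; a real monitor would loop indefinitely with a 60-second sleep):

```shell
#!/bin/sh
# Append timestamped top snapshots to a log file.
LOG=/tmp/top-monitor.log

for i in 1 2 3; do    # use "while true" for continuous monitoring
    {
        echo "=== $(date '+%F %T') ==="
        top -bn1 | head -15
    } >> "$LOG"
    sleep 2           # use 60 or more in practice
done
```

A log like this is invaluable after an incident: you can see exactly which processes were running when the problem started.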
htop: top Made Beautiful
htop is an enhanced, interactive process viewer. If you use only one monitoring tool, make it htop.
# Install htop
$ sudo apt install htop # Debian/Ubuntu
$ sudo dnf install htop # Fedora/RHEL
$ htop
0[|||||||||||||||| 35.2%] Tasks: 245, 128 thr; 1 running
1[|||||| 12.5%] Load average: 0.52 0.38 0.41
2[|||||||||||| 28.7%] Uptime: 12 days, 03:52:15
3[||||| 10.1%]
Mem[|||||||||||||||||||| 4.52G/16.0G]
Swp[ 0K/4.00G]
PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
1234 mysql 20 0 2354M 512M 32.4M S 8.5 3.1 125:34 /usr/sbin/mysqld
5678 www-data 20 0 856M 234M 12.8M S 3.2 1.4 45:12 /usr/sbin/apache2
htop Advantages Over top
- Color-coded CPU and memory bars
- Mouse support (click to sort, scroll to navigate)
- Horizontal and vertical scrolling (see full command lines)
- Tree view (shows parent-child process relationships)
- Easy process filtering and searching
htop Interactive Commands
| Key | Action |
|---|---|
| `F1` | Help |
| `F2` | Setup (customize display) |
| `F3` | Search by name |
| `F4` | Filter processes |
| `F5` | Tree view |
| `F6` | Sort by column |
| `F9` | Kill process (choose signal) |
| `F10` | Quit |
| `t` | Toggle tree view |
| `u` | Filter by user |
| `p` | Toggle program path |
| `H` | Toggle user threads |
Filtering and Searching in htop
# Press F4, type "nginx" to show only nginx processes
# Press F3, type "python" to find the next python process
# Press u, select "www-data" to show only that user's processes
vmstat: Virtual Memory Statistics
vmstat provides a snapshot of system-wide CPU, memory, I/O, and process statistics. It is ideal for spotting trends over time.
# Run vmstat every 2 seconds, 10 times
$ vmstat 2 10
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 0 8234496 123456 3628032 0 0 5 12 125 456 5 2 92 1 0
0 0 0 8234112 123456 3628064 0 0 0 8 118 434 3 1 95 1 0
2 0 0 8233728 123460 3628128 0 0 0 24 145 512 8 3 88 1 0
vmstat Column Reference
| Column | Meaning | Watch For |
|---|---|---|
| `r` | Processes waiting for CPU | > number of cores = CPU bottleneck |
| `b` | Processes in uninterruptible sleep | > 0 sustained = I/O bottleneck |
| `swpd` | Virtual memory used (KB) | Growing = memory pressure |
| `si` | Swap in (KB/s) | > 0 sustained = needs more RAM |
| `so` | Swap out (KB/s) | > 0 sustained = needs more RAM |
| `bi` | Blocks read from disk/s | High = heavy disk reads |
| `bo` | Blocks written to disk/s | High = heavy disk writes |
| `us` | CPU user time % | High = CPU-bound workload |
| `sy` | CPU system time % | High = lots of system calls |
| `id` | CPU idle time % | Low = CPU bottleneck |
| `wa` | CPU I/O wait time % | High = disk I/O bottleneck |
Using vmstat to Diagnose
# "Is my system swapping?"
$ vmstat 1 5
# Look at si and so columns. If they are consistently > 0, you need more RAM.
# "Is my disk the bottleneck?"
# Look at the wa column and b column.
# High wa + high b = disk I/O saturation
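The swap check can be automated. A minimal sketch that samples vmstat and flags swap activity (the field positions 7 and 8 for `si`/`so` assume the standard vmstat column layout shown above):

```shell
#!/bin/sh
# Sum the si (swap in) and so (swap out) columns across 3 samples.
# NR > 3 skips the two header lines and the first sample, which reports
# averages since boot rather than current activity.
swap_total=$(vmstat 1 4 | awk 'NR > 3 { total += $7 + $8 } END { print total + 0 }')

if [ "$swap_total" -gt 0 ]; then
    echo "WARNING: system is swapping (si+so summed to $swap_total KB)"
else
    echo "OK: no swap activity observed"
fi
```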
iostat: Disk I/O Statistics
iostat shows CPU and disk I/O statistics. It is part of the sysstat package.
# Install sysstat
$ sudo apt install sysstat # Debian/Ubuntu
$ sudo dnf install sysstat # Fedora/RHEL
# Basic iostat
$ iostat
Linux 6.1.0 (myhost) 01/18/2025 _x86_64_ (4 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
5.23 0.00 2.14 0.52 0.00 92.11
Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn
sda 12.45 45.67 123.45 0.00 4567890 12345678
# Extended stats with 2-second interval
$ iostat -x 2
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %util await r_await w_await
sda 5.23 7.22 45.67 123.45 0.45 2.34 3.45 1.23 0.89 1.45
Key fields:
- `%util` -- percentage of time the device was busy. 100% means saturated.
- `await` -- average time (ms) for I/O requests. High values = slow disk.
- `r/s`, `w/s` -- reads and writes per second.
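These fields can also be checked from a script. A hedged sketch that flags devices above 80% utilization (the threshold is arbitrary; `%util` is located by header name because column order varies between sysstat versions):

```shell
#!/bin/sh
# Requires the sysstat package; exit quietly if iostat is missing.
command -v iostat >/dev/null 2>&1 || { echo "iostat not installed"; exit 0; }

# Take two reports one second apart; only the second reflects current
# activity (the first reports averages since boot).
out=$(iostat -x 1 2 | awk '
    /^Device/ {
        # Find the %util column by name; count which report we are in.
        for (i = 1; i <= NF; i++) if ($i == "%util") col = i
        report++
        next
    }
    report == 2 && col && NF > 1 && $(col) + 0 > 80 {
        print "BUSY: " $1 " at " $(col) "% utilization"
    }')

if [ -n "$out" ]; then echo "$out"; else echo "OK: no device above 80% utilization"; fi
```

On an idle system this prints the OK line; under heavy I/O it names the saturated device.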
mpstat: Per-CPU Statistics
# Show all CPUs individually
$ mpstat -P ALL 2
14:30:00 CPU %usr %nice %sys %iowait %irq %soft %steal %idle
14:30:02 all 5.23 0.00 2.14 0.52 0.00 0.12 0.00 91.99
14:30:02 0 8.50 0.00 3.00 1.00 0.00 0.50 0.00 87.00
14:30:02 1 12.00 0.00 4.00 0.00 0.00 0.00 0.00 84.00
14:30:02 2 1.00 0.00 0.50 0.50 0.00 0.00 0.00 98.00
14:30:02 3 0.50 0.00 1.00 0.00 0.00 0.00 0.00 98.50
This reveals whether load is spread across cores or concentrated on one (which can indicate a single-threaded bottleneck).
sar: System Activity Reporter
sar (also from sysstat) collects and reports system activity data over time. It is the closest thing to a built-in monitoring system.
# Enable data collection (usually done by sysstat package)
$ sudo systemctl enable --now sysstat
# View today's CPU data
$ sar -u
14:00:01 CPU %user %nice %system %iowait %steal %idle
14:10:01 all 5.23 0.00 2.14 0.52 0.00 92.11
14:20:01 all 3.45 0.00 1.78 0.34 0.00 94.43
14:30:01 all 12.67 0.00 4.56 1.23 0.00 81.54
...
# View memory usage over time
$ sar -r
# View disk I/O over time
$ sar -d
# View network statistics
$ sar -n DEV
# View a specific day's data
$ sar -u -f /var/log/sysstat/sa18 # 18th of the month
# View data for a specific time range
$ sar -u -s 14:00:00 -e 16:00:00
sar data is collected every 10 minutes by default (via a cron job or systemd timer). This gives you historical data to analyze trends and correlate with incidents.
Key /proc Files for Monitoring
The /proc filesystem is the source of truth for all monitoring tools. Here are the most useful files:
# Load average
$ cat /proc/loadavg
0.52 0.38 0.41 2/345 28547
# Memory details
$ cat /proc/meminfo
MemTotal: 16777216 kB
MemFree: 8234496 kB
MemAvailable: 11458560 kB
Buffers: 123456 kB
Cached: 3628032 kB
SwapTotal: 4194304 kB
SwapFree: 4194304 kB
...
# CPU information
$ cat /proc/cpuinfo | grep "model name" | head -1
model name : Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
# Per-process information
$ cat /proc/1234/status # replace 1234 with a PID
Name: mysqld
State: S (sleeping)
VmRSS: 524288 kB
Threads: 32
...
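Because /proc/meminfo is plain text, you can compute your own summaries from it. A minimal sketch that derives a memory-pressure figure the same way free(1) does, using MemAvailable (values in the file are in kB, so 1048576 kB = 1 GB):

```shell
#!/bin/sh
# Summarize memory pressure from /proc/meminfo.
summary=$(awk '
    /^MemTotal:/     { total = $2 }
    /^MemAvailable:/ { avail = $2 }
    END {
        printf "Total: %.1f GB, Available: %.1f GB (%.0f%% in use)",
               total / 1048576, avail / 1048576,
               (total - avail) * 100 / total
    }' /proc/meminfo)
echo "$summary"
```

MemAvailable is the better pressure signal: unlike MemFree, it accounts for cache the kernel can reclaim on demand.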
glances: A Modern All-in-One Monitor
glances is a cross-platform monitoring tool that combines top, iostat, iftop, and more into a single display.
# Install glances
$ sudo apt install glances # Debian/Ubuntu
$ sudo dnf install glances # Fedora/RHEL
$ pip install glances # via pip
# Run glances
$ glances
glances automatically highlights values that need attention in yellow (warning) or red (critical). It shows CPU, memory, load, disk I/O, network, processes, and more in a single screen.
# Glances in web mode (access via browser)
$ glances -w
Glances Web UI started on http://0.0.0.0:61208
# Glances client-server mode
$ glances -s # on the server
$ glances -c server-hostname # from the client
Distro Note: On minimal server installations, `glances` may pull in many Python dependencies. In such environments, `htop` + `iostat` may be more practical.
Debug This
A developer reports that their application is "slow" on the server. Here is what you see:
$ uptime
14:30:00 up 45 days, load average: 12.34, 11.87, 8.45
$ nproc
4
$ vmstat 1 3
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
12 3 524288 45312 8192 65536 450 890 5600 1200 890 2345 15 8 2 75 0
What is the problem? Walk through the diagnosis:
- Load average: 12.34 on a 4-core system. That is 3x overloaded.
- vmstat `r` column: 12 processes waiting for CPU. Confirms CPU contention.
- vmstat `b` column: 3 processes blocked on I/O.
- vmstat `si`/`so`: swapping in and out heavily (450/890 KB/s). Memory pressure.
- vmstat `wa`: 75% I/O wait. The CPU is mostly waiting for disk.
- vmstat `free`: only 45 MB free. Very low.
Diagnosis: The system is severely memory-constrained. It is swapping heavily, which causes high I/O wait, which makes the CPU idle waiting for disk. The root cause is not CPU -- it is memory.
Fix: Find the memory-hungry process (htop, sort by MEM%), and either kill it, add more RAM, or reduce the application's memory usage.
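Diagnoses like this one repeat often enough to be worth scripting. A minimal first-look triage sketch that combines the commands from this chapter into one snapshot:

```shell
#!/bin/sh
# One-shot triage: load vs cores, memory, and the top resource consumers.
echo "--- Load vs cores ---"
echo "cores: $(nproc)   load: $(cut -d' ' -f1-3 /proc/loadavg)"

echo "--- Memory ---"
free -h

echo "--- Top 3 processes by CPU ---"
ps aux --sort=-%cpu | head -4

echo "--- Top 3 processes by memory ---"
ps aux --sort=-%mem | head -4
```

Run this first on any "the server is slow" report; it usually narrows the problem to CPU, memory, or disk in seconds.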
┌──────────────────────────────────────────────────────────┐
│ What Just Happened? │
├──────────────────────────────────────────────────────────┤
│ │
│ System monitoring gives you visibility: │
│ │
│ Load Average: │
│ - Represents demand on the CPU │
│ - Compare against number of cores (nproc) │
│ - Three values: 1min, 5min, 15min trend │
│ │
│ Tools: │
│ - top/htop: real-time process monitoring │
│ - vmstat: CPU, memory, swap, I/O snapshot │
│ - iostat: disk I/O performance │
│ - mpstat: per-CPU breakdown │
│ - sar: historical data collection │
│ - glances: all-in-one modern dashboard │
│ │
│ Diagnosis pattern: │
│ 1. Check load average (overloaded?) │
│ 2. Check CPU (us/sy/wa/id in top) │
│ 3. Check memory (free, swap usage) │
│ 4. Check disk I/O (iostat %util, await) │
│ 5. Find the guilty process (sort by CPU or MEM) │
│ │
└──────────────────────────────────────────────────────────┘
Try This
1. Read top: Run `top`, press `1` to see individual CPUs, then press `M` to sort by memory. Identify the top 5 memory consumers on your system.
2. vmstat trending: Run `vmstat 1 60` for one minute while doing something intensive (like compiling software or running `stress --cpu 4`). Watch how the numbers change.
3. Historical data: Enable `sysstat`, wait a few hours, then use `sar -u` and `sar -r` to view CPU and memory trends. Identify the busiest period.
4. Load average experiment: Run `stress --cpu 8 --timeout 60` on a multi-core system. Watch the 1-minute load average rise in `uptime` while the 15-minute average stays low.
5. Bonus challenge: Write a shell script that captures `vmstat`, `iostat`, and `free` output every minute for an hour, saves it to a log file, and then use `awk` or `grep` to find the moment of peak memory usage.