System Monitoring Tools

Why This Matters

Your web application is slow. Users are complaining. Is the CPU maxed out? Is the server swapping to disk? Is one process consuming all the memory? Is the disk I/O saturated?

You cannot fix what you cannot see. System monitoring tools are how you see inside a running Linux system. They reveal CPU usage, memory consumption, disk activity, process states, and system load -- the vital signs of your server.

Every experienced Linux administrator has these tools at their fingertips. When a system is misbehaving, the first thing they do is open top or htop and start reading the numbers. Within 30 seconds, they know whether the problem is CPU, memory, disk, or something else entirely. This chapter teaches you to read those numbers and know what they mean.


Try This Right Now

Open a terminal and run these commands. Look at what your system is doing right now:

# How long has the system been running? What is the load?
$ uptime
 14:23:15 up 12 days,  3:45,  2 users,  load average: 0.52, 0.38, 0.41

# Quick process overview
$ top -bn1 | head -20

# Memory at a glance
$ free -h

# Disk usage
$ df -h

# What is using the CPU right now?
$ ps aux --sort=-%cpu | head -10

Understanding Load Average

Before diving into tools, you need to understand load average -- the single most common metric you will encounter.

$ uptime
 14:23:15 up 12 days,  3:45,  2 users,  load average: 0.52, 0.38, 0.41
                                                        ^^^^  ^^^^  ^^^^
                                                        1min  5min  15min

Load average represents the average number of processes that are either running on a CPU or waiting for a CPU (in a runnable state), plus processes in uninterruptible sleep (usually waiting for disk I/O).
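To see which processes are feeding the load average at this moment, you can count the two states named above. The following is a minimal sketch, not a standard tool: count_load_states is a hypothetical helper name, and it assumes one state letter per line on stdin, as printed by ps -e -o state= from procps.

```shell
# count_load_states (hypothetical helper): read one process state per line
# on stdin and count the two states that contribute to the load average.
count_load_states() {
    awk '
    $1 ~ /^R/ { runnable++ }     # running or waiting for a CPU
    $1 ~ /^D/ { blocked++ }      # uninterruptible sleep, usually disk I/O
    END { printf "runnable=%d uninterruptible=%d\n", runnable, blocked }'
}

# Usage on a live system (assumes procps ps):
#   ps -e -o state= | count_load_states
```

The ps process itself is running while it takes the snapshot, so expect at least one runnable entry even on an idle machine.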

What Do the Numbers Mean?

Think of CPU cores as checkout lanes in a supermarket:

Single-core system (1 checkout lane):
  Load 0.5:  Lane is 50% utilized. No waiting.
  Load 1.0:  Lane is 100% utilized. Exactly saturated.
  Load 2.0:  Lane is full + 1 person waiting. Overloaded.

Quad-core system (4 checkout lanes):
  Load 2.0:  2 of 4 lanes busy. 50% utilized. Fine.
  Load 4.0:  All 4 lanes busy. 100% saturated.
  Load 8.0:  All lanes full + 4 people waiting. Overloaded.

Rule of thumb:
  Load < number of cores  → System is coping fine
  Load = number of cores  → System is at capacity
  Load > number of cores  → System is overloaded

# How many CPU cores do you have?
$ nproc
4

# Or from /proc
$ grep -c ^processor /proc/cpuinfo
4

# So for this system:
# Load < 4.0 → OK
# Load = 4.0 → at capacity
# Load > 4.0 → overloaded
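The rule of thumb can be turned into a one-line check by dividing the load by the core count. This is a small sketch under stated assumptions: load_per_core is a hypothetical helper name, and the three verdict strings are only labels for the three cases listed above.

```shell
# load_per_core LOAD CORES (hypothetical helper)
# Prints LOAD / CORES with two decimals, followed by a verdict.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        ratio = load / cores
        if (ratio < 1)       verdict = "ok"
        else if (ratio == 1) verdict = "at capacity"
        else                 verdict = "overloaded"
        printf "%.2f %s\n", ratio, verdict
    }'
}

# Usage on a live system (assumes /proc/loadavg and nproc are available):
#   load_per_core "$(cut -d" " -f1 /proc/loadavg)" "$(nproc)"
```

For example, load_per_core 2.0 4 prints 0.50 ok, matching the quad-core case in the checkout-lane analogy.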

Reading the Three Load Average Numbers

The three numbers (1-minute, 5-minute, 15-minute) tell a story:

Pattern      1min   5min   15min   Interpretation
Stable       2.0    2.0    2.0     Consistent moderate load
Spike        8.0    2.0    1.5     Recent sudden load (investigate now)
Increasing   2.0    4.0    6.0     Load is decreasing (recovering)
Decreasing   6.0    4.0    2.0     Load is increasing (getting worse)

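Wait -- the last two rows seem backwards. The pattern names describe how the numbers read from left to right, but the 1-minute average is the most recent and the 15-minute average reaches furthest back. If 15min is highest and 1min is lowest, the system was heavily loaded 15 minutes ago and is getting better. If 1min is highest and 15min is lowest, load is climbing.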

# Detailed load information
$ cat /proc/loadavg
0.52 0.38 0.41 2/345 28547
#              ^^^^^ ^^^^^
#              |     Last PID assigned
#              runnable/total processes and threads
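Reading the trend from the three numbers is a simple comparison, sketched below. load_trend is a hypothetical helper name and the four labels are assumptions chosen for this example; a spike such as 8.0 2.0 1.5 is reported as "rising" because the sketch only compares the ordering, not the size of the jump.

```shell
# load_trend ONE FIVE FIFTEEN (hypothetical helper)
# Compares the 1-, 5- and 15-minute load averages and prints the trend.
load_trend() {
    awk -v one="$1" -v five="$2" -v fifteen="$3" 'BEGIN {
        one += 0; five += 0; fifteen += 0          # force numeric comparison
        if (one > five && five > fifteen)        print "rising"
        else if (one < five && five < fifteen)   print "falling"
        else if (one == five && five == fifteen) print "stable"
        else                                     print "mixed"
    }'
}

# Usage on a live system (assumes /proc/loadavg is available):
#   load_trend $(cut -d" " -f1-3 /proc/loadavg)
```

With the values from the table, load_trend 2.0 4.0 6.0 prints falling and load_trend 6.0 4.0 2.0 prints rising.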

Think About It: A server has 2 CPU cores and a load average of 4.0, 3.5, 1.0. Is this getting better or worse? What might have happened recently?


top: The Classic Process Monitor

top is installed on virtually every Linux system. It provides a real-time view of processes, CPU, and memory.

$ top
top - 14:30:00 up 12 days,  3:52,  2 users,  load average: 0.52, 0.38, 0.41
Tasks: 245 total,   1 running, 244 sleeping,   0 stopped,   0 zombie
%Cpu(s):  5.2 us,  2.1 sy,  0.0 ni, 91.8 id,  0.5 wa,  0.0 hi,  0.4 si,  0.0 st
MiB Mem :  16384.0 total,   8234.5 free,   4521.2 used,   3628.3 buff/cache
MiB Swap:   4096.0 total,   4096.0 free,      0.0 used.  11458.4 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 1234 mysql     20   0  2.3g   512m    32m S   8.5   3.1  125:34.56 mysqld
 5678 www-data  20   0  856m   234m    12m S   3.2   1.4   45:12.89 apache2
  901 root      20   0  123m    45m     8m S   1.1   0.3   12:45.67 systemd-journald

Understanding the top Header

CPU line breakdown:

Field                     Meaning
us (user)                 Time running user processes
sy (system)               Time running kernel code
ni (nice)                 Time running niced (lower priority) processes
id (idle)                 Time doing nothing
wa (I/O wait)             Time waiting for disk I/O
hi (hardware interrupt)   Time handling hardware interrupts
si (software interrupt)   Time handling software interrupts
st (steal)                Time stolen by hypervisor (VMs only)

Key indicators of problems:

  • High wa → disk I/O bottleneck
  • High us + low id → CPU-bound process
  • High sy → lots of system calls (possible I/O or context switching)
  • High st → VM is not getting enough CPU from the host
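The indicators above can be checked mechanically against the CPU line. The sketch below assumes the procps-ng layout shown in the sample header ("%Cpu(s):  5.2 us, ...") and a locale that uses a decimal point; cpu_line_check is a hypothetical helper name, and the thresholds (20, 70/10, 30, 10) are illustrative assumptions, not recommended limits.

```shell
# cpu_line_check (hypothetical helper): read the "%Cpu(s):" line(s) from
# top on stdin and flag the patterns listed above.
# All thresholds are illustrative assumptions; tune them for your systems.
cpu_line_check() {
    awk '
    /^%Cpu/ {
        sub(/^[^:]*:/, "")                  # drop the "%Cpu(s):" label
        n = split($0, parts, ",")
        for (i = 1; i <= n; i++) {
            split(parts[i], kv, " ")        # kv[1] = percentage, kv[2] = field name
            pct[kv[2]] = kv[1] + 0
        }
        flagged = 0
        if (pct["wa"] >= 20) { printf "high I/O wait (wa=%.1f)\n", pct["wa"]; flagged = 1 }
        if (pct["us"] >= 70 && pct["id"] <= 10) { printf "CPU-bound (us=%.1f, id=%.1f)\n", pct["us"], pct["id"]; flagged = 1 }
        if (pct["sy"] >= 30) { printf "high system time (sy=%.1f)\n", pct["sy"]; flagged = 1 }
        if (pct["st"] >= 10) { printf "high steal (st=%.1f)\n", pct["st"]; flagged = 1 }
        if (!flagged) print "nothing flagged"
    }'
}

# Usage on a live system:
#   top -bn2 -d1 | grep '^%Cpu' | tail -n 1 | cpu_line_check
```

The usage line takes the second iteration because the first batch iteration of top may reflect averages since boot rather than current activity.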

top Interactive Commands

While top is running:

Key   Action
1     Toggle individual CPU cores display
M     Sort by memory usage
P     Sort by CPU usage
T     Sort by cumulative time
k     Kill a process (enter PID)
r     Renice a process (change priority)
f     Choose which fields to display
c     Toggle full command line display
H     Toggle thread display
q     Quit

Batch Mode for Scripts

# Run top once and capture output (for scripts/logs)
$ top -bn1 | head -30

# Run top 5 times with 2-second intervals
$ top -bn5 -d2 > /tmp/top-capture.txt

htop: top Made Beautiful

htop is an enhanced, interactive process viewer. If you use only one monitoring tool, make it htop.

# Install htop
$ sudo apt install htop    # Debian/Ubuntu
$ sudo dnf install htop    # Fedora (RHEL needs the EPEL repository enabled)

$ htop
  0[||||||||||||||||                    35.2%]   Tasks: 245, 128 thr; 1 running
  1[||||||                              12.5%]   Load average: 0.52 0.38 0.41
  2[||||||||||||                        28.7%]   Uptime: 12 days, 03:52:15
  3[|||||                               10.1%]
  Mem[||||||||||||||||||||         4.52G/16.0G]
  Swp[                              0K/4.00G]

  PID USER     PRI  NI  VIRT   RES   SHR S CPU%  MEM%   TIME+  Command
 1234 mysql     20   0 2354M  512M  32.4M S  8.5   3.1 125:34  /usr/sbin/mysqld
 5678 www-data  20   0  856M  234M  12.8M S  3.2   1.4  45:12  /usr/sbin/apache2

htop Advantages Over top

  • Color-coded CPU and memory bars
  • Mouse support (click to sort, scroll to navigate)
  • Horizontal and vertical scrolling (see full command lines)
  • Tree view (shows parent-child process relationships)
  • Easy process filtering and searching

htop Interactive Commands

Key   Action
F1    Help
F2    Setup (customize display)
F3    Search by name
F4    Filter processes
F5    Tree view
F6    Sort by column
F9    Kill process (choose signal)
F10   Quit
t     Toggle tree view
u     Filter by user
p     Toggle program path
H     Toggle user threads

Filtering and Searching in htop

# Press F4, type "nginx" to show only nginx processes
# Press F3, type "python" to find the next python process
# Press u, select "www-data" to show only that user's processes

vmstat: Virtual Memory Statistics

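vmstat reports system-wide CPU, memory, I/O, and process statistics at a fixed interval, which makes it ideal for spotting trends over time. The first line of output shows averages since boot; each later line covers one sampling interval.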

# Run vmstat every 2 seconds, 10 times
$ vmstat 2 10
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 8234496 123456 3628032  0    0     5    12  125  456  5  2 92  1  0
 0  0      0 8234112 123456 3628064  0    0     0     8  118  434  3  1 95  1  0
 2  0      0 8233728 123460 3628128  0    0     0    24  145  512  8  3 88  1  0

vmstat Column Reference

Column   Meaning                                           Watch For
r        Runnable processes (running or waiting for CPU)   > number of cores = CPU bottleneck
b        Processes in uninterruptible sleep                > 0 sustained = I/O bottleneck
swpd     Swap space in use (KB)                            Growing = memory pressure
si       Memory swapped in from disk (KB/s)                > 0 sustained = needs more RAM
so       Memory swapped out to disk (KB/s)                 > 0 sustained = needs more RAM
bi       Blocks read from disk/s                           High = lots of disk reads
bo       Blocks written to disk/s                          High = lots of disk writes
us       CPU user time %                                   High = CPU-bound workload
sy       CPU system time %                                 High = lots of system calls
id       CPU idle time %                                   Low = CPU bottleneck
wa       CPU I/O wait time %                               High = disk I/O bottleneck

Using vmstat to Diagnose

# "Is my system swapping?"
$ vmstat 1 5
# Look at si and so columns. If they are consistently > 0, you need more RAM.

# "Is my disk the bottleneck?"
# Look at the wa column and b column.
# High wa + high b = disk I/O saturation
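The two questions above can be answered by a short filter over vmstat output, sketched here. vmstat_check is a hypothetical helper name; the thresholds (average wa of 20 or more, average b of 1 or more) are illustrative assumptions. The sketch finds columns by header name rather than position, and it skips the first data row because that row holds averages since boot.

```shell
# vmstat_check (hypothetical helper): read "vmstat INTERVAL COUNT" output on
# stdin and report swapping and I/O wait. Thresholds are assumptions.
vmstat_check() {
    awk '
    $1 == "r" && $2 == "b" {                # header row: map names to positions
        for (i = 1; i <= NF; i++) col[$i] = i
        next
    }
    ("wa" in col) && $1 ~ /^[0-9]+$/ {      # data row
        if (++rows == 1) next               # first row is the since-boot average
        n++
        if ($(col["si"]) > 0 || $(col["so"]) > 0) swapping++
        b  += $(col["b"])
        wa += $(col["wa"])
    }
    END {
        if (n == 0) { print "no interval samples"; exit 1 }
        printf "samples=%d swapping_samples=%d avg_b=%.1f avg_wa=%.1f\n", n, swapping, b / n, wa / n
        if (swapping == n)              print "swapping in every sample: memory pressure"
        if (wa / n >= 20 && b / n >= 1) print "high wa with blocked processes: disk I/O"
    }'
}

# Usage on a live system:
#   vmstat 1 6 | vmstat_check
```

Because the first row is discarded, ask vmstat for at least two samples; a count of 6 leaves five interval samples to average.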

iostat: Disk I/O Statistics

iostat shows CPU and disk I/O statistics. It is part of the sysstat package.

# Install sysstat
$ sudo apt install sysstat    # Debian/Ubuntu
$ sudo dnf install sysstat    # Fedora/RHEL

# Basic iostat (a single report shows averages since boot)
$ iostat
Linux 6.1.0 (myhost)     01/18/2025    _x86_64_    (4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.23    0.00    2.14    0.52    0.00   92.11

Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn
sda              12.45        45.67        123.45         0.00    4567890   12345678

# Extended stats with 2-second interval
$ iostat -x 2
Device  r/s     w/s    rkB/s    wkB/s  rrqm/s  wrqm/s  %util  await  r_await  w_await
sda     5.23   7.22    45.67   123.45    0.45    2.34   3.45   1.23    0.89     1.45

Key fields:

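  • %util -- percentage of time the device was busy. Close to 100% means saturated for a device that serves requests one at a time; SSDs and RAID arrays serve requests in parallel and can have headroom left even at 100%.
  • await -- average time (ms) an I/O request takes, including time spent waiting in the queue. High values = slow or overloaded disk. Recent sysstat versions print only the separate r_await and w_await columns.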
  • r/s, w/s -- reads and writes per second.
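To pick out busy devices from extended iostat output, you can filter on the %util column. This is a sketch under assumptions: iostat_busy is a hypothetical helper name, the default limit of 80 is arbitrary, and the %util column is located by its header name because its position differs between sysstat versions.

```shell
# iostat_busy [LIMIT] (hypothetical helper): read "iostat -x" output on stdin
# and print devices whose %util is at or above LIMIT (default 80, an
# arbitrary assumption).
iostat_busy() {
    awk -v limit="${1:-80}" '
    NF == 0 { in_table = 0; next }          # a blank line ends each device table
    $1 ~ /^Device/ {                        # header row: locate the %util column
        util_col = 0
        for (i = 1; i <= NF; i++) if ($i == "%util") util_col = i
        in_table = 1
        next
    }
    in_table && util_col && $(util_col) + 0 >= limit + 0 {
        printf "%s %.2f\n", $1, $(util_col)
    }'
}

# Usage on a live system (LC_ALL=C keeps the decimal point):
#   LC_ALL=C iostat -x 2 3 | iostat_busy 80
```

Each report in the stream is checked separately, so a device that stays busy appears once per interval.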

mpstat: Per-CPU Statistics

mpstat (also part of sysstat) reports CPU usage for each core separately.

# Show all CPUs individually, every 2 seconds
$ mpstat -P ALL 2
14:30:00     CPU    %usr   %nice   %sys  %iowait   %irq   %soft  %steal  %idle
14:30:02     all    5.23    0.00   2.14     0.52    0.00    0.12    0.00  91.99
14:30:02       0    8.50    0.00   3.00     1.00    0.00    0.50    0.00  87.00
14:30:02       1   12.00    0.00   4.00     0.00    0.00    0.00    0.00  84.00
14:30:02       2    1.00    0.00   0.50     0.50    0.00    0.00    0.00  98.00
14:30:02       3    0.50    0.00   1.00     0.00    0.00    0.00    0.00  98.50

This reveals whether load is spread across cores or concentrated on one (which can indicate a single-threaded bottleneck).
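One way to spot a concentrated load is to look for the busiest core in the mpstat output. The sketch below uses a hypothetical helper name, mpstat_hottest, and treats "busy" as 100 minus %idle, which is an assumption made for this example. Columns are found by header name so that 12-hour timestamps (with an extra AM/PM field) do not shift them.

```shell
# mpstat_hottest (hypothetical helper): read "mpstat -P ALL" output on stdin
# and print the busiest individual CPU seen, as "cpuN BUSY%".
mpstat_hottest() {
    awk '
    /%idle/ {                               # header row: map names to positions
        for (i = 1; i <= NF; i++) col[$i] = i
        next
    }
    ("%idle" in col) && NF >= col["%idle"] && $(col["CPU"]) ~ /^[0-9]+$/ {
        busy = 100 - $(col["%idle"])        # skip the "all" row, keep numbered CPUs
        if (!seen || busy > max) { max = busy; cpu = $(col["CPU"]); seen = 1 }
    }
    END { if (seen) printf "cpu%s %.2f\n", cpu, max }'
}

# Usage on a live system:
#   LC_ALL=C mpstat -P ALL 2 5 | mpstat_hottest
```

If the hottest core is far above the "all" average, the workload is probably single-threaded or pinned to that core.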


sar: System Activity Reporter

sar (also from sysstat) collects and reports system activity data over time. It is the closest thing to a built-in monitoring system.

# Enable data collection (usually done by sysstat package)
$ sudo systemctl enable --now sysstat

# View today's CPU data
$ sar -u
14:00:01        CPU     %user     %nice   %system   %iowait   %steal     %idle
14:10:01        all      5.23      0.00      2.14      0.52      0.00     92.11
14:20:01        all      3.45      0.00      1.78      0.34      0.00     94.43
14:30:01        all     12.67      0.00      4.56      1.23      0.00     81.54
...

# View memory usage over time
$ sar -r

# View disk I/O over time
$ sar -d

# View network statistics
$ sar -n DEV

# View a specific day's data
$ sar -u -f /var/log/sysstat/sa18    # 18th of the month (Debian/Ubuntu path; Fedora/RHEL use /var/log/sa/sa18)

# View data for a specific time range
$ sar -u -s 14:00:00 -e 16:00:00

sar data is collected every 10 minutes by default (via a cron job or systemd timer). This gives you historical data to analyze trends and correlate with incidents.
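With a day of samples collected, finding the busiest interval is a matter of locating the lowest %idle. The following sketch uses a hypothetical helper name, sar_busiest, and again treats "busy" as 100 minus %idle. It reads only the "all" rows and ignores the trailing Average: line.

```shell
# sar_busiest (hypothetical helper): read "sar -u" output on stdin and print
# the timestamp and busy percentage of the interval with the lowest %idle.
sar_busiest() {
    awk '
    /%idle/ {                               # header row: map names to positions
        for (i = 1; i <= NF; i++) col[$i] = i
        next
    }
    ("%idle" in col) && $1 != "Average:" && NF >= col["%idle"] && $(col["CPU"]) == "all" {
        idle = $(col["%idle"]) + 0
        if (!seen || idle < min) {
            min = idle
            when = $1                       # timestamp may span two fields (AM/PM)
            for (i = 2; i < col["CPU"]; i++) when = when " " $i
            seen = 1
        }
    }
    END { if (seen) printf "%s %.2f\n", when, 100 - min }'
}

# Usage on a live system:
#   sar -u | sar_busiest
```

Run against the sample output above, this reports 14:30:01 as the busiest interval, at 18.46% busy.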


Key /proc Files for Monitoring

The /proc filesystem is where most monitoring tools, including the ones in this chapter, get their numbers. Here are the most useful files:

# Load average
$ cat /proc/loadavg
0.52 0.38 0.41 2/345 28547

# Memory details
$ cat /proc/meminfo
MemTotal:       16777216 kB
MemFree:         8234496 kB
MemAvailable:   11458560 kB
Buffers:          123456 kB
Cached:          3628032 kB
SwapTotal:       4194304 kB
SwapFree:        4194304 kB
...

# CPU information
$ grep -m1 "model name" /proc/cpuinfo
model name    : Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz

# Per-process information
$ cat /proc/1234/status    # replace 1234 with a PID
Name:   mysqld
State:  S (sleeping)
VmRSS:  524288 kB
Threads:  32
...
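Because these files are plain text, a few lines of awk are enough to build your own metric. As a sketch, the hypothetical helper mem_used_pct below derives a "memory in use" percentage from MemTotal and MemAvailable; defining "used" as total minus available is an assumption of this example (it counts reclaimable cache as free), and MemAvailable is only present on reasonably recent kernels.

```shell
# mem_used_pct [FILE] (hypothetical helper): print the percentage of memory
# in use, computed as (MemTotal - MemAvailable) / MemTotal.
# FILE defaults to /proc/meminfo; pass "-" to read from stdin.
mem_used_pct() {
    awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
        if (total > 0 && avail != "") printf "%.1f\n", (total - avail) * 100 / total
        else exit 1                          # a required field was missing
    }' "${1:-/proc/meminfo}"
}

# Usage on a live system:
#   mem_used_pct
```

With the sample values above (16777216 kB total, 11458560 kB available), the result is 31.7.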

glances: A Modern All-in-One Monitor

glances is a cross-platform monitoring tool that combines top, iostat, iftop, and more into a single display.

# Install glances
$ sudo apt install glances        # Debian/Ubuntu
$ sudo dnf install glances        # Fedora/RHEL
$ pip install glances             # via pip

# Run glances
$ glances

glances automatically highlights values that need attention in yellow (warning) or red (critical). It shows CPU, memory, load, disk I/O, network, processes, and more in a single screen.

# Glances in web mode (access via browser)
$ glances -w
Glances Web UI started on http://0.0.0.0:61208

# Glances client-server mode
$ glances -s    # on the server
$ glances -c server-hostname    # from the client

Distro Note: On minimal server installations, glances may pull in many Python dependencies. In such environments, htop + iostat may be more practical.


Debug This

A developer reports that their application is "slow" on the server. Here is what you see:

$ uptime
 14:30:00 up 45 days,  load average: 12.34, 11.87, 8.45

$ nproc
4

$ vmstat 1 3
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
12  3 524288  45312   8192  65536  450  890   5600  1200  890 2345 15  8  2 75  0

What is the problem? Walk through the diagnosis:

  1. Load average: 12.34 on a 4-core system. That is about 3x what the cores can handle, and it has been climbing (the 15-minute value is 8.45).
  2. vmstat r column: 12 runnable processes on 4 cores, so processes are queuing for CPU time.
  3. vmstat b column: 3 processes blocked on I/O.
  4. vmstat si/so: Swapping in and out heavily (450/890 KB/s). Memory pressure.
  5. vmstat wa: 75% I/O wait. The CPU is mostly waiting for disk.
  6. vmstat free: Only 45 MB free. Very low.

Diagnosis: The system is severely memory-constrained. It is swapping heavily, which causes high I/O wait, which makes the CPU idle waiting for disk. The root cause is not CPU -- it is memory.

Fix: Find the memory-hungry process (htop, sort by MEM%), and either kill it, add more RAM, or reduce the application's memory usage.
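When a service runs as many worker processes, the largest single process can be misleading, so it helps to total memory per command name. The sketch below is one way to do that: mem_by_command is a hypothetical helper name, it expects "RSS-in-KB command" pairs on stdin (as printed by ps -e -o rss= -o comm=), and it uses only the first word of the command name. Shared pages are counted once per process, so the totals overstate real usage.

```shell
# mem_by_command [N] (hypothetical helper): read "RSS_KB COMMAND" lines on
# stdin, sum resident memory per command, and print the top N (default 5)
# as "MiB COMMAND". Only the first word of the command name is used.
mem_by_command() {
    awk '
    NF >= 2 && $1 ~ /^[0-9]+$/ { rss[$2] += $1 }
    END { for (name in rss) printf "%d %s\n", rss[name] / 1024, name }
    ' | sort -rn | head -n "${1:-5}"
}

# Usage on a live system (assumes procps ps):
#   ps -e -o rss= -o comm= | mem_by_command 5
```

The top entry is the first candidate to investigate before deciding whether to restart it, tune it, or add RAM.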


┌──────────────────────────────────────────────────────────┐
│                  What Just Happened?                      │
├──────────────────────────────────────────────────────────┤
│                                                           │
│  System monitoring gives you visibility:                  │
│                                                           │
│  Load Average:                                            │
│  - Represents demand on the CPU                           │
│  - Compare against number of cores (nproc)                │
│  - Three values: 1min, 5min, 15min trend                  │
│                                                           │
│  Tools:                                                   │
│  - top/htop: real-time process monitoring                 │
│  - vmstat: CPU, memory, swap, I/O snapshot                │
│  - iostat: disk I/O performance                           │
│  - mpstat: per-CPU breakdown                              │
│  - sar: historical data collection                        │
│  - glances: all-in-one modern dashboard                   │
│                                                           │
│  Diagnosis pattern:                                       │
│  1. Check load average (overloaded?)                      │
│  2. Check CPU (us/sy/wa/id in top)                        │
│  3. Check memory (free, swap usage)                       │
│  4. Check disk I/O (iostat %util, await)                  │
│  5. Find the guilty process (sort by CPU or MEM)          │
│                                                           │
└──────────────────────────────────────────────────────────┘

Try This

  1. Read top: Run top, press 1 to see individual CPUs, then press M to sort by memory. Identify the top 5 memory consumers on your system.

  2. vmstat trending: Run vmstat 1 60 for one minute while doing something intensive (like compiling software or running stress --cpu 4; stress is a separate package that you may need to install first). Watch how the numbers change.

  3. Historical data: Enable sysstat, wait a few hours, then use sar -u and sar -r to view CPU and memory trends. Identify the busiest period.

  4. Load average experiment: Run stress --cpu 8 --timeout 60 on a multi-core system. Watch the 1-minute load average rise in uptime while the 15-minute average stays low.

  5. Bonus challenge: Write a shell script that captures vmstat, iostat, and free output every minute for an hour, saves it to a log file, and then use awk or grep to find the moment of peak memory usage.