Disk I/O & Performance

Why This Matters

Your database queries are taking 30 seconds instead of 3. The application is not doing anything differently. The CPU is mostly idle. Memory is fine. What is going on?

You check iostat and see the disk is 100% utilized with average I/O latency of 45 milliseconds. The disk is the bottleneck. Maybe a backup job is running and saturating the disk. Maybe the working set no longer fits in the page cache. Maybe the I/O scheduler is not suited for this workload.

Disk I/O is frequently the slowest component in any system. RAM operates in nanoseconds. SSDs in microseconds. Spinning hard drives in milliseconds. That is a million-fold difference between RAM and HDD. Understanding I/O performance -- how to measure it, how to identify bottlenecks, and how to tune it -- is a fundamental skill for any Linux administrator.


Try This Right Now

# What disks do you have?
$ lsblk -d -o NAME,ROTA,SIZE,MODEL
# ROTA=1 means rotational (HDD), ROTA=0 means SSD/NVMe

# What I/O scheduler is in use?
$ cat /sys/block/sda/queue/scheduler
# or for NVMe:
$ cat /sys/block/nvme0n1/queue/scheduler

# Quick I/O stats
$ iostat -x 1 3

# Who is doing I/O right now?
$ sudo iotop -o -b -n 3 2>/dev/null || echo "Install iotop: sudo apt install iotop"

I/O Schedulers

The I/O scheduler determines the order in which disk I/O requests are served. Different schedulers optimize for different workloads.

Available Schedulers

# View available schedulers for a device
$ cat /sys/block/sda/queue/scheduler
[mq-deadline] kyber bfq none

# The one in brackets is currently active
┌──────────────────────────────────────────────────────────┐
│                  I/O SCHEDULERS                           │
├──────────────────────────────────────────────────────────┤
│                                                           │
│  mq-deadline                                              │
│  - Default for most setups                                │
│  - Ensures requests are served within a deadline          │
│  - Good for: databases, mixed workloads                   │
│  - Prevents starvation of reads by heavy writes           │
│                                                           │
│  bfq (Budget Fair Queueing)                               │
│  - Fair scheduling between processes                      │
│  - Good for: desktops, interactive workloads              │
│  - Higher CPU overhead than mq-deadline                   │
│  - Best when fairness matters more than throughput         │
│                                                           │
│  kyber                                                    │
│  - Lightweight, designed for fast devices (NVMe, SSD)     │
│  - Good for: SSDs with high IOPS capability               │
│  - Minimal CPU overhead                                   │
│  - Separates reads and writes into different queues        │
│                                                           │
│  none                                                     │
│  - No scheduling at all (FIFO)                            │
│  - Good for: NVMe devices with internal schedulers        │
│  - Minimum latency, no CPU overhead                       │
│  - Best when the device itself handles scheduling         │
│                                                           │
└──────────────────────────────────────────────────────────┘

Changing the I/O Scheduler

# Change at runtime (immediate, not persistent)
$ echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler
$ echo none | sudo tee /sys/block/nvme0n1/queue/scheduler

# Verify
$ cat /sys/block/sda/queue/scheduler
[mq-deadline] kyber bfq none

# Make it persistent with a udev rule
$ sudo vim /etc/udev/rules.d/60-ioschedulers.rules
# Set mq-deadline for rotational (HDD) devices
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="mq-deadline"

# Set none for non-rotational (SSD/NVMe) devices
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="none"
ACTION=="add|change", KERNEL=="nvme[0-9]*", ATTR{queue/scheduler}="none"
# Reload udev rules
$ sudo udevadm control --reload-rules
$ sudo udevadm trigger
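
The same rotational-flag logic can be expressed as a quick audit script, handy for checking a machine before installing the udev rules. The `suggest_scheduler` helper is a name invented for this sketch; the recommendations simply mirror the HDD -> mq-deadline, SSD/NVMe -> none convention used above.

```shell
# Sketch: report each block device's rotational flag and a suggested
# scheduler. suggest_scheduler is illustrative, not a standard tool.
suggest_scheduler() {
    if [ "$1" = "1" ]; then       # 1 = rotational (HDD)
        echo "mq-deadline"
    else                          # 0 = non-rotational (SSD/NVMe)
        echo "none"
    fi
}

for dev in /sys/block/*; do
    rota_file="$dev/queue/rotational"
    [ -r "$rota_file" ] || continue
    rota=$(cat "$rota_file")
    echo "$(basename "$dev"): rotational=$rota -> $(suggest_scheduler "$rota")"
done
```

Compare its suggestions against `cat /sys/block/*/queue/scheduler` to see where the active scheduler differs from the recommendation.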

Which Scheduler for Which Situation?

Device Type   Workload              Recommended Scheduler
-----------   --------              ---------------------
HDD           Database              mq-deadline
HDD           Desktop/interactive   bfq
HDD           General server        mq-deadline
SSD (SATA)    Database              mq-deadline or none
SSD (SATA)    General               mq-deadline
NVMe SSD      Any                   none

Think About It: Why would none be the best scheduler for NVMe SSDs? What advantage does the NVMe device have that makes kernel-level scheduling unnecessary?


iostat Deep Dive

iostat is your primary tool for understanding disk I/O performance. Let us break down every column.

# Install if needed (part of sysstat)
$ sudo apt install sysstat    # Debian/Ubuntu
$ sudo dnf install sysstat    # Fedora/RHEL

# Extended statistics, 2-second interval
$ iostat -xz 2
Linux 6.1.0 (myhost)     01/18/2025    _x86_64_    (4 CPU)

Device  r/s     w/s    rkB/s    wkB/s  rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
sda     45.23  123.45  5678.90  12345.67  2.34    8.90   4.92   6.72    0.89     1.45     0.15   125.56     100.01   0.45   7.60

Column Reference

Column    Full Name                 Meaning                             Watch For
------    ---------                 -------                             ---------
r/s       Reads per second          IOPS for reads
w/s       Writes per second         IOPS for writes
rkB/s     Read KB/sec               Read throughput
wkB/s     Write KB/sec              Write throughput
rrqm/s    Read requests merged/s    Adjacent reads merged into one
wrqm/s    Write requests merged/s   Adjacent writes merged into one
r_await   Read await (ms)           Average read latency                > 10ms (SSD) or > 20ms (HDD)
w_await   Write await (ms)          Average write latency               > 10ms (SSD) or > 20ms (HDD)
aqu-sz    Average queue size        How many I/O requests are queued    > 1 sustained = saturation
svctm     Service time (ms)         Time to actually service the I/O    Deprecated; use await instead
%util     Utilization               Percentage of time device was busy  > 80% = concerning

Reading iostat Like a Pro

# Scenario 1: Healthy SSD
Device  r/s     w/s    r_await  w_await  aqu-sz  %util
sda     150.0   200.0    0.15     0.25    0.05   3.50
# Low latency, low queue, low utilization. All good.

# Scenario 2: Saturated HDD
Device  r/s     w/s    r_await  w_await  aqu-sz  %util
sda     120.0    30.0   45.00    12.00    5.60  99.80
# High latency (45ms reads), deep queue, fully utilized. Bottleneck!

# Scenario 3: Write-heavy workload
Device  r/s     w/s    r_await  w_await  aqu-sz  %util
sda       5.0  500.0    0.50    15.00    7.50  85.00
# Lots of writes, high write latency, high utilization.
# Maybe a log-heavy application or database checkpoint.
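
This kind of eyeballing can be automated with a little awk. The sketch below assumes the simplified column layout used in the scenarios above (Device, r/s, w/s, r_await, w_await, aqu-sz, %util); real `iostat -x` output has more columns and their order varies across sysstat versions, so adjust the field numbers before pointing it at live data.

```shell
# Sketch: flag a device as saturated when %util > 80 or aqu-sz > 1,
# the same thresholds as the column reference table above.
# Input: Device r/s w/s r_await w_await aqu-sz %util
check_io() {
    awk '$7 + 0 > 80 || $6 + 0 > 1 { print $1 ": saturated"; next }
         { print $1 ": ok" }'
}

# Try it on the healthy-SSD and saturated-HDD scenarios:
printf '%s\n' \
    'sda 150.0 200.0 0.15 0.25 0.05 3.50' \
    'sdb 120.0 30.0 45.00 12.00 5.60 99.80' | check_io
```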

iotop: Per-Process I/O

iotop shows which processes are performing the most I/O -- the I/O equivalent of top for CPU.

# Install iotop
$ sudo apt install iotop    # Debian/Ubuntu
$ sudo dnf install iotop    # Fedora/RHEL

# Run iotop (requires root)
$ sudo iotop
Total DISK READ:       5.23 M/s | Total DISK WRITE:      12.45 M/s
Current DISK READ:     5.23 M/s | Current DISK WRITE:    12.45 M/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
 1234 be/4  mysql      3.45 M/s    8.90 M/s  0.00 %  65.23 %  mysqld
 5678 be/4  www-data   1.23 M/s    2.34 M/s  0.00 %  12.45 %  apache2
 9012 be/4  root       0.50 M/s    1.21 M/s  0.00 %   5.67 %  rsync

iotop Options

# Show only processes actually doing I/O
$ sudo iotop -o

# Batch mode (for scripting)
$ sudo iotop -b -n 5 -d 2

# Show accumulated I/O instead of bandwidth
$ sudo iotop -a

# Show specific process
$ sudo iotop -p 1234

iotop Alternative: pidstat

If iotop is not available, pidstat from the sysstat package can show per-process I/O:

# Show I/O for all processes, every 2 seconds
$ pidstat -d 2
14:30:00   PID   kB_rd/s   kB_wr/s   kB_ccwr/s  iodelay  Command
14:30:02  1234   3534.00   9114.00       0.00      15     mysqld
14:30:02  5678   1260.60   2396.40       0.00       3     apache2

fio: Benchmarking Disk Performance

fio (Flexible I/O Tester) is the standard tool for benchmarking storage performance. It can simulate virtually any I/O workload.

# Install fio
$ sudo apt install fio    # Debian/Ubuntu
$ sudo dnf install fio    # Fedora/RHEL

# Random read test (simulates database workload)
# Note: point --directory at the disk under test; /tmp is tmpfs
# (RAM-backed) on some distros, which would benchmark memory instead
$ fio --name=random-read \
      --ioengine=libaio \
      --direct=1 \
      --bs=4k \
      --iodepth=32 \
      --numjobs=4 \
      --size=1G \
      --rw=randread \
      --runtime=30 \
      --time_based \
      --directory=/tmp

# Key output:
#   read: IOPS=45678, BW=178MiB/s
#   lat (usec): min=45, max=2345, avg=89.23

# Sequential write test (simulates log writing)
$ fio --name=seq-write \
      --ioengine=libaio \
      --direct=1 \
      --bs=128k \
      --iodepth=8 \
      --numjobs=1 \
      --size=2G \
      --rw=write \
      --runtime=30 \
      --time_based \
      --directory=/tmp

# Mixed random read/write (70/30 split, simulates OLTP)
$ fio --name=mixed-rw \
      --ioengine=libaio \
      --direct=1 \
      --bs=4k \
      --iodepth=32 \
      --numjobs=4 \
      --size=1G \
      --rw=randrw \
      --rwmixread=70 \
      --runtime=30 \
      --time_based \
      --directory=/tmp

Understanding fio Output

random-read: (groupid=0, jobs=4): err= 0: pid=1234
  read: IOPS=45.7k, BW=178MiB/s (187MB/s)(5.22GiB/30001msec)
    slat (nsec): min=1200, max=123456, avg=2345.67
    clat (usec): min=45, max=12345, avg=89.23, stdev=34.56
     lat (usec): min=47, max=12348, avg=91.57, stdev=34.78
    clat percentiles (usec):
     |  1.00th=[   52],  5.00th=[   58], 10.00th=[   62],
     | 50.00th=[   82], 90.00th=[  120], 95.00th=[  145],
     | 99.00th=[  245], 99.50th=[  334], 99.90th=[  734],
     | 99.95th=[ 1123], 99.99th=[ 2345]

Key metrics:

  • IOPS: Number of I/O operations per second (higher = better for random workloads)
  • BW (Bandwidth): Throughput in MB/s (higher = better for sequential workloads)
  • clat (completion latency): Time from submission to completion (lower = better)
  • Percentiles: p99 latency matters more than average for databases

IOPS, Throughput, and Latency

┌──────────────────────────────────────────────────────────┐
│         THE THREE I/O PERFORMANCE METRICS                 │
│                                                           │
│  IOPS (I/O Operations Per Second)                         │
│  - How many read/write operations per second              │
│  - Critical for: databases, random I/O workloads          │
│  - HDD: ~100-200 IOPS | SSD: 10K-100K+ IOPS             │
│                                                           │
│  Throughput (MB/s)                                         │
│  - How much data transferred per second                   │
│  - Critical for: streaming, backups, sequential I/O       │
│  - HDD: ~100-200 MB/s | SSD: 500-7000 MB/s               │
│                                                           │
│  Latency (ms or us)                                        │
│  - How long each operation takes                          │
│  - Critical for: user-facing applications, databases      │
│  - HDD: 5-15ms | SSD: 0.1-1ms | NVMe: 0.01-0.1ms        │
│                                                           │
│  Relationship:                                            │
│  IOPS = 1 / Latency (approximately, with queue depth 1)  │
│  Throughput = IOPS x Block Size                           │
│                                                           │
└──────────────────────────────────────────────────────────┘
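
The two relationships at the bottom of the box are easy to sanity-check with awk arithmetic (the numbers are illustrative, not measurements):

```shell
# IOPS ~ 1 / latency at queue depth 1: a device with 0.1 ms latency
# tops out around 1000 / 0.1 = 10000 operations per second.
awk 'BEGIN { lat_ms = 0.1; printf "max IOPS: %.0f\n", 1000 / lat_ms }'

# Throughput = IOPS x Block Size: 10000 IOPS of 4 KiB blocks.
awk 'BEGIN { iops = 10000; bs_kib = 4
             printf "throughput: %.1f MiB/s\n", iops * bs_kib / 1024 }'
```

This is also why the same disk can look "fast" or "slow" depending on block size: 10000 IOPS is only ~39 MiB/s at 4 KiB blocks, but ~1250 MiB/s at 128 KiB blocks.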

I/O Wait in top

When you see high wa (I/O wait) in top, it means CPUs are idle because processes are waiting for disk I/O.

%Cpu(s):  2.1 us,  1.3 sy,  0.0 ni,  21.5 id,  75.0 wa,  0.0 hi,  0.1 si,  0.0 st
                                                  ^^^^
                                              75% I/O wait!

I/O wait is NOT CPU usage -- it means the CPU has nothing to do because it is waiting for disk. High I/O wait indicates a disk bottleneck, not a CPU problem.

Diagnosing High I/O Wait

# Step 1: Confirm I/O wait
$ top -bn1 | head -5
# Look at %wa

# Step 2: Identify the saturated device
$ iostat -xz 1 3
# Look for %util near 100%

# Step 3: Find the guilty process
$ sudo iotop -o
# Identify the process with highest disk I/O

# Step 4: Understand what it is doing
$ sudo strace -e trace=read,write,open,openat -p <PID> 2>&1 | head -20
# Shows exactly which files the process is reading/writing
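
Step 1 can be front-ended with a tiny triage helper. The `classify_iowait` name and its thresholds are rough rules of thumb assumed for this sketch, not kernel-defined limits:

```shell
# Sketch: classify a %wa reading from top. Thresholds (10, 30) are
# assumed rules of thumb, not standardized values.
classify_iowait() {
    awk -v wa="$1" 'BEGIN {
        if (wa + 0 > 30)      print "high - likely disk bottleneck"
        else if (wa + 0 > 10) print "elevated - worth watching"
        else                  print "normal"
    }'
}

classify_iowait 75    # the 75% wa figure from the top output above
```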

Disk Cache and Page Cache

Linux uses free RAM as a read cache for disk data. This is the page cache -- one of the most important performance features of the kernel.

┌──────────────────────────────────────────────────────────┐
│                    PAGE CACHE                             │
│                                                           │
│   Application reads file "data.db"                        │
│         │                                                 │
│         ▼                                                 │
│   ┌─── Is it in the page cache? ───┐                     │
│   │                                 │                     │
│   YES                               NO                    │
│   │                                 │                     │
│   ▼                                 ▼                     │
│   Return from RAM                Read from disk           │
│   (~100 nanoseconds)             (~10 milliseconds)       │
│                                     │                     │
│                                     ▼                     │
│                               Store in page cache         │
│                               (for future reads)          │
│                                     │                     │
│                                     ▼                     │
│                               Return to application       │
│                                                           │
│   Speed difference: ~100,000x faster from cache!          │
└──────────────────────────────────────────────────────────┘
# See page cache usage
$ free -h | grep Mem
Mem:            16Gi       4.5Gi       3.2Gi       256Mi       8.3Gi        11Gi
#                                                              ^^^^
#                                                         8.3 GB of cache

# See cache hit rate with cachestat (if available)
# Part of BCC/BPF tools
$ sudo cachestat 1
    HITS   MISSES  DIRTIES HITRATIO   BUFFERS_MB  CACHED_MB
   45678     234      567   99.49%          120       8192

A 99%+ hit ratio means almost all reads come from cache -- excellent performance.
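
HITRATIO is simply hits / (hits + misses); you can reproduce the 99.49% figure from the cachestat sample above:

```shell
# Hit ratio from the sample: 45678 hits, 234 misses.
awk 'BEGIN { hits = 45678; misses = 234
             printf "hit ratio: %.2f%%\n", 100 * hits / (hits + misses) }'
```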


Tuning Dirty Page Writeback

When applications write data, it goes into the page cache first (dirty pages) and is flushed to disk later by background kernel threads. You can tune how quickly dirty pages are written to disk.

# View current settings
$ sysctl vm.dirty_ratio
vm.dirty_ratio = 20

$ sysctl vm.dirty_background_ratio
vm.dirty_background_ratio = 10

$ sysctl vm.dirty_expire_centisecs
vm.dirty_expire_centisecs = 3000

$ sysctl vm.dirty_writeback_centisecs
vm.dirty_writeback_centisecs = 500

Parameter                  Default  Meaning
---------                  -------  -------
dirty_ratio                20       Max % of total RAM for dirty pages; processes block when exceeded
dirty_background_ratio     10       Start background writeback when dirty pages exceed this %
dirty_expire_centisecs     3000     Dirty pages older than 30s get written out
dirty_writeback_centisecs  500      Flush thread wakes up every 5s to check for dirty pages
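
To translate the percentage knobs into absolute numbers, multiply by total RAM. Assuming a 16 GB machine and the defaults above:

```shell
# With 16 GB of RAM, dirty_background_ratio=10 and dirty_ratio=20:
awk 'BEGIN { ram_gb = 16
             printf "background writeback starts: %.1f GB dirty\n", ram_gb * 0.10
             printf "writers start blocking:      %.1f GB dirty\n", ram_gb * 0.20 }'
```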

Tuning for Different Workloads

# Database server: flush to disk more aggressively (data safety)
$ sudo sysctl -w vm.dirty_ratio=5
$ sudo sysctl -w vm.dirty_background_ratio=2

# Write-heavy batch server: allow more dirty pages (throughput)
$ sudo sysctl -w vm.dirty_ratio=40
$ sudo sysctl -w vm.dirty_background_ratio=10

# Make persistent
$ sudo vim /etc/sysctl.d/99-disk-tuning.conf
vm.dirty_ratio = 5
vm.dirty_background_ratio = 2

Think About It: If you set dirty_ratio=80 on a server with 16 GB of RAM, up to 12.8 GB of data could be in memory but not yet on disk. What happens if the power fails?


SSD vs HDD Considerations

┌───────────────────────────────────────────────────────┐
│              SSD vs HDD CHARACTERISTICS                │
├──────────────────┬──────────────┬─────────────────────┤
│  Property         │  HDD         │  SSD/NVMe           │
├──────────────────┼──────────────┼─────────────────────┤
│  Random IOPS     │  100-200     │  10K-1M+            │
│  Sequential R/W  │  100-200 MB/s│  500-7000 MB/s      │
│  Latency         │  5-15 ms     │  0.01-1 ms          │
│  Seek time       │  Yes (slow)  │  No (instant)       │
│  Write endurance │  Unlimited   │  Limited (TBW)      │
│  Cost per GB     │  Low         │  Higher             │
│  Power usage     │  Higher      │  Lower              │
│  Noise           │  Yes         │  Silent             │
│  I/O scheduler   │  mq-deadline │  none               │
│  Defrag needed   │  Yes         │  No (harmful!)      │
└──────────────────┴──────────────┴─────────────────────┘

SSD-Specific Considerations

# Check if TRIM is supported
$ sudo hdparm -I /dev/sda | grep -i trim
    *    Data Set Management TRIM supported

# Enable periodic TRIM via systemd timer
$ sudo systemctl enable --now fstrim.timer

# Or use continuous TRIM in fstab (less preferred, slight overhead)
# /dev/sda1  /  ext4  defaults,discard  0 1

# Check SSD write endurance (Total Bytes Written)
$ sudo smartctl -A /dev/sda | grep -i "total.*written"
241 Total_LBAs_Written      0x0032   099   099   ---    Old_age   512345678
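
The raw LBA count is easier to reason about as bytes. The conversion below assumes 512-byte logical sectors, which is common but worth confirming on the actual drive (e.g. with `blockdev --getss`):

```shell
# Convert the Total_LBAs_Written raw value from the smartctl output
# above to terabytes, assuming 512-byte logical sectors.
awk 'BEGIN { lbas = 512345678; sector = 512
             printf "%.2f TB written\n", lbas * sector / 1e12 }'
```

Compare the result against the drive's rated TBW (Total Bytes Written) from its datasheet to estimate remaining endurance.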

WARNING: Never defragment an SSD. Defragmentation writes data unnecessarily, reducing the SSD's lifespan without improving performance (SSDs have no seek time to optimize away).


Debug This

An administrator reports that the server "feels slow" after they enabled full database logging. Here is the diagnostic data:

$ iostat -x 1 3
Device  r/s     w/s    rkB/s    wkB/s   r_await  w_await  aqu-sz  %util
sda     12.0   850.0   48.0     65000.0   0.5     25.0     21.2   99.8

Analysis:

  1. 850 writes/second at 65 MB/s write throughput -- this is a write-heavy workload.
  2. Write latency is 25ms -- slow, indicating saturation.
  3. Queue depth is 21.2 -- deep queue means requests are piling up.
  4. %util is 99.8% -- the device is fully saturated.
  5. Read performance is fine (r_await 0.5ms) when reads can get through.

The problem: Full database logging is generating massive write I/O that is saturating the disk.

Solutions:

  1. Move the log files to a separate, faster disk (SSD/NVMe)
  2. Reduce logging verbosity
  3. Use async writes for logs (accept small risk of data loss on crash)
  4. Increase dirty page ratios to batch more writes together
  5. Upgrade to an SSD if currently on HDD

┌──────────────────────────────────────────────────────────┐
│                  What Just Happened?                      │
├──────────────────────────────────────────────────────────┤
│                                                           │
│  I/O schedulers control request ordering:                 │
│  - mq-deadline: good for HDDs, databases                  │
│  - bfq: good for interactive/desktop                      │
│  - kyber: lightweight, for fast SSDs                      │
│  - none: best for NVMe (device does its own scheduling)  │
│                                                           │
│  Key tools:                                               │
│  - iostat: device-level I/O statistics                    │
│  - iotop: per-process I/O usage                           │
│  - fio: I/O benchmarking                                  │
│                                                           │
│  Three metrics that matter:                               │
│  - IOPS: operations per second (random workloads)         │
│  - Throughput: MB/s (sequential workloads)                │
│  - Latency: response time per operation                   │
│                                                           │
│  Page cache makes reads fast by caching in RAM.           │
│  Dirty page settings control write buffering.             │
│  I/O wait (%wa in top) = CPU idle, waiting for disk.      │
│                                                           │
└──────────────────────────────────────────────────────────┘

Try This

  1. Identify your scheduler: Check which I/O scheduler each of your block devices is using. Is it appropriate for the device type (HDD vs SSD)?

  2. iostat monitoring: Run iostat -xz 2 while copying a large file. Observe the %util, await, and throughput columns changing in real time.

  3. fio benchmark: Benchmark your disk with fio using a 4K random read test. Record the IOPS and latency. Then run a sequential write test and record the throughput. How do your numbers compare to the theoretical maximums for your device type?

  4. Find the I/O hog: Use iotop during normal system operation to identify which processes are performing the most disk I/O. Are any of them surprising?

  5. Bonus challenge: Change the I/O scheduler on one of your devices, run the same fio benchmark, and compare results. Try mq-deadline vs bfq vs none and document the differences in IOPS and latency.