Monitoring & Alerting Stack

Why This Matters

It is 2am and your phone buzzes. The e-commerce site is down. You SSH into the server and discover the disk is full -- something has been writing massive log files for weeks. If only someone had been watching.

Now imagine the alternative: three weeks ago, your monitoring system noticed disk usage crossing 80%. It sent an alert. You investigated, found the runaway log, set up rotation, and went back to sleep. No outage. No 2am phone call.

Monitoring is not optional for any system that matters. Without it, you are flying blind. With it, you can detect problems before they become outages, understand performance trends, and make informed capacity decisions.

This chapter builds the most popular open-source monitoring stack from the ground up: Prometheus for metrics collection, Grafana for visualization, node_exporter for system metrics, and Alertmanager for sending alerts.


Try This Right Now

Even without Prometheus, you can see what system metrics look like. Run these commands and think about which ones you would want to track over time:

# CPU load averages
$ cat /proc/loadavg
0.35 0.42 0.38 2/487 12345

# Memory usage
$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7964        2134        1204         125        4625        5425
Swap:          2048          12        2036

# Disk usage
$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G   28G   20G  59% /

# Network connections
$ ss -s
Total: 487
TCP:   42 (estab 12, closed 8, orphaned 0, timewait 6)

If you could see these numbers on a graph, updated every 15 seconds, over the last 30 days -- that is what this chapter builds.
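
Before installing anything, it is worth feeling how tedious manual sampling is. A minimal sketch in plain shell, reading /proc/loadavg as above -- essentially what Prometheus automates every 15 seconds:

```shell
# Take three readings of the 1-minute load average, five seconds apart.
for i in 1 2 3; do
    echo "$(date +%T) load1: $(cut -d' ' -f1 /proc/loadavg)"
    sleep 5
done
```

Now imagine doing that for dozens of metrics, on dozens of servers, around the clock.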


Monitoring Philosophy: The Three Pillars

Modern observability rests on three pillars:

┌──────────────────────────────────────────────────────────────┐
│                  THE THREE PILLARS                            │
│                                                              │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │   METRICS   │  │    LOGS     │  │   TRACES    │          │
│  │             │  │             │  │             │          │
│  │ Numbers     │  │ Events      │  │ Request     │          │
│  │ over time   │  │ with        │  │ paths       │          │
│  │             │  │ context     │  │ through     │          │
│  │ CPU: 45%    │  │ "Error:     │  │ services    │          │
│  │ RAM: 62%    │  │  connection │  │             │          │
│  │ Req/s: 150  │  │  refused"   │  │ A → B → C  │          │
│  │ Latency: 8ms│  │             │  │ (200ms)     │          │
│  └─────────────┘  └─────────────┘  └─────────────┘          │
│                                                              │
│  This chapter focuses on METRICS (Prometheus + Grafana).     │
│  Logs were covered in Chapter 17.                            │
│  Traces are used in microservice architectures.              │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Metrics

Metrics are numeric measurements collected at regular intervals: CPU usage, memory consumption, request rate, error count, response time. They are cheap to store, fast to query, and excellent for dashboards and alerts.

Logs

Logs are timestamped records of discrete events: "User john logged in," "Connection to database refused," "Payment processed successfully." They provide context but are expensive to store and search at scale.

Traces

Traces follow a single request as it moves through multiple services. They show where time is spent and where failures occur. Essential for microservices, less critical for single-server setups.


Prometheus Architecture

Prometheus is the de facto standard for open-source metrics monitoring. It was created at SoundCloud, inspired by Google's internal Borgmon system, and is now a graduated CNCF project.

┌──────────────────────────────────────────────────────────────┐
│                 PROMETHEUS ARCHITECTURE                       │
│                                                              │
│  ┌──────────────┐                                            │
│  │ Prometheus   │◄── scrapes ─── node_exporter (port 9100)   │
│  │ Server       │◄── scrapes ─── nginx_exporter (port 9113)  │
│  │              │◄── scrapes ─── app /metrics endpoint       │
│  │ ┌──────────┐ │                                            │
│  │ │  TSDB    │ │    Time Series Database stores all metrics │
│  │ │ (local)  │ │                                            │
│  │ └──────────┘ │                                            │
│  │ ┌──────────┐ │                                            │
│  │ │ PromQL   │ │    Query language for metrics              │
│  │ │ Engine   │ │                                            │
│  │ └──────────┘ │                                            │
│  │ ┌──────────┐ │                                            │
│  │ │ Alert    │ │    Evaluates alert rules                   │
│  │ │ Rules    │ │                                            │
│  │ └──────────┘ │                                            │
│  └──────┬───────┘                                            │
│         │                                                    │
│         ▼                                                    │
│  ┌──────────────┐         ┌──────────────┐                   │
│  │ Alertmanager │────────►│ Notifications│                   │
│  │              │         │ Email, Slack │                   │
│  └──────────────┘         └──────────────┘                   │
│         ▲                                                    │
│  ┌──────────────┐                                            │
│  │   Grafana    │  Queries Prometheus for dashboard data     │
│  │              │                                            │
│  └──────────────┘                                            │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Key Design Decisions

Pull-based scraping: Prometheus pulls metrics from targets at regular intervals (default: every 15 seconds). Targets expose metrics on HTTP endpoints. This is fundamentally different from push-based systems where agents send metrics to a central server.

Local time-series database (TSDB): Prometheus stores data on local disk in an efficient, compressed format. No external database needed.

Multi-dimensional data model: Every metric has a name and a set of key-value labels:

http_requests_total{method="GET", handler="/api/users", status="200"} 14523
http_requests_total{method="POST", handler="/api/users", status="201"} 892
http_requests_total{method="GET", handler="/api/users", status="500"} 3
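
Labels are what make the model multi-dimensional: any query can filter or aggregate across them. Using the series above:

```promql
# Requests per status code, summed across methods
sum by (status) (http_requests_total{handler="/api/users"})

# Fraction of requests to /api/users that returned a 5xx status
sum(http_requests_total{handler="/api/users", status=~"5.."})
  / sum(http_requests_total{handler="/api/users"})
```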

Think About It: Why would a pull-based model be preferred over push-based for monitoring? Think about what happens when a target goes down -- how does each model detect it?


Installing node_exporter

node_exporter exposes Linux system metrics in Prometheus format. It is the first thing to install on any server you want to monitor.

Download and Install

# Download node_exporter
$ wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz

# Extract
$ tar xzf node_exporter-1.7.0.linux-amd64.tar.gz

# Move binary
$ sudo mv node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/

# Create a system user
$ sudo useradd --no-create-home --shell /usr/sbin/nologin node_exporter
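
Before installing, it is good practice to verify the tarball against the sha256sums.txt file published on the same release page. The verification pattern, shown here self-contained with a stand-in file:

```shell
# Stand-in for the real tarball; with the actual download you would use
# the sha256sums.txt published alongside the GitHub release instead.
echo "demo contents" > /tmp/demo.tar.gz
sha256sum /tmp/demo.tar.gz > /tmp/sha256sums.txt

# The actual check: prints OK per file, exits non-zero on a mismatch
sha256sum -c /tmp/sha256sums.txt
```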

Create a systemd Service

$ sudo tee /etc/systemd/system/node_exporter.service << 'EOF'
[Unit]
Description=Prometheus Node Exporter
After=network-online.target
Wants=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

$ sudo systemctl daemon-reload
$ sudo systemctl enable --now node_exporter

Verify It Works

$ curl -s http://localhost:9100/metrics | head -20
# HELP go_gc_duration_seconds A summary of pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 2.5717e-05
...
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 123456.78
node_cpu_seconds_total{cpu="0",mode="system"} 4567.89
node_cpu_seconds_total{cpu="0",mode="user"} 7890.12

Those lines are Prometheus metrics. Every metric has a name, optional labels, and a numeric value. node_exporter exposes hundreds of metrics covering CPU, memory, disk, network, filesystem, and more.
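
Because the format is plain text, ordinary shell tools can work with it. A sketch that totals the per-mode CPU counters from a saved sample (the numbers are the ones from the output above):

```shell
# Save a few sample lines, as scraped above.
cat << 'EOF' > /tmp/sample_metrics.txt
node_cpu_seconds_total{cpu="0",mode="idle"} 123456.78
node_cpu_seconds_total{cpu="0",mode="system"} 4567.89
node_cpu_seconds_total{cpu="0",mode="user"} 7890.12
EOF

# The metric value is always the last whitespace-separated field.
awk '/^node_cpu_seconds_total/ { total += $NF }
     END { printf "total CPU seconds: %.2f\n", total }' /tmp/sample_metrics.txt
```

In practice you never parse metrics by hand like this -- that is Prometheus's job -- but it demonstrates how simple the exposition format is.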


Installing Prometheus

Download and Install

# Download Prometheus
$ wget https://github.com/prometheus/prometheus/releases/download/v2.49.0/prometheus-2.49.0.linux-amd64.tar.gz

# Extract
$ tar xzf prometheus-2.49.0.linux-amd64.tar.gz

# Move binaries
$ sudo mv prometheus-2.49.0.linux-amd64/prometheus /usr/local/bin/
$ sudo mv prometheus-2.49.0.linux-amd64/promtool /usr/local/bin/

# Create directories
$ sudo mkdir -p /etc/prometheus /var/lib/prometheus

# Copy console libraries
$ sudo mv prometheus-2.49.0.linux-amd64/consoles /etc/prometheus/
$ sudo mv prometheus-2.49.0.linux-amd64/console_libraries /etc/prometheus/

# Create system user
$ sudo useradd --no-create-home --shell /usr/sbin/nologin prometheus
$ sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus

Configure Prometheus

$ sudo tee /etc/prometheus/prometheus.yml << 'EOF'
global:
  scrape_interval: 15s          # How often to scrape targets
  evaluation_interval: 15s      # How often to evaluate alert rules
  scrape_timeout: 10s           # Timeout for each scrape

scrape_configs:
  - job_name: 'prometheus'      # Monitor Prometheus itself
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'            # Monitor system via node_exporter
    static_configs:
      - targets: ['localhost:9100']
        labels:
          instance: 'my-server'
          environment: 'production'
EOF

$ sudo chown prometheus:prometheus /etc/prometheus/prometheus.yml
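
Growing the setup later is just more entries under scrape_configs. A sketch of the node job with a second host added (10.0.0.12 is a hypothetical address):

```yaml
  - job_name: 'node'
    static_configs:
      - targets:
          - 'localhost:9100'
          - '10.0.0.12:9100'   # hypothetical second server running node_exporter
```

After any edit, `promtool check config /etc/prometheus/prometheus.yml` validates the file before you restart the service.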

Create a systemd Service

$ sudo tee /etc/systemd/system/prometheus.service << 'EOF'
[Unit]
Description=Prometheus Monitoring System
After=network-online.target
Wants=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/var/lib/prometheus/ \
    --storage.tsdb.retention.time=30d \
    --web.console.templates=/etc/prometheus/consoles \
    --web.console.libraries=/etc/prometheus/console_libraries

[Install]
WantedBy=multi-user.target
EOF

$ sudo systemctl daemon-reload
$ sudo systemctl enable --now prometheus

Verify Prometheus Is Running

$ curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool

You should see your targets listed with "health": "up".

Open http://localhost:9090 in a browser to access the Prometheus web UI.


PromQL Basics

PromQL (Prometheus Query Language) is how you ask questions about your metrics. You will use it in Prometheus's web UI and in Grafana dashboards.

Simple Queries

# Current CPU usage across all modes
node_cpu_seconds_total

# Filter by label: only idle CPU time
node_cpu_seconds_total{mode="idle"}

# Current memory available in bytes
node_memory_MemAvailable_bytes

# Filesystem usage
node_filesystem_avail_bytes{mountpoint="/"}

Rate and Aggregation

# CPU usage rate over last 5 minutes (per second)
rate(node_cpu_seconds_total{mode="idle"}[5m])

# Total CPU utilization percentage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Disk I/O rate (bytes per second read)
rate(node_disk_read_bytes_total[5m])

# Network traffic rate (bytes per second received)
rate(node_network_receive_bytes_total{device="eth0"}[5m])

# Memory usage percentage
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
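
PromQL can also extrapolate trends. predict_linear() fits a linear regression to a range and projects it forward -- the kind of query that would have caught the disk-full incident from the start of this chapter weeks early:

```promql
# Based on the last 6 hours of data, will the root filesystem's
# free space reach zero within the next 4 hours?
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 4 * 3600) < 0
```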

Try It in Prometheus UI

Navigate to http://localhost:9090/graph, type a query, and click "Execute." Switch to the "Graph" tab to see the metric over time.

# Good first query -- see memory usage as a percentage
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100


Installing Grafana

Grafana provides beautiful, interactive dashboards for your metrics.

On Debian/Ubuntu

$ sudo apt-get install -y apt-transport-https software-properties-common
$ wget -q -O /usr/share/keyrings/grafana.key https://apt.grafana.com/gpg.key
$ echo "deb [signed-by=/usr/share/keyrings/grafana.key] https://apt.grafana.com stable main" \
    | sudo tee /etc/apt/sources.list.d/grafana.list
$ sudo apt-get update
$ sudo apt-get install -y grafana
$ sudo systemctl enable --now grafana-server

On RHEL/Fedora

$ sudo tee /etc/yum.repos.d/grafana.repo << 'EOF'
[grafana]
name=grafana
baseurl=https://rpm.grafana.com
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://rpm.grafana.com/gpg.key
EOF
$ sudo dnf install -y grafana
$ sudo systemctl enable --now grafana-server

Grafana runs on port 3000. Default credentials: admin / admin (you will be forced to change the password on first login).


Hands-On: Connect Grafana to Prometheus

Step 1: Open Grafana at http://localhost:3000 and log in.

Step 2: Add Prometheus as a data source:

  • Navigate to Configuration (gear icon) > Data Sources > Add data source
  • Select "Prometheus"
  • Set URL to http://localhost:9090
  • Click "Save & Test" -- you should see "Data source is working"

Step 3: Import a pre-built dashboard for node_exporter:

  • Navigate to Dashboards > Import
  • Enter dashboard ID: 1860 (Node Exporter Full)
  • Select your Prometheus data source
  • Click "Import"

You should immediately see a dashboard with CPU usage, memory, disk I/O, network traffic, and dozens of other system metrics, with data going back to when you started Prometheus.

Step 4: Create a custom dashboard panel:

  • Click "+ New dashboard" > "Add visualization"
  • Select your Prometheus data source
  • In the query field, enter:
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
  • Set the panel title to "Memory Usage %"
  • Under Standard options, set Unit to "Percent (0-100)"
  • Click "Apply"

You now have a live memory usage graph.

Think About It: You want to monitor 50 servers with Prometheus and Grafana. Would you install Prometheus on each server, or have one central Prometheus instance scrape all 50? What are the trade-offs?


Alertmanager: Sending Alerts

Monitoring without alerting means someone has to watch dashboards all day. Alertmanager sends notifications when metrics cross thresholds.

Install Alertmanager

$ wget https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz
$ tar xzf alertmanager-0.27.0.linux-amd64.tar.gz
$ sudo mv alertmanager-0.27.0.linux-amd64/alertmanager /usr/local/bin/
$ sudo mv alertmanager-0.27.0.linux-amd64/amtool /usr/local/bin/
$ sudo mkdir -p /etc/alertmanager
$ sudo useradd --no-create-home --shell /usr/sbin/nologin alertmanager

Configure Alertmanager

$ sudo tee /etc/alertmanager/alertmanager.yml << 'EOF'
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity']
  group_wait: 10s           # Wait before sending first notification
  group_interval: 10m       # Wait between notifications for same group
  repeat_interval: 3h       # Re-send if alert still firing
  receiver: 'default'

  routes:
    - matchers:
        - severity="critical"
      receiver: 'critical-alerts'

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://localhost:5001/alert'   # Example webhook endpoint

  - name: 'critical-alerts'
    webhook_configs:
      - url: 'http://localhost:5001/critical'
EOF

For email notifications, configure the email_configs receiver:

receivers:
  - name: 'email-alerts'
    email_configs:
      - to: 'ops-team@example.com'
        from: 'prometheus@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'prometheus@example.com'
        auth_password: 'smtp-password'

Create a systemd Service for Alertmanager

$ sudo tee /etc/systemd/system/alertmanager.service << 'EOF'
[Unit]
Description=Prometheus Alertmanager
After=network-online.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/usr/local/bin/alertmanager \
    --config.file=/etc/alertmanager/alertmanager.yml \
    --storage.path=/var/lib/alertmanager/

[Install]
WantedBy=multi-user.target
EOF

$ sudo mkdir -p /var/lib/alertmanager
$ sudo chown alertmanager:alertmanager /var/lib/alertmanager
$ sudo systemctl daemon-reload
$ sudo systemctl enable --now alertmanager


Defining Alert Rules

Alert rules are defined in Prometheus, not Alertmanager. Prometheus evaluates rules and sends firing alerts to Alertmanager.

Create Alert Rules

$ sudo tee /etc/prometheus/alert_rules.yml << 'EOF'
groups:
  - name: system_alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage is above 80% for more than 5 minutes (current: {{ $value }}%)"

      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage detected"
          description: "Memory usage is above 85% (current: {{ $value }}%)"

      - alert: DiskSpaceRunningLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Disk space running low"
          description: "Root filesystem has less than 15% free space (current: {{ $value }}%)"

      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.job }} target {{ $labels.instance }} has been unreachable for 1 minute"

      - alert: HighNetworkTraffic
        expr: rate(node_network_receive_bytes_total{device="eth0"}[5m]) > 100000000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High inbound network traffic"
          description: "Network receive rate exceeds 100MB/s for 5 minutes"
EOF
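
promtool can also unit-test alert rules before they ever reach production. A sketch for the InstanceDown rule above (the file name alert_rules_test.yml is our choice; the test synthesizes a flat up == 0 series rather than scraping anything):

```yaml
# /etc/prometheus/alert_rules_test.yml
# Run with: promtool test rules alert_rules_test.yml
rule_files:
  - alert_rules.yml

evaluation_interval: 15s

tests:
  - interval: 1m
    input_series:
      - series: 'up{job="node", instance="localhost:9100"}'
        values: '0 0 0'
    alert_rule_test:
      - eval_time: 2m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node
              instance: localhost:9100
            exp_annotations:
              summary: "Instance localhost:9100 is down"
              description: "node target localhost:9100 has been unreachable for 1 minute"
```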

Update Prometheus Configuration

Add the alert rules file and Alertmanager connection to prometheus.yml:

$ sudo tee /etc/prometheus/prometheus.yml << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
        labels:
          instance: 'my-server'
          environment: 'production'
EOF

$ sudo chown prometheus:prometheus /etc/prometheus/alert_rules.yml
$ sudo systemctl restart prometheus

Verify Alert Rules

# Check rule syntax
$ promtool check rules /etc/prometheus/alert_rules.yml
Checking /etc/prometheus/alert_rules.yml
  SUCCESS: 5 rules found

Visit http://localhost:9090/alerts to see alert states: inactive (green), pending (yellow), or firing (red).


The Complete Stack: How It All Fits Together

┌──────────────────────────────────────────────────────────────┐
│              COMPLETE MONITORING STACK                        │
│                                                              │
│  YOUR SERVERS                                                │
│  ┌──────────────────────┐   ┌──────────────────────┐         │
│  │ Server 1             │   │ Server 2             │         │
│  │ ┌──────────────────┐ │   │ ┌──────────────────┐ │         │
│  │ │  node_exporter   │ │   │ │  node_exporter   │ │         │
│  │ │  :9100           │ │   │ │  :9100           │ │         │
│  │ └──────────────────┘ │   │ └──────────────────┘ │         │
│  └──────────┬───────────┘   └──────────┬───────────┘         │
│             │    scrape every 15s       │                     │
│             ▼                           ▼                     │
│  ┌──────────────────────────────────────────────────┐        │
│  │            Prometheus :9090                       │        │
│  │  • Scrapes targets                                │        │
│  │  • Stores metrics in TSDB                         │        │
│  │  • Evaluates alert rules                          │        │
│  │  • Serves PromQL queries                          │        │
│  └────────┬─────────────────────┬────────────────────┘        │
│           │                     │                             │
│           ▼                     ▼                             │
│  ┌────────────────┐    ┌────────────────┐                    │
│  │ Alertmanager   │    │   Grafana      │                    │
│  │ :9093          │    │   :3000        │                    │
│  │                │    │                │                    │
│  │ Routes alerts  │    │ Dashboards     │                    │
│  │ to receivers   │    │ from PromQL    │                    │
│  └───────┬────────┘    └────────────────┘                    │
│          │                                                   │
│          ▼                                                   │
│  Email, Slack, Webhook, PagerDuty, etc.                      │
│                                                              │
└──────────────────────────────────────────────────────────────┘
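
With everything on one host, as in this chapter's setup, a quick smoke test confirms each component answers on its port:

```shell
# Ports as used throughout this chapter: node_exporter, Prometheus,
# Alertmanager, Grafana. -f makes curl fail on HTTP errors >= 400.
for port in 9100 9090 9093 3000; do
    if curl -sf -o /dev/null "http://localhost:$port"; then
        echo "port $port: up"
    else
        echo "port $port: DOWN"
    fi
done
```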

Debug This

Your monitoring stack is set up, but Prometheus shows the node target as "DOWN" even though node_exporter is running. The error message says "connection refused."

$ sudo systemctl status node_exporter
● node_exporter.service - Prometheus Node Exporter
     Active: active (running)

$ curl http://localhost:9100/metrics
curl: (7) Failed to connect to localhost port 9100: Connection refused

What is going on?

Diagnosis steps:

# Check what port node_exporter is actually listening on
$ ss -tlnp | grep node_exporter

# Check if a firewall is blocking the port
$ sudo iptables -L -n | grep 9100
$ sudo ufw status

# Check if node_exporter is bound to a specific interface
$ journalctl -u node_exporter -n 20

Common causes:

  1. node_exporter is listening on a different interface (e.g., 127.0.0.1 vs 0.0.0.0)
  2. A firewall is blocking port 9100
  3. Another service is using port 9100, and node_exporter failed to bind (the journal will show the bind error)
  4. SELinux or AppArmor is blocking the connection
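
If cause 1 is the culprit, the listen address can be pinned explicitly: node_exporter accepts a --web.listen-address flag, which can be added with a systemd drop-in (sudo systemctl edit node_exporter):

```ini
# Drop-in override: clear the original ExecStart, then set it again
# with an explicit listen address so the exporter binds all interfaces.
[Service]
ExecStart=
ExecStart=/usr/local/bin/node_exporter --web.listen-address=0.0.0.0:9100
```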

What Just Happened?

┌──────────────────────────────────────────────────────────────┐
│                    CHAPTER 70 RECAP                           │
│──────────────────────────────────────────────────────────────│
│                                                              │
│  Monitoring = metrics + logs + traces                        │
│                                                              │
│  The Prometheus Stack:                                       │
│  • node_exporter: exposes system metrics on :9100            │
│  • Prometheus: scrapes targets, stores TSDB, evaluates rules │
│  • Grafana: visualizes metrics in dashboards                 │
│  • Alertmanager: routes alerts to email, Slack, etc.         │
│                                                              │
│  PromQL essentials:                                          │
│  • rate() for per-second rates of counters                   │
│  • avg(), sum(), max() for aggregation                       │
│  • Label filters: metric{label="value"}                      │
│                                                              │
│  Alert rules:                                                │
│  • Defined in Prometheus config                              │
│  • expr: PromQL expression that triggers the alert           │
│  • for: how long the condition must be true                  │
│  • Routed by Alertmanager to notification channels           │
│                                                              │
│  Key metrics to monitor:                                     │
│  • CPU usage, memory usage, disk space, disk I/O             │
│  • Network traffic, service availability (up/down)           │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Try This

Exercise 1: PromQL Practice

With Prometheus running, write PromQL queries for:

  1. Total number of CPU cores
  2. Current disk I/O write rate for all devices
  3. Total network bytes transmitted in the last hour
  4. System uptime in days

Exercise 2: Custom Dashboard

Create a Grafana dashboard with four panels:

  1. CPU usage over time (line graph)
  2. Memory usage gauge (current %)
  3. Disk space for all mount points (bar chart)
  4. Network traffic in/out (dual line graph)

Exercise 3: Alert Testing

Create an alert rule that fires when filesystem usage exceeds 50% (a threshold low enough to test easily). Verify it appears in the Prometheus alerts page. Try to trigger it by creating a large temporary file.

Exercise 4: Monitor a Service

Install nginx and configure Prometheus to scrape nginx metrics. You will need the nginx-prometheus-exporter. Set up an alert for when nginx is down.

Bonus Challenge

Set up monitoring for multiple machines. Install node_exporter on a second machine (or VM or container), add it to your Prometheus config, and create a Grafana dashboard that compares CPU and memory usage across both machines side by side.