Monitoring & Alerting Stack
Why This Matters
It is 2am and your phone buzzes. The e-commerce site is down. You SSH into the server and discover the disk is full -- something has been writing massive log files for weeks. If only someone had been watching.
Now imagine the alternative: three weeks ago, your monitoring system noticed disk usage crossing 80%. It sent an alert. You investigated, found the runaway log, set up rotation, and went back to sleep. No outage. No 2am phone call.
Monitoring is not optional for any system that matters. Without it, you are flying blind. With it, you can detect problems before they become outages, understand performance trends, and make informed capacity decisions.
This chapter builds the most popular open-source monitoring stack from the ground up: Prometheus for metrics collection, Grafana for visualization, node_exporter for system metrics, and Alertmanager for sending alerts.
Try This Right Now
Even without Prometheus, you can see what system metrics look like. Run these commands and think about which ones you would want to track over time:
# CPU load averages
$ cat /proc/loadavg
0.35 0.42 0.38 2/487 12345
# Memory usage
$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7964        2134        1204         125        4625        5425
Swap:          2048          12        2036
# Disk usage
$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G   28G   20G  59% /
# Network connections
$ ss -s
Total: 487
TCP: 42 (estab 12, closed 8, orphaned 0, timewait 6)
If you could see these numbers on a graph, updated every 15 seconds, over the last 30 days -- that is what this chapter builds.
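In fact, a crude version of that time series takes only a shell loop. The sketch below records the 1-minute load average with a timestamp, once per second (the output path is a hypothetical scratch file) -- the same sample-and-store idea Prometheus automates at scale:

```shell
#!/bin/sh
# Poor man's metrics collection: one "timestamp,load" line per sample.
OUT=/tmp/load_samples.csv                 # hypothetical scratch file
: > "$OUT"                                # start fresh
for i in 1 2 3; do
    # field 1 of /proc/loadavg is the 1-minute load average
    load=$(cut -d' ' -f1 /proc/loadavg 2>/dev/null || echo "0.00")
    echo "$(date +%s),$load" >> "$OUT"
    sleep 1                               # a real scraper would use 15s or so
done
wc -l < "$OUT"                            # 3 samples collected
```

Run it a few times a day for a week and you have a graphable history -- and a vivid sense of why you want a real tool doing this for you.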
Monitoring Philosophy: The Three Pillars
Modern observability rests on three pillars:
┌──────────────────────────────────────────────────────────────┐
│ THE THREE PILLARS │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ METRICS │ │ LOGS │ │ TRACES │ │
│ │ │ │ │ │ │ │
│ │ Numbers │ │ Events │ │ Request │ │
│ │ over time │ │ with │ │ paths │ │
│ │ │ │ context │ │ through │ │
│ │ CPU: 45% │ │ "Error: │ │ services │ │
│ │ RAM: 62% │ │ connection │ │ │ │
│ │ Req/s: 150 │ │ refused" │ │ A → B → C │ │
│ │ Latency: 8ms│ │ │ │ (200ms) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ This chapter focuses on METRICS (Prometheus + Grafana). │
│ Logs were covered in Chapter 17. │
│ Traces are used in microservice architectures. │
│ │
└──────────────────────────────────────────────────────────────┘
Metrics
Metrics are numeric measurements collected at regular intervals: CPU usage, memory consumption, request rate, error count, response time. They are cheap to store, fast to query, and excellent for dashboards and alerts.
Logs
Logs are timestamped records of discrete events: "User john logged in," "Connection to database refused," "Payment processed successfully." They provide context but are expensive to store and search at scale.
Traces
Traces follow a single request as it moves through multiple services. They show where time is spent and where failures occur. Essential for microservices, less critical for single-server setups.
Prometheus Architecture
Prometheus is the de facto standard for open-source metrics monitoring. It was created at SoundCloud, inspired by Google's internal Borgmon system, and is now a graduated CNCF project.
┌──────────────────────────────────────────────────────────────┐
│ PROMETHEUS ARCHITECTURE │
│ │
│ ┌──────────────┐ │
│ │ Prometheus │◄── scrapes ─── node_exporter (port 9100) │
│ │ Server │◄── scrapes ─── nginx_exporter (port 9113) │
│ │ │◄── scrapes ─── app /metrics endpoint │
│ │ ┌──────────┐ │ │
│ │ │ TSDB │ │ Time Series Database stores all metrics │
│ │ │ (local) │ │ │
│ │ └──────────┘ │ │
│ │ ┌──────────┐ │ │
│ │ │ PromQL │ │ Query language for metrics │
│ │ │ Engine │ │ │
│ │ └──────────┘ │ │
│ │ ┌──────────┐ │ │
│ │ │ Alert │ │ Evaluates alert rules │
│ │ │ Rules │ │ │
│ │ └──────────┘ │ │
│ └──────┬───────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Alertmanager │────────►│ Notifications│ │
│ │ │ │ Email, Slack │ │
│ └──────────────┘ └──────────────┘ │
│ ▲ │
│ ┌──────────────┐ │
│ │ Grafana │ Queries Prometheus for dashboard data │
│ │ │ │
│ └──────────────┘ │
│ │
└──────────────────────────────────────────────────────────────┘
Key Design Decisions
Pull-based scraping: Prometheus pulls metrics from targets at regular intervals (every 15 seconds in this chapter's configuration; the built-in default is one minute). Targets expose metrics on HTTP endpoints. This is fundamentally different from push-based systems, where agents send metrics to a central server.
Local time-series database (TSDB): Prometheus stores data on local disk in an efficient, compressed format. No external database needed.
Multi-dimensional data model: Every metric has a name and a set of key-value labels:
http_requests_total{method="GET", handler="/api/users", status="200"} 14523
http_requests_total{method="POST", handler="/api/users", status="201"} 892
http_requests_total{method="GET", handler="/api/users", status="500"} 3
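The label model is what makes PromQL powerful: because every dimension is a label, you can aggregate along any of them. A few illustrative queries over the hypothetical http_requests_total series above (PromQL syntax is covered in detail later in the chapter):

```
# Requests per second, summed across all labels
sum(rate(http_requests_total[5m]))

# Error ratio: share of requests with a 5xx status
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# Per-handler breakdown
sum by (handler) (rate(http_requests_total[5m]))
```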
Think About It: Why would a pull-based model be preferred over push-based for monitoring? Think about what happens when a target goes down -- how does each model detect it?
Installing node_exporter
node_exporter exposes Linux system metrics in Prometheus format. It is the first thing to install on any server you want to monitor.
Download and Install
# Download node_exporter
$ wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
# Extract
$ tar xzf node_exporter-1.7.0.linux-amd64.tar.gz
# Move binary
$ sudo mv node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
# Create a system user
$ sudo useradd --no-create-home --shell /usr/sbin/nologin node_exporter
Create a systemd Service
$ sudo tee /etc/systemd/system/node_exporter.service << 'EOF'
[Unit]
Description=Prometheus Node Exporter
After=network-online.target
Wants=network-online.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter
[Install]
WantedBy=multi-user.target
EOF
$ sudo systemctl daemon-reload
$ sudo systemctl enable --now node_exporter
Verify It Works
$ curl -s http://localhost:9100/metrics | head -20
# HELP go_gc_duration_seconds A summary of pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 2.5717e-05
...
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 123456.78
node_cpu_seconds_total{cpu="0",mode="system"} 4567.89
node_cpu_seconds_total{cpu="0",mode="user"} 7890.12
Those lines are Prometheus metrics. Every metric has a name, optional labels, and a numeric value. node_exporter exposes hundreds of metrics covering CPU, memory, disk, network, filesystem, and more.
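The exposition format is plain text, one metric per line, so standard tools can parse it. A sketch that computes the since-boot idle fraction from the sample lines above (inlined so it runs without node_exporter; in practice you would pipe curl -s localhost:9100/metrics into the same awk):

```shell
# Sample node_exporter output, inlined for a self-contained example
metrics='node_cpu_seconds_total{cpu="0",mode="idle"} 123456.78
node_cpu_seconds_total{cpu="0",mode="system"} 4567.89
node_cpu_seconds_total{cpu="0",mode="user"} 7890.12'

# awk splits on whitespace: $1 is metric{labels}, $2 is the value
echo "$metrics" | awk '
    /^node_cpu_seconds_total/ { total += $2 }
    /mode="idle"/             { idle  += $2 }
    END { printf "idle fraction: %.2f\n", idle / total }'
# → idle fraction: 0.91
```

Because these counters only ever increase, a since-boot fraction is of limited use -- which is exactly why PromQL's rate() function, covered below, exists.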
Installing Prometheus
Download and Install
# Download Prometheus
$ wget https://github.com/prometheus/prometheus/releases/download/v2.49.0/prometheus-2.49.0.linux-amd64.tar.gz
# Extract
$ tar xzf prometheus-2.49.0.linux-amd64.tar.gz
# Move binaries
$ sudo mv prometheus-2.49.0.linux-amd64/prometheus /usr/local/bin/
$ sudo mv prometheus-2.49.0.linux-amd64/promtool /usr/local/bin/
# Create directories
$ sudo mkdir -p /etc/prometheus /var/lib/prometheus
# Move console templates and libraries
$ sudo mv prometheus-2.49.0.linux-amd64/consoles /etc/prometheus/
$ sudo mv prometheus-2.49.0.linux-amd64/console_libraries /etc/prometheus/
# Create system user
$ sudo useradd --no-create-home --shell /usr/sbin/nologin prometheus
$ sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus
Configure Prometheus
$ sudo tee /etc/prometheus/prometheus.yml << 'EOF'
global:
  scrape_interval: 15s      # How often to scrape targets
  evaluation_interval: 15s  # How often to evaluate alert rules
  scrape_timeout: 10s       # Timeout for each scrape

scrape_configs:
  - job_name: 'prometheus'  # Monitor Prometheus itself
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'        # Monitor system via node_exporter
    static_configs:
      - targets: ['localhost:9100']
        labels:
          instance: 'my-server'
          environment: 'production'
EOF
$ sudo chown prometheus:prometheus /etc/prometheus/prometheus.yml
Create a systemd Service
$ sudo tee /etc/systemd/system/prometheus.service << 'EOF'
[Unit]
Description=Prometheus Monitoring System
After=network-online.target
Wants=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/var/lib/prometheus/ \
--storage.tsdb.retention.time=30d \
--web.console.templates=/etc/prometheus/consoles \
--web.console.libraries=/etc/prometheus/console_libraries
[Install]
WantedBy=multi-user.target
EOF
$ sudo systemctl daemon-reload
$ sudo systemctl enable --now prometheus
Verify Prometheus Is Running
$ curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool
You should see your targets listed with "health": "up".
Open http://localhost:9090 in a browser to access the Prometheus web UI.
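The same targets endpoint is handy for quick scripted checks. The sketch below parses a trimmed, inlined sample of the /api/v1/targets response (so it runs without a server); against a live Prometheus you would pipe curl -s http://localhost:9090/api/v1/targets into the same script:

```shell
# A trimmed /api/v1/targets response, inlined for a self-contained example
sample='{"status":"success","data":{"activeTargets":[
  {"labels":{"job":"node","instance":"my-server"},
   "scrapeUrl":"http://localhost:9100/metrics","health":"up"}]}}'

# Print each target's job and health
echo "$sample" | python3 -c '
import json, sys
for t in json.load(sys.stdin)["data"]["activeTargets"]:
    print(t["labels"]["job"], t["health"])'
# → node up
```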
PromQL Basics
PromQL (Prometheus Query Language) is how you ask questions about your metrics. You will use it in Prometheus's web UI and in Grafana dashboards.
Simple Queries
# Current CPU usage across all modes
node_cpu_seconds_total
# Filter by label: only idle CPU time
node_cpu_seconds_total{mode="idle"}
# Current memory available in bytes
node_memory_MemAvailable_bytes
# Filesystem usage
node_filesystem_avail_bytes{mountpoint="/"}
Rate and Aggregation
# CPU usage rate over last 5 minutes (per second)
rate(node_cpu_seconds_total{mode="idle"}[5m])
# Total CPU utilization percentage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Disk I/O rate (bytes per second read)
rate(node_disk_read_bytes_total[5m])
# Network traffic rate (bytes per second received)
rate(node_network_receive_bytes_total{device="eth0"}[5m])
# Memory usage percentage
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
Try It in Prometheus UI
Navigate to http://localhost:9090/graph, type a query, and click "Execute." Switch to the "Graph" tab to see the metric over time.
# Good first query -- see memory usage as a percentage
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
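To see what that query computes, plug in the free -m numbers from the start of the chapter (7964 MB total, 5425 MB available):

```shell
# (1 - MemAvailable / MemTotal) * 100, with the earlier free -m values
awk 'BEGIN { printf "memory used: %.1f%%\n", (1 - 5425/7964) * 100 }'
# → memory used: 31.9%
```

Prometheus evaluates exactly this arithmetic, but against the live metric values at every point on the graph.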
Installing Grafana
Grafana provides beautiful, interactive dashboards for your metrics.
On Debian/Ubuntu
$ sudo apt-get install -y apt-transport-https software-properties-common
$ wget -q -O /usr/share/keyrings/grafana.key https://apt.grafana.com/gpg.key
$ echo "deb [signed-by=/usr/share/keyrings/grafana.key] https://apt.grafana.com stable main" \
| sudo tee /etc/apt/sources.list.d/grafana.list
$ sudo apt-get update
$ sudo apt-get install -y grafana
$ sudo systemctl enable --now grafana-server
On RHEL/Fedora
$ sudo tee /etc/yum.repos.d/grafana.repo << 'EOF'
[grafana]
name=grafana
baseurl=https://rpm.grafana.com
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://rpm.grafana.com/gpg.key
EOF
$ sudo dnf install -y grafana
$ sudo systemctl enable --now grafana-server
Grafana runs on port 3000. Default credentials: admin / admin (you will be forced to change the password on first login).
Hands-On: Connect Grafana to Prometheus
Step 1: Open Grafana at http://localhost:3000 and log in.
Step 2: Add Prometheus as a data source:
- Navigate to Configuration (gear icon) > Data Sources > Add data source
- Select "Prometheus"
- Set URL to http://localhost:9090
- Click "Save & Test" -- you should see "Data source is working"
Step 3: Import a pre-built dashboard for node_exporter:
- Navigate to Dashboards > Import
- Enter dashboard ID: 1860 (Node Exporter Full)
- Select your Prometheus data source
- Click "Import"
You should immediately see a dashboard with CPU usage, memory, disk I/O, network traffic, and dozens of other system metrics, with data going back to when you started Prometheus.
Step 4: Create a custom dashboard panel:
- Click "+ New dashboard" > "Add visualization"
- Select your Prometheus data source
- In the query field, enter:
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
- Set the panel title to "Memory Usage %"
- Under Standard options, set Unit to "Percent (0-100)"
- Click "Apply"
You now have a live memory usage graph.
Think About It: You want to monitor 50 servers with Prometheus and Grafana. Would you install Prometheus on each server, or have one central Prometheus instance scrape all 50? What are the trade-offs?
Alertmanager: Sending Alerts
Monitoring without alerting means someone has to watch dashboards all day. Alertmanager sends notifications when metrics cross thresholds.
Install Alertmanager
$ wget https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz
$ tar xzf alertmanager-0.27.0.linux-amd64.tar.gz
$ sudo mv alertmanager-0.27.0.linux-amd64/alertmanager /usr/local/bin/
$ sudo mv alertmanager-0.27.0.linux-amd64/amtool /usr/local/bin/
$ sudo mkdir -p /etc/alertmanager
$ sudo useradd --no-create-home --shell /usr/sbin/nologin alertmanager
Configure Alertmanager
$ sudo tee /etc/alertmanager/alertmanager.yml << 'EOF'
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity']
  group_wait: 10s        # Wait before sending first notification
  group_interval: 10m    # Wait between notifications for same group
  repeat_interval: 3h    # Re-send if alert still firing
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://localhost:5001/alert'     # Example webhook endpoint
  - name: 'critical-alerts'
    webhook_configs:
      - url: 'http://localhost:5001/critical'
EOF
For email notifications, configure the email_configs receiver:
receivers:
  - name: 'email-alerts'
    email_configs:
      - to: 'ops-team@example.com'
        from: 'prometheus@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'prometheus@example.com'
        auth_password: 'smtp-password'
Create a systemd Service for Alertmanager
$ sudo tee /etc/systemd/system/alertmanager.service << 'EOF'
[Unit]
Description=Prometheus Alertmanager
After=network-online.target
[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/usr/local/bin/alertmanager \
--config.file=/etc/alertmanager/alertmanager.yml \
--storage.path=/var/lib/alertmanager/
[Install]
WantedBy=multi-user.target
EOF
$ sudo mkdir -p /var/lib/alertmanager
$ sudo chown alertmanager:alertmanager /var/lib/alertmanager
$ sudo systemctl daemon-reload
$ sudo systemctl enable --now alertmanager
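With Alertmanager up, you can check the notification pipeline without waiting for a real problem. The sketch below builds a synthetic alert (the label values are illustrative) and validates the JSON locally; the commented curl shows how you would POST it to Alertmanager's v2 API:

```shell
# A minimal alert in the shape Alertmanager's v2 API accepts:
# a JSON array of objects with "labels" and "annotations".
payload='[{
  "labels":      {"alertname": "SyntheticTest", "severity": "warning"},
  "annotations": {"summary": "Manually fired test alert"}
}]'

# Validate the JSON before sending it anywhere
echo "$payload" | python3 -m json.tool > /dev/null && echo "payload OK"

# With Alertmanager running, fire it for real:
#   curl -s -XPOST -H 'Content-Type: application/json' \
#        -d "$payload" http://localhost:9093/api/v2/alerts
```

The alert should then show up in the Alertmanager UI (port 9093) and be routed to your configured receiver.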
Defining Alert Rules
Alert rules are defined in Prometheus, not Alertmanager. Prometheus evaluates rules and sends firing alerts to Alertmanager.
Create Alert Rules
$ sudo tee /etc/prometheus/alert_rules.yml << 'EOF'
groups:
  - name: system_alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage is above 80% for more than 5 minutes (current: {{ $value }}%)"

      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage detected"
          description: "Memory usage is above 85% (current: {{ $value }}%)"

      - alert: DiskSpaceRunningLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Disk space running low"
          description: "Root filesystem has less than 15% free space (current: {{ $value }}%)"

      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.job }} target {{ $labels.instance }} has been unreachable for 1 minute"

      - alert: HighNetworkTraffic
        expr: rate(node_network_receive_bytes_total{device="eth0"}[5m]) > 100000000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High inbound network traffic"
          description: "Network receive rate exceeds 100MB/s for 5 minutes"
EOF
Update Prometheus Configuration
Add the alert rules file and Alertmanager connection to prometheus.yml:
$ sudo tee /etc/prometheus/prometheus.yml << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
        labels:
          instance: 'my-server'
          environment: 'production'
EOF
$ sudo chown prometheus:prometheus /etc/prometheus/alert_rules.yml
$ sudo systemctl restart prometheus
Verify Alert Rules
# Check rule syntax
$ promtool check rules /etc/prometheus/alert_rules.yml
Checking /etc/prometheus/alert_rules.yml
SUCCESS: 5 rules found
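Beyond syntax checking, promtool can unit-test rules against synthetic data. A sketch (the file name and sample values are illustrative) that feeds three failing up samples to the InstanceDown rule and checks that it fires:

```yaml
# alert_rules_test.yml -- run with: promtool test rules alert_rules_test.yml
rule_files:
  - alert_rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'up{job="node", instance="localhost:9100"}'
        values: '0 0 0'          # down for three consecutive evaluations
    alert_rule_test:
      - eval_time: 2m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node
              instance: localhost:9100
            exp_annotations:
              summary: "Instance localhost:9100 is down"
              description: "node target localhost:9100 has been unreachable for 1 minute"
```

This catches broken expressions and template typos before they reach production.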
Visit http://localhost:9090/alerts to see alert states: inactive (green), pending (yellow), or firing (red).
The Complete Stack: How It All Fits Together
┌──────────────────────────────────────────────────────────────┐
│ COMPLETE MONITORING STACK │
│ │
│ YOUR SERVERS │
│ ┌──────────────────────┐ ┌──────────────────────┐ │
│ │ Server 1 │ │ Server 2 │ │
│ │ ┌──────────────────┐ │ │ ┌──────────────────┐ │ │
│ │ │ node_exporter │ │ │ │ node_exporter │ │ │
│ │ │ :9100 │ │ │ │ :9100 │ │ │
│ │ └──────────────────┘ │ │ └──────────────────┘ │ │
│ └──────────┬───────────┘ └──────────┬───────────┘ │
│ │ scrape every 15s │ │
│ ▼ ▼ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Prometheus :9090 │ │
│ │ • Scrapes targets │ │
│ │ • Stores metrics in TSDB │ │
│ │ • Evaluates alert rules │ │
│ │ • Serves PromQL queries │ │
│ └────────┬─────────────────────┬────────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌────────────────┐ ┌────────────────┐ │
│ │ Alertmanager │ │ Grafana │ │
│ │ :9093 │ │ :3000 │ │
│ │ │ │ │ │
│ │ Routes alerts │ │ Dashboards │ │
│ │ to receivers │ │ from PromQL │ │
│ └───────┬────────┘ └────────────────┘ │
│ │ │
│ ▼ │
│ Email, Slack, Webhook, PagerDuty, etc. │
│ │
└──────────────────────────────────────────────────────────────┘
Debug This
Your monitoring stack is set up, but Prometheus shows node target as "DOWN" even though node_exporter is running. The error message says "connection refused."
$ sudo systemctl status node_exporter
● node_exporter.service - Prometheus Node Exporter
Active: active (running)
$ curl http://localhost:9100/metrics
curl: (7) Failed to connect to localhost port 9100: Connection refused
What is going on?
Diagnosis steps:
# Check what port node_exporter is actually listening on
$ ss -tlnp | grep node_exporter
# Check if a firewall is blocking the port
$ sudo iptables -L -n | grep 9100
$ sudo ufw status
# Check if node_exporter is bound to a specific interface
$ journalctl -u node_exporter -n 20
Common causes:
- node_exporter is bound to one specific interface rather than all of them (e.g., started with --web.listen-address=192.0.2.10:9100, which makes connections to localhost fail)
- A firewall is blocking port 9100
- Another service grabbed port 9100 first -- check journalctl -u node_exporter for a "bind: address already in use" error
- SELinux or AppArmor is blocking the connection
What Just Happened?
┌──────────────────────────────────────────────────────────────┐
│ CHAPTER 70 RECAP │
│──────────────────────────────────────────────────────────────│
│ │
│ Monitoring = metrics + logs + traces │
│ │
│ The Prometheus Stack: │
│ • node_exporter: exposes system metrics on :9100 │
│ • Prometheus: scrapes targets, stores TSDB, evaluates rules │
│ • Grafana: visualizes metrics in dashboards │
│ • Alertmanager: routes alerts to email, Slack, etc. │
│ │
│ PromQL essentials: │
│ • rate() for per-second rates of counters │
│ • avg(), sum(), max() for aggregation │
│ • Label filters: metric{label="value"} │
│ │
│ Alert rules: │
│ • Defined in Prometheus config │
│ • expr: PromQL expression that triggers the alert │
│ • for: how long the condition must be true │
│ • Routed by Alertmanager to notification channels │
│ │
│ Key metrics to monitor: │
│ • CPU usage, memory usage, disk space, disk I/O │
│ • Network traffic, service availability (up/down) │
│ │
└──────────────────────────────────────────────────────────────┘
Try This
Exercise 1: PromQL Practice
With Prometheus running, write PromQL queries for:
- Total number of CPU cores
- Current disk I/O write rate for all devices
- Total network bytes transmitted in the last hour
- System uptime in days
Exercise 2: Custom Dashboard
Create a Grafana dashboard with four panels:
- CPU usage over time (line graph)
- Memory usage gauge (current %)
- Disk space for all mount points (bar chart)
- Network traffic in/out (dual line graph)
Exercise 3: Alert Testing
Create an alert rule that fires when filesystem usage exceeds 50% (a threshold low enough to test easily). Verify it appears in the Prometheus alerts page. Try to trigger it by creating a large temporary file.
Exercise 4: Monitor a Service
Install nginx and configure Prometheus to scrape nginx metrics. You will need the nginx-prometheus-exporter. Set up an alert for when nginx is down.
Bonus Challenge
Set up monitoring for multiple machines. Install node_exporter on a second machine (or VM or container), add it to your Prometheus config, and create a Grafana dashboard that compares CPU and memory usage across both machines side by side.