awk: Pattern Scanning & Reporting
Why This Matters
You are staring at a log file with millions of lines. You need to know: what is the
average response time for requests to the /api/orders endpoint? Or you have a CSV
file with sales data and you need to sum the revenue column grouped by region. Or you
need to reformat the output of ps aux to show only the top memory consumers with
their process names and memory percentages.
grep can find lines. sed can transform text. But when you need to compute,
restructure, or report on structured data, you need awk.
awk is a pattern-scanning and processing language. It automatically splits every line
into fields, has variables for tracking state across lines, supports arithmetic, and has
built-in constructs for conditional logic and loops. It sits right at the boundary
between a Unix utility and a programming language -- and that is exactly what makes it
so powerful for data processing.
Try This Right Now
# Print the 1st and 3rd fields of /etc/passwd (username and UID)
awk -F: '{print $1, $3}' /etc/passwd | head -5
# Sum a column of numbers
echo -e "10\n20\n30\n40" | awk '{sum += $1} END {print "Total:", sum}'
# Find processes using more than 1% memory
ps aux | awk '$4 > 1.0 {print $4"%", $11}'
# Count lines in a file (like wc -l)
awk 'END {print NR}' /etc/passwd
# Print lines longer than 80 characters
awk 'length > 80' /etc/services | head -5
The awk Program Structure
Every awk program follows this pattern:
pattern { action }
- pattern -- a condition that selects which lines to process
- action -- what to do with the selected lines (enclosed in { })
If you omit the pattern, the action applies to every line. If you omit the action, the default action is to print the line.
# Pattern only (print matching lines)
awk '/ERROR/' logfile
# Action only (applies to every line)
awk '{print $1}' logfile
# Both pattern and action
awk '/ERROR/ {print $1, $4}' logfile
The Three Sections: BEGIN, Main, END
awk '
BEGIN { ... } # Runs once, before processing any input
/pattern/ { ... } # Runs for each matching input line
END { ... } # Runs once, after all input is processed
'
+--------------------------------------------------------------+
|                                                              |
|  BEGIN { setup code }       <--- Runs once before input      |
|         |                                                    |
|         v                                                    |
|  +-->[ Read line ]          <--- Main loop: for each line    |
|  |      |                                                    |
|  |   pattern { action }                                      |
|  |   pattern { action }                                      |
|  |      |                                                    |
|  +--yes-- more lines?                                        |
|         |                                                    |
|         v  (no more lines)                                   |
|  END { cleanup code }       <--- Runs once after all input   |
|                                                              |
+--------------------------------------------------------------+
Example:
awk '
BEGIN { print "=== User Report ===" }
/\/bin\/bash$/ { print $1 }
END { print "=== End ===" }
' FS=: /etc/passwd
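To watch all three sections cooperate on numeric input, here is a minimal sketch that uses seq to generate the numbers 1 through 10 and averages them:

```shell
# BEGIN prints a header, the main rule accumulates,
# and END divides the running sum by the record count.
seq 1 10 | awk '
BEGIN { print "=== Average Report ===" }
      { sum += $1 }
END   { printf "Average of %d values: %.1f\n", NR, sum / NR }
'
# === Average Report ===
# Average of 10 values: 5.5
```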
Fields: $1, $2, $NF
awk automatically splits each input line into fields. By default, the delimiter is whitespace (spaces and tabs).
| Symbol | Meaning |
|---|---|
| $0 | The entire current line |
| $1 | First field |
| $2 | Second field |
| $NF | Last field |
| $(NF-1) | Second-to-last field |
| NF | Number of fields on this line |
| NR | Current line number (record number) |
# Sample: ps output
ps aux | head -5 | awk '{print "PID:", $2, " CMD:", $11}'
# Print the last field of each line
echo -e "one two three\nfour five six" | awk '{print $NF}'
# three
# six
# Print line number and line content
awk '{print NR": "$0}' /etc/hostname
Changing the Field Separator
Use -F to set a custom field separator:
# Parse /etc/passwd (colon-separated)
awk -F: '{print "User:", $1, " Shell:", $7}' /etc/passwd | head -5
# Parse CSV
echo "Alice,30,Engineering" | awk -F, '{print $1, "is in", $3}'
# Alice is in Engineering
# Equals sign as separator (key=value pairs)
echo "key=value" | awk -F= '{print "Key:", $1, "Value:", $2}'
You can also set FS in the BEGIN block:
awk 'BEGIN {FS=":"} {print $1, $3}' /etc/passwd | head -5
Built-In Variables
| Variable | Meaning | Default |
|---|---|---|
| FS | Input field separator | whitespace |
| OFS | Output field separator | space |
| RS | Input record separator | newline |
| ORS | Output record separator | newline |
| NR | Current record number (across all files) | -- |
| NF | Number of fields in current record | -- |
| FNR | Record number in current file | -- |
| FILENAME | Current input filename | -- |
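RS is changed less often than FS, but setting it to the empty string switches awk into paragraph mode: blank lines separate records, and newlines act as field separators too. A minimal sketch:

```shell
# Paragraph mode: each blank-line-separated block is one record
printf 'name Alice\nage 30\n\nname Bob\nage 25\n' |
awk -v RS='' '{print "Record " NR ":", $2, "(" $4 " years)"}'
# Record 1: Alice (30 years)
# Record 2: Bob (25 years)
```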
OFS: Output Field Separator
When you use a comma in print, awk inserts the OFS between fields:
# Default OFS is space
awk -F: '{print $1, $3}' /etc/passwd | head -3
# root 0
# daemon 1
# bin 2
# Set OFS to tab
awk -F: -v OFS='\t' '{print $1, $3}' /etc/passwd | head -3
# root 0
# daemon 1
# bin 2
# Set OFS to comma (create CSV)
awk -F: -v OFS=',' '{print $1, $3, $7}' /etc/passwd | head -3
# root,0,/bin/bash
# daemon,1,/usr/sbin/nologin
# bin,2,/usr/sbin/nologin
Think About It: What is the difference between print $1, $2 (with comma) and print $1 $2 (without comma)? Try both and observe the output.
NR and FNR
# Print line numbers
awk '{print NR, $0}' /etc/hostname
# Skip the header row (line 1)
awk 'NR > 1 {print}' data.csv
# Print specific lines
awk 'NR >= 5 && NR <= 10' /etc/passwd
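The difference between NR and FNR only shows up with multiple input files: NR keeps counting across files, while FNR restarts at 1 for each one. A quick demonstration with two throwaway files (the /tmp paths are just for illustration):

```shell
printf 'a\nb\n' > /tmp/f1.txt
printf 'c\n'    > /tmp/f2.txt
awk '{print FILENAME, "NR=" NR, "FNR=" FNR}' /tmp/f1.txt /tmp/f2.txt
# /tmp/f1.txt NR=1 FNR=1
# /tmp/f1.txt NR=2 FNR=2
# /tmp/f2.txt NR=3 FNR=1
```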
Patterns: Selecting Lines
Regular Expression Patterns
# Lines matching a regex
awk '/^root/' /etc/passwd
# Lines NOT matching a regex
awk '!/^#/' /etc/ssh/sshd_config
# Field-specific regex
awk -F: '$7 ~ /bash/' /etc/passwd
# Lines where field 7 contains "bash"
awk -F: '$7 !~ /nologin/' /etc/passwd
# Lines where field 7 does NOT contain "nologin"
Comparison Patterns
# Numeric comparisons
awk -F: '$3 >= 1000' /etc/passwd
# Users with UID >= 1000 (regular users)
awk -F: '$3 == 0' /etc/passwd
# Users with UID 0 (root)
# String comparisons
awk -F: '$1 == "root"' /etc/passwd
Range Patterns
# Print lines between two patterns (inclusive)
awk '/START/,/END/' file
# Print lines between line 5 and line 10
awk 'NR==5, NR==10' file
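Range patterns are inclusive on both ends, which is easy to verify with a few generated lines:

```shell
# Both the START line and the END line are printed
printf 'before\nSTART\nmiddle\nEND\nafter\n' | awk '/START/,/END/'
# START
# middle
# END
```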
Compound Patterns
# AND
awk -F: '$3 >= 1000 && $7 ~ /bash/' /etc/passwd
# OR
awk '/ERROR/ || /FATAL/' logfile
# NOT
awk '!/^#/ && !/^$/' config.file
printf: Formatted Output
print is convenient, but printf gives you precise control over formatting:
# Basic printf (no automatic newline!)
awk '{printf "%-20s %5d\n", $1, $3}' FS=: /etc/passwd | head -5
Output:
root 0
daemon 1
bin 2
sys 3
sync 4
Format Specifiers
| Format | Meaning |
|---|---|
| %s | String |
| %d | Integer |
| %f | Floating point |
| %e | Scientific notation |
| %x | Hexadecimal |
| %o | Octal |
| %% | Literal percent sign |
Width and Alignment
| Modifier | Meaning |
|---|---|
| %10s | Right-aligned, 10 chars wide |
| %-10s | Left-aligned, 10 chars wide |
| %05d | Zero-padded, 5 digits |
| %.2f | Float with 2 decimal places |
# Formatted table output
awk -F: '
BEGIN { printf "%-15s %6s %s\n", "USERNAME", "UID", "SHELL" }
$3 >= 1000 {
printf "%-15s %6d %s\n", $1, $3, $7
}
' /etc/passwd
Output:
USERNAME UID SHELL
nobody 65534 /usr/sbin/nologin
user1 1000 /bin/bash
user2 1001 /bin/zsh
Conditionals and Loops in awk
awk supports full programming constructs.
if-else
awk -F: '{
if ($3 == 0) {
print $1, "is root"
} else if ($3 < 1000) {
print $1, "is a system account"
} else {
print $1, "is a regular user"
}
}' /etc/passwd
Ternary Operator
awk -F: '{
type = ($3 < 1000) ? "system" : "regular"
print $1, type
}' /etc/passwd | head -5
for Loop
# Print each field on its own line
echo "one two three four five" | awk '{
for (i = 1; i <= NF; i++) {
print "Field", i":", $i
}
}'
while Loop
# Factorial calculator
echo "5" | awk '{
n = $1
result = 1
while (n > 1) {
result *= n
n--
}
print $1"! =", result
}'
# 5! = 120
Hands-On: Practical awk Examples
Setup
cat > /tmp/sales.csv << 'CSV'
Region,Product,Quantity,Price
North,Widget,100,9.99
South,Widget,150,9.99
East,Gadget,200,19.99
West,Widget,75,9.99
North,Gadget,120,19.99
South,Gadget,180,19.99
East,Widget,90,9.99
West,Gadget,60,19.99
North,Doohickey,50,29.99
South,Doohickey,80,29.99
CSV
Example 1: Total Revenue
awk -F, 'NR > 1 {
revenue = $3 * $4
total += revenue
}
END {
printf "Total Revenue: $%.2f\n", total
}' /tmp/sales.csv
Example 2: Revenue by Region
awk -F, 'NR > 1 {
region_rev[$1] += $3 * $4
}
END {
for (region in region_rev) {
printf "%-10s $%10.2f\n", region, region_rev[region]
}
}' /tmp/sales.csv
Example 3: Revenue by Product
awk -F, 'NR > 1 {
prod_qty[$2] += $3
prod_rev[$2] += $3 * $4
}
END {
printf "%-12s %8s %12s\n", "Product", "Qty", "Revenue"
printf "%-12s %8s %12s\n", "-------", "---", "-------"
for (p in prod_qty) {
printf "%-12s %8d $%10.2f\n", p, prod_qty[p], prod_rev[p]
}
}' /tmp/sales.csv
Example 4: Parse ps Output for Top Memory Users
ps aux | awk 'NR > 1 {
mem[$11] += $4
}
END {
for (proc in mem) {
if (mem[proc] > 0.5) {
printf "%6.1f%% %s\n", mem[proc], proc
}
}
}' | sort -rn | head -10
Example 5: Log Analysis
cat > /tmp/access.log << 'LOG'
10.0.0.1 - - [10/Mar/2025:14:00:01] "GET /api/users HTTP/1.1" 200 1234 0.045
10.0.0.2 - - [10/Mar/2025:14:00:02] "POST /api/orders HTTP/1.1" 201 567 0.230
10.0.0.1 - - [10/Mar/2025:14:00:03] "GET /api/users HTTP/1.1" 200 1234 0.038
10.0.0.3 - - [10/Mar/2025:14:00:04] "GET /api/products HTTP/1.1" 200 8901 0.120
10.0.0.2 - - [10/Mar/2025:14:00:05] "GET /api/users HTTP/1.1" 200 1234 0.042
10.0.0.1 - - [10/Mar/2025:14:00:06] "POST /api/orders HTTP/1.1" 500 234 1.500
10.0.0.4 - - [10/Mar/2025:14:00:07] "GET /api/products HTTP/1.1" 200 8901 0.115
10.0.0.1 - - [10/Mar/2025:14:00:08] "GET /health HTTP/1.1" 200 2 0.001
LOG
# Average response time per endpoint
awk '{
endpoint = $6    # the request path is field 6 in this log format
time = $NF
count[endpoint]++
total[endpoint] += time
}
END {
printf "%-20s %8s %10s\n", "Endpoint", "Requests", "Avg Time"
printf "%-20s %8s %10s\n", "--------", "--------", "--------"
for (ep in count) {
printf "%-20s %8d %10.3fs\n", ep, count[ep], total[ep]/count[ep]
}
}' /tmp/access.log
Example 6: Status Code Summary
awk '{
codes[$8]++    # the status code is field 8 in this log format
}
END {
for (code in codes) {
printf "HTTP %s: %d requests\n", code, codes[code]
}
}' /tmp/access.log | sort
Associative Arrays
awk has built-in associative arrays (similar to dictionaries/hash maps). You have already seen them in the examples above. Here are the details:
# Arrays are created by use
awk 'BEGIN {
fruits["apple"] = 5
fruits["banana"] = 3
fruits["cherry"] = 8
# Iterate over keys
for (key in fruits) {
print key, fruits[key]
}
# Check if key exists
if ("apple" in fruits) {
print "We have apples!"
}
# Delete an element
delete fruits["banana"]
# Length of array (GNU awk)
print "Items:", length(fruits)
}'
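One caveat worth knowing: for (key in array) visits keys in an unspecified order. The portable fix is to pipe the output through sort; GNU awk also offers PROCINFO["sorted_in"] for in-program ordering. A sketch:

```shell
# Iteration order is unspecified, so sort the output for stable results
awk 'BEGIN {
    count["banana"] = 3; count["apple"] = 5; count["cherry"] = 8
    for (k in count) print k, count[k]
}' | sort
# apple 5
# banana 3
# cherry 8
```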
Counting Pattern: The Most Common Use
# Count words in a file
awk '{
for (i = 1; i <= NF; i++) {
words[tolower($i)]++
}
}
END {
for (w in words) {
printf "%5d %s\n", words[w], w
}
}' /tmp/sed-lab.txt 2>/dev/null | sort -rn | head -10
Useful Built-In Functions
String Functions
# length() -- string length
echo "hello" | awk '{print length($0)}' # 5
# substr() -- substring
echo "Hello World" | awk '{print substr($0, 7)}' # World
echo "Hello World" | awk '{print substr($0, 1, 5)}' # Hello
# index() -- find substring position
echo "Hello World" | awk '{print index($0, "World")}' # 7
# split() -- split string into array
echo "a:b:c:d" | awk '{n = split($0, arr, ":"); for(i=1;i<=n;i++) print arr[i]}'
# toupper() / tolower()
echo "Hello World" | awk '{print toupper($0)}' # HELLO WORLD
echo "Hello World" | awk '{print tolower($0)}' # hello world
# gsub() -- global substitution (returns count of replacements)
echo "aabaa" | awk '{gsub(/a/, "X"); print}' # XXbXX
# sub() -- substitute first occurrence only
echo "aabaa" | awk '{sub(/a/, "X"); print}' # Xabaa
# match() -- regex match (sets RSTART and RLENGTH)
echo "Error at line 42" | awk '{match($0, /[0-9]+/); print substr($0, RSTART, RLENGTH)}'
# 42
Numeric Functions
awk 'BEGIN {
print int(3.9) # 3
print sqrt(144) # 12
print log(2.718) # ~1
print sin(3.14159) # ~0
srand() # seed the random number generator (call before rand)
print rand() # random number in [0, 1)
}'
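rand() returns the same sequence on every run unless you seed it, so call srand() first. Seeding with a fixed value gives repeatable results within one awk implementation, though the exact sequence differs between implementations. A die-roll sketch:

```shell
# Roll a six-sided die: rand() is in [0, 1), so scale and truncate
awk 'BEGIN { srand(42); print int(rand() * 6) + 1 }'
```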
Think About It: Why does awk use gsub and sub for substitution instead of the s/// syntax like sed? Think about awk's design as a programming language versus sed's design as a stream editor.
Debug This: awk Not Splitting Fields Correctly
You parse a CSV file and the fields seem wrong:
echo 'Alice,"New York, NY",30' | awk -F, '{print "Name:", $1, "City:", $2}'
# Name: Alice City: "New York
Problem: awk's -F, does not handle quoted CSV fields. The comma inside the quotes
is treated as a field separator.
Solutions:
- Use FPAT (GNU awk) to define what a field looks like instead of what separates fields:
echo 'Alice,"New York, NY",30' | awk -v FPAT='([^,]*)|("[^"]*")' '{
print "Name:", $1
print "City:", $2
print "Age:", $3
}'
- For serious CSV work, use a dedicated CSV tool like csvtool, mlr (Miller), or python -m csv.
What Just Happened?
+------------------------------------------------------------------+
| CHAPTER 22 RECAP |
+------------------------------------------------------------------+
| |
| - awk structure: pattern { action } |
| - Fields: $1, $2, ..., $NF (automatic splitting) |
| - -F sets the field separator |
| - BEGIN runs before input; END runs after all input |
| - NR = line number, NF = number of fields |
| - printf for formatted output (%-10s, %6d, %.2f) |
| - Associative arrays for counting and grouping |
| - Built-in: length, substr, split, gsub, toupper, tolower |
| - Comparisons: $3 > 100, $1 == "root", $7 ~ /bash/ |
| - awk is ideal for: column extraction, aggregation, |
| reformatting structured text, and simple reporting |
| |
+------------------------------------------------------------------+
Try This
Exercise 1: System Report
Write an awk command that parses df -h output and prints only filesystems that are
more than 50% full, formatted as a clean table.
Exercise 2: CSV Analysis
Using /tmp/sales.csv, write awk commands to:
- Find the region with the highest total revenue
- Find the product with the highest average price
- Generate a formatted report with headers, data, and totals
Exercise 3: Log Parser
Using /tmp/access.log, write an awk program that:
- Counts requests per IP address
- Identifies the slowest request (highest response time)
- Calculates the total bytes transferred
- Reports the percentage of 5xx errors
Exercise 4: /etc/passwd Analysis
Using awk, produce a report showing:
- Total number of users
- Number of users with bash as their shell
- Number of system accounts (UID < 1000)
- Number of regular accounts (UID >= 1000)
- The user with the highest UID
Bonus Challenge
Write an awk program that reads /etc/passwd and generates a properly formatted HTML
table with columns for Username, UID, GID, Home Directory, and Shell. Include a header
row and alternating row colors using inline CSS.