awk: Pattern Scanning & Reporting

Why This Matters

You are staring at a log file with millions of lines. You need to know: what is the average response time for requests to the /api/orders endpoint? Or you have a CSV file with sales data and you need to sum the revenue column grouped by region. Or you need to reformat the output of ps aux to show only the top memory consumers with their process names and memory percentages.

grep can find lines. sed can transform text. But when you need to compute, restructure, or report on structured data, you need awk.

awk is a pattern-scanning and processing language. It automatically splits every line into fields, has variables for tracking state across lines, supports arithmetic, and has built-in constructs for conditional logic and loops. It sits right at the boundary between a Unix utility and a programming language -- and that is exactly what makes it so powerful for data processing.


Try This Right Now

# Print the 1st and 3rd fields of /etc/passwd (username and UID)
awk -F: '{print $1, $3}' /etc/passwd | head -5

# Sum a column of numbers
echo -e "10\n20\n30\n40" | awk '{sum += $1} END {print "Total:", sum}'

# Find processes using more than 1% memory
ps aux | awk '$4 > 1.0 {print $4"%", $11}'

# Count lines in a file (like wc -l)
awk 'END {print NR}' /etc/passwd

# Print lines longer than 80 characters
awk 'length > 80' /etc/services | head -5

The awk Program Structure

Every awk program follows this pattern:

pattern { action }
  • pattern -- a condition that selects which lines to process
  • action -- what to do with the selected lines (enclosed in { })

If you omit the pattern, the action applies to every line. If you omit the action, the default action is to print the line.

# Pattern only (print matching lines)
awk '/ERROR/' logfile

# Action only (applies to every line)
awk '{print $1}' logfile

# Both pattern and action
awk '/ERROR/ {print $1, $4}' logfile

The Three Sections: BEGIN, Main, END

awk '
    BEGIN { ... }      # Runs once, before processing any input
    /pattern/ { ... }  # Runs for each matching input line
    END { ... }        # Runs once, after all input is processed
'
+------------------------------------------------------------------+
|                                                                  |
|  BEGIN { setup code }      <-- runs once before any input        |
|          |                                                       |
|          v                                                       |
|  +--> [ read next line ]                                         |
|  |     pattern { action }   <-- main loop: each pattern is       |
|  |     pattern { action }       tested against each line         |
|  |          |                                                    |
|  +---- more lines? (yes: loop)                                   |
|             |                                                    |
|             v  (no more input)                                   |
|  END { cleanup code }      <-- runs once after all input         |
|                                                                  |
+------------------------------------------------------------------+

Example:

awk '
    BEGIN { print "=== User Report ===" }
    /\/bin\/bash$/ { print $1 }
    END { print "=== End ===" }
' FS=: /etc/passwd

Fields: $1, $2, $NF

awk automatically splits each input line into fields. By default, the delimiter is whitespace (spaces and tabs).

Symbol      Meaning
$0          The entire current line
$1          First field
$2          Second field
$NF         Last field
$(NF-1)     Second-to-last field
NF          Number of fields on this line
NR          Current line number (record number)

# Sample: ps output
ps aux | head -5 | awk '{print "PID:", $2, "  CMD:", $11}'

# Print the last field of each line
echo -e "one two three\nfour five six" | awk '{print $NF}'
# three
# six

# Print line number and line content
awk '{print NR": "$0}' /etc/hostname

Changing the Field Separator

Use -F to set a custom field separator:

# Parse /etc/passwd (colon-separated)
awk -F: '{print "User:", $1, "  Shell:", $7}' /etc/passwd | head -5

# Parse CSV
echo "Alice,30,Engineering" | awk -F, '{print $1, "is in", $3}'
# Alice is in Engineering

# Use '=' as the field separator
echo "key=value" | awk -F= '{print "Key:", $1, "Value:", $2}'

You can also set FS in the BEGIN block:

awk 'BEGIN {FS=":"} {print $1, $3}' /etc/passwd | head -5

Built-In Variables

Variable    Meaning                                      Default
FS          Input field separator                        whitespace
OFS         Output field separator                       space
RS          Input record separator                       newline
ORS         Output record separator                      newline
NR          Current record number (across all files)     --
NF          Number of fields in current record           --
FNR         Record number in current file                --
FILENAME    Current input filename                       --

OFS: Output Field Separator

When you use a comma in print, awk inserts the OFS between fields:

# Default OFS is space
awk -F: '{print $1, $3}' /etc/passwd | head -3
# root 0
# daemon 1
# bin 2

# Set OFS to tab
awk -F: -v OFS='\t' '{print $1, $3}' /etc/passwd | head -3
# root	0
# daemon	1
# bin	2

# Set OFS to comma (create CSV)
awk -F: -v OFS=',' '{print $1, $3, $7}' /etc/passwd | head -3
# root,0,/bin/bash
# daemon,1,/usr/sbin/nologin
# bin,2,/usr/sbin/nologin
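One subtlety worth a quick demo: OFS only takes effect on $0 when awk rebuilds the record, which happens when a field is assigned. This is standard awk behavior, and it is why you often see the "$1 = $1" idiom:

```shell
# Setting OFS does not rewrite $0 by itself; awk only rebuilds the
# record from its fields when a field is assigned. Assigning a field
# to itself ("$1 = $1") forces that rebuild.
echo "a b c" | awk -v OFS=',' '{print $0}'
# a b c          <- unchanged: no field was assigned
echo "a b c" | awk -v OFS=',' '{$1 = $1; print $0}'
# a,b,c          <- assigning $1 rebuilt $0 with the new OFS
```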

Think About It: What is the difference between print $1, $2 (with comma) and print $1 $2 (without comma)? Try both and observe the output.

NR and FNR

# Print line numbers
awk '{print NR, $0}' /etc/hostname

# Skip the header row (line 1)
awk 'NR > 1 {print}' data.csv

# Print specific lines
awk 'NR >= 5 && NR <= 10' /etc/passwd
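The difference between NR and FNR is easiest to see with two input files. This sketch uses two scratch files (/tmp/f1 and /tmp/f2, created here just for the demo):

```shell
# NR counts records across every input file; FNR restarts at 1 for
# each new file. FILENAME names the file currently being read.
printf 'a\nb\n' > /tmp/f1
printf 'c\nd\n' > /tmp/f2
awk '{print FILENAME, "NR=" NR, "FNR=" FNR}' /tmp/f1 /tmp/f2
# /tmp/f1 NR=1 FNR=1
# /tmp/f1 NR=2 FNR=2
# /tmp/f2 NR=3 FNR=1
# /tmp/f2 NR=4 FNR=2
```

The classic use is NR == FNR, which is true only while awk is reading the first file -- handy for two-file lookups.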

Patterns: Selecting Lines

Regular Expression Patterns

# Lines matching a regex
awk '/^root/' /etc/passwd

# Lines NOT matching a regex
awk '!/^#/' /etc/ssh/sshd_config

# Field-specific regex
awk -F: '$7 ~ /bash/' /etc/passwd
# Lines where field 7 contains "bash"

awk -F: '$7 !~ /nologin/' /etc/passwd
# Lines where field 7 does NOT contain "nologin"

Comparison Patterns

# Numeric comparisons
awk -F: '$3 >= 1000' /etc/passwd
# Users with UID >= 1000 (regular users)

awk -F: '$3 == 0' /etc/passwd
# Users with UID 0 (root)

# String comparisons
awk -F: '$1 == "root"' /etc/passwd

Range Patterns

# Print lines between two patterns (inclusive)
awk '/START/,/END/' file

# Print lines between line 5 and line 10
awk 'NR==5, NR==10' file
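Note that a range pattern prints the delimiter lines themselves. If you want only the lines between the markers, one common sketch is to gate printing with a flag variable:

```shell
# Print lines strictly between START and END. Rule order matters:
# clear the flag on END *before* the print rule fires, and set it on
# START *after*, so neither marker line is printed.
printf 'a\nSTART\nb\nc\nEND\nd\n' | awk '/END/ {f=0} f {print} /START/ {f=1}'
# b
# c
```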

Compound Patterns

# AND
awk -F: '$3 >= 1000 && $7 ~ /bash/' /etc/passwd

# OR
awk '/ERROR/ || /FATAL/' logfile

# NOT
awk '!/^#/ && !/^$/' config.file

printf: Formatted Output

print is convenient, but printf gives you precise control over formatting:

# Basic printf (no automatic newline!)
awk '{printf "%-20s %5d\n", $1, $3}' FS=: /etc/passwd | head -5

Output:

root                     0
daemon                   1
bin                      2
sys                      3
sync                     4

Format Specifiers

Format    Meaning
%s        String
%d        Integer
%f        Floating point
%e        Scientific notation
%x        Hexadecimal
%o        Octal
%%        Literal percent sign

Width and Alignment

Modifier    Meaning
%10s        Right-aligned, 10 chars wide
%-10s       Left-aligned, 10 chars wide
%05d        Zero-padded, 5 digits
%.2f        Float with 2 decimal places

# Formatted table output
awk -F: '
    BEGIN { printf "%-15s %6s %s\n", "USERNAME", "UID", "SHELL" }
    $3 >= 1000 {
        printf "%-15s %6d %s\n", $1, $3, $7
    }
' /etc/passwd

Output:

USERNAME           UID SHELL
nobody           65534 /usr/sbin/nologin
user1             1000 /bin/bash
user2             1001 /bin/zsh

Conditionals and Loops in awk

awk supports full programming constructs.

if-else

awk -F: '{
    if ($3 == 0) {
        print $1, "is root"
    } else if ($3 < 1000) {
        print $1, "is a system account"
    } else {
        print $1, "is a regular user"
    }
}' /etc/passwd

Ternary Operator

awk -F: '{
    type = ($3 < 1000) ? "system" : "regular"
    print $1, type
}' /etc/passwd | head -5

for Loop

# Print each field on its own line
echo "one two three four five" | awk '{
    for (i = 1; i <= NF; i++) {
        print "Field", i":", $i
    }
}'

while Loop

# Factorial calculator
echo "5" | awk '{
    n = $1
    result = 1
    while (n > 1) {
        result *= n
        n--
    }
    print $1"! =", result
}'
# 5! = 120

Hands-On: Practical awk Examples

Setup

cat > /tmp/sales.csv << 'CSV'
Region,Product,Quantity,Price
North,Widget,100,9.99
South,Widget,150,9.99
East,Gadget,200,19.99
West,Widget,75,9.99
North,Gadget,120,19.99
South,Gadget,180,19.99
East,Widget,90,9.99
West,Gadget,60,19.99
North,Doohickey,50,29.99
South,Doohickey,80,29.99
CSV

Example 1: Total Revenue

awk -F, 'NR > 1 {
    revenue = $3 * $4
    total += revenue
}
END {
    printf "Total Revenue: $%.2f\n", total
}' /tmp/sales.csv

Example 2: Revenue by Region

awk -F, 'NR > 1 {
    region_rev[$1] += $3 * $4
}
END {
    for (region in region_rev) {
        printf "%-10s $%10.2f\n", region, region_rev[region]
    }
}' /tmp/sales.csv
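One caveat: "for (region in region_rev)" visits keys in an unspecified order, so the report above may come out in any order. A portable fix is to pipe the END output through sort (GNU awk also offers PROCINFO["sorted_in"]). This sketch uses a tiny scratch file, /tmp/mini.csv, made up here so it stands alone:

```shell
# Same grouping idea as Example 2, with the output sorted by region.
printf 'Region,Qty,Price\nNorth,2,5\nSouth,1,10\nNorth,1,5\n' > /tmp/mini.csv
awk -F, 'NR > 1 { rev[$1] += $2 * $3 }
END { for (r in rev) printf "%s %d\n", r, rev[r] }' /tmp/mini.csv | sort
# North 15
# South 10
```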

Example 3: Revenue by Product

awk -F, 'NR > 1 {
    prod_qty[$2] += $3
    prod_rev[$2] += $3 * $4
}
END {
    printf "%-12s %8s %12s\n", "Product", "Qty", "Revenue"
    printf "%-12s %8s %12s\n", "-------", "---", "-------"
    for (p in prod_qty) {
        printf "%-12s %8d $%10.2f\n", p, prod_qty[p], prod_rev[p]
    }
}' /tmp/sales.csv

Example 4: Parse ps Output for Top Memory Users

ps aux | awk 'NR > 1 {
    mem[$11] += $4
}
END {
    for (proc in mem) {
        if (mem[proc] > 0.5) {
            printf "%6.1f%%  %s\n", mem[proc], proc
        }
    }
}' | sort -rn | head -10

Example 5: Log Analysis

cat > /tmp/access.log << 'LOG'
10.0.0.1 - - [10/Mar/2025:14:00:01] "GET /api/users HTTP/1.1" 200 1234 0.045
10.0.0.2 - - [10/Mar/2025:14:00:02] "POST /api/orders HTTP/1.1" 201 567 0.230
10.0.0.1 - - [10/Mar/2025:14:00:03] "GET /api/users HTTP/1.1" 200 1234 0.038
10.0.0.3 - - [10/Mar/2025:14:00:04] "GET /api/products HTTP/1.1" 200 8901 0.120
10.0.0.2 - - [10/Mar/2025:14:00:05] "GET /api/users HTTP/1.1" 200 1234 0.042
10.0.0.1 - - [10/Mar/2025:14:00:06] "POST /api/orders HTTP/1.1" 500 234 1.500
10.0.0.4 - - [10/Mar/2025:14:00:07] "GET /api/products HTTP/1.1" 200 8901 0.115
10.0.0.1 - - [10/Mar/2025:14:00:08] "GET /health HTTP/1.1" 200 2 0.001
LOG

# Average response time per endpoint
awk '{
    endpoint = $6    # the request path; the quoted method ("GET) is $5
    time = $NF
    count[endpoint]++
    total[endpoint] += time
}
END {
    printf "%-20s %8s %10s\n", "Endpoint", "Requests", "Avg Time"
    printf "%-20s %8s %10s\n", "--------", "--------", "--------"
    for (ep in count) {
        printf "%-20s %8d %10.3fs\n", ep, count[ep], total[ep]/count[ep]
    }
}' /tmp/access.log

Example 6: Status Code Summary

awk '{
    codes[$8]++    # the status code is field 8 in this log format
}
END {
    for (code in codes) {
        printf "HTTP %s: %d requests\n", code, codes[code]
    }
}' /tmp/access.log | sort

Associative Arrays

awk has built-in associative arrays (similar to dictionaries/hash maps). You have already seen them in the examples above. Here are the details:

# Arrays are created by use
awk 'BEGIN {
    fruits["apple"] = 5
    fruits["banana"] = 3
    fruits["cherry"] = 8

    # Iterate over keys
    for (key in fruits) {
        print key, fruits[key]
    }

    # Check if key exists
    if ("apple" in fruits) {
        print "We have apples!"
    }

    # Delete an element
    delete fruits["banana"]

    # Length of array (GNU awk)
    print "Items:", length(fruits)
}'
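Array subscripts can also be compound. awk joins multiple subscripts into one string key using SUBSEP (by default the unprintable character "\034"):

```shell
# Compound keys: sales["North", "Widget"] stores under the joined key
# "North" SUBSEP "Widget". The parenthesized form ("a", "b") in arr
# tests for that same joined key, and split() on SUBSEP recovers the
# original subscripts.
awk 'BEGIN {
    sales["North", "Widget"] = 100
    sales["North", "Gadget"] = 120

    if (("North", "Widget") in sales)
        print "found:", sales["North", "Widget"]

    for (key in sales) {
        split(key, parts, SUBSEP)   # recover the two subscripts
        print parts[1], parts[2], sales[key]
    }
}'
# found: 100, then both entries (in unspecified order)
```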

Counting Pattern: The Most Common Use

# Count words in a file
awk '{
    for (i = 1; i <= NF; i++) {
        words[tolower($i)]++
    }
}
END {
    for (w in words) {
        printf "%5d %s\n", words[w], w
    }
}' /tmp/sed-lab.txt 2>/dev/null | sort -rn | head -10

Useful Built-In Functions

String Functions

# length() -- string length
echo "hello" | awk '{print length($0)}'   # 5

# substr() -- substring
echo "Hello World" | awk '{print substr($0, 7)}'   # World
echo "Hello World" | awk '{print substr($0, 1, 5)}'   # Hello

# index() -- find substring position
echo "Hello World" | awk '{print index($0, "World")}'   # 7

# split() -- split string into array
echo "a:b:c:d" | awk '{n = split($0, arr, ":"); for(i=1;i<=n;i++) print arr[i]}'

# toupper() / tolower()
echo "Hello World" | awk '{print toupper($0)}'   # HELLO WORLD
echo "Hello World" | awk '{print tolower($0)}'   # hello world

# gsub() -- global substitution (returns count of replacements)
echo "aabaa" | awk '{gsub(/a/, "X"); print}'   # XXbXX

# sub() -- substitute first occurrence only
echo "aabaa" | awk '{sub(/a/, "X"); print}'   # Xabaa

# match() -- regex match (sets RSTART and RLENGTH)
echo "Error at line 42" | awk '{match($0, /[0-9]+/); print substr($0, RSTART, RLENGTH)}'
# 42
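One more string function worth knowing: sprintf() formats exactly like printf but returns the string instead of printing it, so you can store the result in a variable:

```shell
# sprintf() -- printf formatting into a variable
echo "42" | awk '{ label = sprintf("ID-%05d", $1); print label }'
# ID-00042
```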

Numeric Functions

awk 'BEGIN {
    print int(3.9)        # 3
    print sqrt(144)       # 12
    print log(2.718)      # ~1
    print sin(3.14159)    # ~0
    print rand()          # random 0-1
    srand()               # seed random number generator
}'
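A common pattern with rand() is producing an integer in a range: scale, truncate with int(), and shift. Note that while srand(N) with a fixed seed makes a run repeatable, the exact sequence differs between awk implementations, so this sketch only checks the range:

```shell
# Simulate a die roll: rand() is in [0, 1), so int(rand() * 6) is
# 0-5, and adding 1 gives 1-6.
awk 'BEGIN {
    srand(42)                        # fixed seed: repeatable per awk
    roll = int(rand() * 6) + 1       # integer from 1 to 6
    msg = (roll >= 1 && roll <= 6) ? "in range" : "out of range"
    print msg
}'
# in range
```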

Think About It: Why does awk use gsub and sub for substitution instead of using the s/// syntax like sed? Think about awk's design as a programming language versus sed's design as a stream editor.


Debug This: awk Not Splitting Fields Correctly

You parse a CSV file and the fields seem wrong:

echo 'Alice,"New York, NY",30' | awk -F, '{print "Name:", $1, "City:", $2}'
# Name: Alice City: "New York

Problem: awk's -F, does not handle quoted CSV fields. The comma inside the quotes is treated as a field separator.

Solutions:

  1. Use FPAT (GNU awk) to define what a field looks like instead of what separates fields:
echo 'Alice,"New York, NY",30' | awk -v FPAT='([^,]*)|("[^"]*")' '{
    print "Name:", $1
    print "City:", $2
    print "Age:", $3
}'
  2. For serious CSV work, use a dedicated CSV tool such as csvtool, mlr (Miller), or a short Python script using the csv module.

What Just Happened?

+------------------------------------------------------------------+
|                         CHAPTER 22 RECAP                         |
+------------------------------------------------------------------+
|                                                                  |
|  - awk structure: pattern { action }                             |
|  - Fields: $1, $2, ..., $NF (automatic splitting)                |
|  - -F sets the field separator                                   |
|  - BEGIN runs before input; END runs after all input             |
|  - NR = line number, NF = number of fields                       |
|  - printf for formatted output (%-10s, %6d, %.2f)                |
|  - Associative arrays for counting and grouping                  |
|  - Built-ins: length, substr, split, gsub, toupper, tolower      |
|  - Comparisons: $3 > 100, $1 == "root", $7 ~ /bash/              |
|  - awk is ideal for: column extraction, aggregation,             |
|    reformatting structured text, and simple reporting            |
|                                                                  |
+------------------------------------------------------------------+

Try This

Exercise 1: System Report

Write an awk command that parses df -h output and prints only filesystems that are more than 50% full, formatted as a clean table.

Exercise 2: CSV Analysis

Using /tmp/sales.csv, write awk commands to:

  • Find the region with the highest total revenue
  • Find the product with the highest average price
  • Generate a formatted report with headers, data, and totals

Exercise 3: Log Parser

Using /tmp/access.log, write an awk program that:

  • Counts requests per IP address
  • Identifies the slowest request (highest response time)
  • Calculates the total bytes transferred
  • Reports the percentage of 5xx errors

Exercise 4: /etc/passwd Analysis

Using awk, produce a report showing:

  • Total number of users
  • Number of users with bash as their shell
  • Number of system accounts (UID < 1000)
  • Number of regular accounts (UID >= 1000)
  • The user with the highest UID

Bonus Challenge

Write an awk program that reads /etc/passwd and generates a properly formatted HTML table with columns for Username, UID, GID, Home Directory, and Shell. Include a header row and alternating row colors using inline CSS.