Regular Expressions

Why This Matters

You have a 2 GB server log and you need to find every line where someone accessed the /api/users endpoint from an IP address in the 10.0.x.x range. Or you need to validate that a configuration file contains properly formatted email addresses. Or you need to extract all phone numbers from a messy text dump.

You could write a custom program for each of these. Or you could write a regular expression in 30 seconds and use it with grep, sed, awk, or any programming language on the planet.

Regular expressions (regex) are a pattern language for matching text. They are among the most widely useful tools in computing: nearly every text editor, programming language, and log analysis tool supports some form of them. The dialects differ in details, but the core syntax carries over almost everywhere. Learn regex once, use it everywhere.

This chapter teaches you regex from the ground up: what the symbols mean, how to combine them, and how to use them with grep for real-world text searching.

Try This Right Now

# Create a sample file to work with
cat > /tmp/regex-lab.txt << 'DATA'
john.doe@example.com
jane_smith@company.org
invalid-email@
bob@test.co.uk
192.168.1.1
10.0.0.255
300.400.500.600
127.0.0.1
ERROR: Connection timeout at 14:23:45
WARNING: Disk usage at 85%
INFO: User login successful
ERROR: File not found: /var/data/report.csv
phone: 555-123-4567
phone: (555) 123-4567
phone: 5551234567
2025-03-10 14:22:01 server01 sshd: Failed password for root from 10.0.0.5
2025-03-10 14:23:15 server01 sshd: Accepted password for alice from 192.168.1.50
DATA

# Find lines containing "ERROR"
grep "ERROR" /tmp/regex-lab.txt

# Find lines starting with a number
grep "^[0-9]" /tmp/regex-lab.txt

# Find email-like patterns
grep -E "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}" /tmp/regex-lab.txt

# Find IP addresses (rough pattern)
grep -E "[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}" /tmp/regex-lab.txt

BRE vs ERE: Two Regex Flavors

Linux tools support two flavors of regular expressions:

BRE (Basic Regular Expressions) -- the default for grep and sed:

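Metacharacters ? + { } ( ) | must be escaped with \ to have special meaning (\{ \} and \( \) are POSIX; \? \+ and \| are GNU extensions that other grep implementations may not support)
Without escaping, they are literal characters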

ERE (Extended Regular Expressions) -- used by grep -E (or egrep) and sed -E:

Metacharacters ? + { } ( ) | have special meaning by default
To match them literally, escape with \

# BRE: must escape + and ( )
grep 'ab\+c' file          # One or more 'b'
grep '\(abc\)\{2\}' file   # Exactly two "abc"

# ERE: cleaner, no escaping needed
grep -E 'ab+c' file        # One or more 'b'
grep -E '(abc){2}' file    # Exactly two "abc"

Recommendation: Use grep -E (ERE) for nearly everything. It is cleaner and more readable. The rest of this chapter uses ERE unless noted.

+-----------------+--------------+-------------+
| Feature         | BRE          | ERE         |
|-----------------|--------------|-------------|
| ?               | literal      | 0 or 1      |
| +               | literal      | 1 or more   |
| {n,m}           | \{n,m\}      | {n,m}       |
| ( )             | \( \)        | ( )         |
| |               | literal      | alternation |
| . * ^ $ [ ]     | same         | same        |
+-----------------+--------------+-------------+
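
To see the difference in practice, here is a minimal sketch that runs the same pattern through both flavors. The two input lines are made up for illustration, and the variable names are arbitrary.

```shell
# Two input lines: one contains a literal "+", the other repeats "a".
input=$(printf 'a+b\naab\n')

# BRE (plain grep): an unescaped + is a literal plus sign.
bre_result=$(printf '%s\n' "$input" | grep 'a+b')

# ERE (grep -E): + means "one or more of the preceding element".
ere_result=$(printf '%s\n' "$input" | grep -E 'a+b')

echo "BRE matched: $bre_result"   # BRE matched: a+b
echo "ERE matched: $ere_result"   # ERE matched: aab
```

The same four characters select different lines depending on the flavor, which is why it pays to check whether -E is present when a pattern misbehaves.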

Metacharacters: The Building Blocks

The Dot: Match Any Character

. matches any single character (except newline):

echo -e "cat\ncar\ncap\ncab\ncan" | grep -E 'ca.'
# cat, car, cap, cab, can -- all match

echo -e "cat\ncoat\nct" | grep -E 'c.t'
# cat matches (exactly one character between c and t)
# ct does NOT match (. needs exactly one character)
# coat does NOT match (only one . so only one char between c and t)

Anchors: Where to Match

^ matches the start of a line. $ matches the end:

# Lines starting with "ERROR"
grep -E '^ERROR' /tmp/regex-lab.txt

# Lines ending with ".com"
grep -E '\.com$' /tmp/regex-lab.txt

# Lines that are exactly "127.0.0.1"
grep -E '^127\.0\.0\.1$' /tmp/regex-lab.txt

# Empty lines
grep -E '^$' /tmp/regex-lab.txt
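
The effect of anchoring is easiest to see by counting matches with and without the anchors. The sketch below uses three made-up lines rather than the lab file, so it can be run on its own.

```shell
lines=$(printf '127.0.0.1\n127.0.0.10\nhost 127.0.0.1 up\n')

# Unanchored: the pattern may match anywhere inside a line.
unanchored=$(printf '%s\n' "$lines" | grep -Ec '127\.0\.0\.1')

# Anchored: the whole line must be exactly the address.
anchored=$(printf '%s\n' "$lines" | grep -Ec '^127\.0\.0\.1$')

echo "unanchored: $unanchored, anchored: $anchored"   # unanchored: 3, anchored: 1
```

Note that the unanchored pattern also counts 127.0.0.10, because 127.0.0.1 appears inside it.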

Character Classes: Match One of a Set

[...] matches any single character in the set:

# Match vowels
echo -e "bat\nbet\nbit\nbot\nbut" | grep -E 'b[aeiou]t'
# bat, bet, bit, bot, but

# Match digits
grep -E '[0-9]' /tmp/regex-lab.txt

# Match uppercase letters
grep -E '[A-Z]' /tmp/regex-lab.txt

# Negate: match anything NOT in the set
echo -e "bat\nbet\nbit\nbot\nbut" | grep -E 'b[^aeiou]t'
# (no output -- all have vowels)
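
Most metacharacters lose their special meaning inside brackets. As a small sketch on made-up input: a dot inside [...] is a literal dot, and a hyphen placed last in the set is a literal hyphen rather than a range operator.

```shell
# [.-] means "a dot or a hyphen"; neither is special in this position.
matches=$(printf 'a.b\na-b\naxb\n' | grep -E '^a[.-]b$')

echo "$matches"
# a.b
# a-b
```

This is why patterns such as [a-zA-Z0-9._%+-] can list the dot unescaped and keep the hyphen at the end.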

POSIX Character Classes

POSIX character classes are more portable than ranges like [A-Z], whose meaning depends on the locale:

Class	Matches
`[:alpha:]`	Letters (a-z, A-Z)
`[:digit:]`	Digits (0-9)
`[:alnum:]`	Letters and digits
`[:upper:]`	Uppercase letters
`[:lower:]`	Lowercase letters
`[:space:]`	Whitespace (space, tab, newline)
`[:punct:]`	Punctuation characters
`[:print:]`	Printable characters

# Match lines containing uppercase letters
grep -E '[[:upper:]]' /tmp/regex-lab.txt

# Match lines starting with a digit
grep -E '^[[:digit:]]' /tmp/regex-lab.txt

Note the double brackets: [[:digit:]]. The outer [] is the character class syntax; the inner [:digit:] is the POSIX class name.
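
Because the POSIX class name lives inside the outer brackets, it can be combined with other characters in the same set. The sketch below uses a hypothetical "identifier" rule (a letter or underscore, followed by letters, digits, or underscores) purely as an illustration.

```shell
# [[:alpha:]_]  -> one letter or underscore
# [[:alnum:]_]* -> any number of letters, digits, or underscores
ids=$(printf 'user_id\n_tmp\n9lives\nmax-conn\n' \
    | grep -E '^[[:alpha:]_][[:alnum:]_]*$')

echo "$ids"
# user_id
# _tmp
```

9lives is rejected because it starts with a digit, and max-conn because the hyphen is not in either set.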

Quantifiers: How Many Times

Quantifiers specify how many times the preceding element must match:

Quantifier	Meaning
`*`	Zero or more
`+`	One or more (ERE)
`?`	Zero or one (ERE)
`{n}`	Exactly n times (ERE)
`{n,}`	n or more times (ERE)
`{n,m}`	Between n and m times (ERE)

# * -- zero or more
echo -e "ac\nabc\nabbc\nabbbc" | grep -E 'ab*c'
# ac, abc, abbc, abbbc (all match -- zero or more 'b')

# + -- one or more
echo -e "ac\nabc\nabbc\nabbbc" | grep -E 'ab+c'
# abc, abbc, abbbc (NOT ac -- needs at least one 'b')

# ? -- zero or one
echo -e "color\ncolour" | grep -E 'colou?r'
# color, colour (the 'u' is optional)

# {n} -- exactly n times
echo -e "ab\naab\naaab\naaaab" | grep -E 'a{3}b'
# aaab (exactly 3 a's before b)

# {n,m} -- between n and m times
echo -e "ab\naab\naaab\naaaab" | grep -E 'a{2,3}b'
# aab, aaab (2 or 3 a's before b)

# {n,} -- n or more
echo -e "ab\naab\naaab\naaaab" | grep -E 'a{2,}b'
# aab, aaab, aaaab (2 or more a's)

Think About It: What is the difference between .* and .+? When would the distinction matter?
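
One way to explore this question is to try both on input where a value may be empty. This is a minimal sketch with made-up key=value lines:

```shell
data=$(printf 'key=\nkey=value\n')

# .* accepts an empty value; .+ requires at least one character after "=".
star=$(printf '%s\n' "$data" | grep -Ec '^key=.*$')
plus=$(printf '%s\n' "$data" | grep -Ec '^key=.+$')

echo "with .*: $star line(s); with .+: $plus line(s)"
# with .*: 2 line(s); with .+: 1 line(s)
```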

Alternation and Grouping

Alternation: OR

The | operator matches either the left or right pattern:

# Match ERROR or WARNING
grep -E 'ERROR|WARNING' /tmp/regex-lab.txt

# Match cat, dog, or fish
echo -e "I have a cat\nI have a dog\nI have a fish" | grep -E 'cat|dog|fish'

Grouping: Parentheses

Parentheses group parts of a pattern:

# With grouping: matches "gray" or "grey" (but not "gruy")
echo -e "gray\ngrey\ngruy" | grep -E 'gr(a|e)y'
# gray, grey

# Group + quantifier
echo -e "ab\nabab\nababab" | grep -E '(ab){2,}'
# abab, ababab

# Match repeated words
echo -e "the the cat\na big big dog" | grep -E '([a-z]+) \1'
# (This uses backreferences -- see below)
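
Grouping also limits how far an alternation reaches. Without parentheses, | splits the entire pattern into two alternatives. The sketch below, on a made-up word list, compares the grouped pattern with an ungrouped one.

```shell
words=$(printf 'gray\ngrey\ngranola\nobey\n')

# Grouped: "gr", then "a" or "e", then "y".
grouped=$(printf '%s\n' "$words" | grep -E 'gr(a|e)y')

# Ungrouped: either "gra" or "ey", anywhere in the line.
ungrouped=$(printf '%s\n' "$words" | grep -Ec 'gra|ey')

echo "$grouped"
# gray
# grey
echo "ungrouped pattern matched $ungrouped lines"
# ungrouped pattern matched 4 lines
```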

Backreferences

Capture groups and refer back to them with \1, \2, etc.:

# Find repeated words (BRE -- backreferences work in BRE with grep)
echo -e "the the cat\na big dog" | grep '\([a-z]\+\) \1'
# the the cat

# Note: backreferences in ERE support varies by tool.
# grep -E supports them on GNU grep:
echo -e "the the cat\na big dog" | grep -E '([a-z]+) \1'

Backreferences are most useful in sed for search-and-replace (covered in Chapter 21).

Practical Examples

Example 1: Matching IP Addresses

A rough pattern for IPv4 addresses:

# Basic pattern (matches invalid IPs too, like 999.999.999.999)
grep -E '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' /tmp/regex-lab.txt

A more precise pattern (validates 0-255 for each octet):

# Strict IPv4 validation
grep -E '^((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$' /tmp/regex-lab.txt

Let us break this down:

25[0-5]           --> matches 250-255
2[0-4][0-9]       --> matches 200-249
[01]?[0-9][0-9]?  --> matches 0-199 (also accepts leading zeros, such as 007)
\.                 --> literal dot
{3}               --> repeat the octet+dot pattern 3 times

# Test it
echo "192.168.1.1" | grep -E '^((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$'
# Match

echo "300.400.500.600" | grep -E '^((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$'
# No match (correct!)
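
Repeating the long pattern by hand is error-prone. One possible way to keep it readable is to store the octet pattern in a shell variable and wrap the check in a function. The names octet and is_ipv4 below are hypothetical helpers, not part of any standard tool.

```shell
# One octet: 250-255, 200-249, or 0-199 (same alternatives as above).
octet='(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'

# Hypothetical helper: succeeds if $1 is a dotted-quad IPv4 address.
# -q suppresses output; only the exit status is used.
is_ipv4() {
    printf '%s\n' "$1" | grep -Eq "^(${octet}\.){3}${octet}\$"
}

for addr in 192.168.1.1 10.0.0.255 256.1.1.1 300.400.500.600 1.2.3; do
    if is_ipv4 "$addr"; then
        echo "$addr: valid"
    else
        echo "$addr: invalid"
    fi
done
# 192.168.1.1: valid
# 10.0.0.255: valid
# 256.1.1.1: invalid
# 300.400.500.600: invalid
# 1.2.3: invalid
```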

Example 2: Matching Email Addresses

# Simplified email pattern
grep -E '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' /tmp/regex-lab.txt

Breaking it down:

[a-zA-Z0-9._%+-]+     --> local part (before @): letters, digits, special chars
@                       --> literal @
[a-zA-Z0-9.-]+        --> domain name: letters, digits, dots, hyphens
\.                      --> literal dot
[a-zA-Z]{2,}          --> TLD: at least 2 letters
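
Used as above, the pattern finds an email-like string anywhere in a line. To check that a whole line is an address, grep's -x flag (match the entire line) can be added. The following is a sketch on made-up input, using the same simplified pattern:

```shell
email_re='[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

# -x requires the pattern to match the entire line.
valid=$(printf '%s\n' \
    'john.doe@example.com' \
    'invalid-email@' \
    'bob@test.co.uk' \
    'contact: jane_smith@company.org' \
    | grep -Ex "$email_re")

echo "$valid"
# john.doe@example.com
# bob@test.co.uk
```

The last input line contains a plausible address but is rejected, because the line as a whole is not an address.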

Example 3: Matching Log Lines

# Match times in HH:MM:SS form, such as "14:23:45" (including the time part of "2025-03-10 14:22:01")
grep -E '[0-9]{2}:[0-9]{2}:[0-9]{2}' /tmp/regex-lab.txt

# Match date-time format "YYYY-MM-DD HH:MM:SS"
grep -E '[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}' /tmp/regex-lab.txt

# Match failed SSH attempts, including the username (add -o to print only the matched text)
grep -E 'Failed password for [a-zA-Z0-9_]+' /tmp/regex-lab.txt
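
grep selects lines; it does not pull fields out of them by itself. One possible way to get only the usernames is to combine -o with a second tool. The sketch below uses awk to print the fourth word of each match; the two log lines are copied from the lab file.

```shell
log=$(printf '%s\n' \
    '2025-03-10 14:22:01 server01 sshd: Failed password for root from 10.0.0.5' \
    '2025-03-10 14:23:15 server01 sshd: Accepted password for alice from 192.168.1.50')

# -o prints "Failed password for root"; awk keeps the 4th word.
users=$(printf '%s\n' "$log" \
    | grep -Eo 'Failed password for [a-zA-Z0-9_]+' \
    | awk '{print $4}')

echo "$users"   # root
```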

Example 4: Matching Phone Numbers (Multiple Formats)

# Match various phone formats
grep -E '(\(?[0-9]{3}\)?[-. ]?)?[0-9]{3}[-. ]?[0-9]{4}' /tmp/regex-lab.txt
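
To check what the pattern actually captures in each format, it can be run with -o on the three phone lines from the lab file. This is a minimal sketch, with the pattern stored in a variable for readability:

```shell
# Optional area code (with or without parentheses), then 3 digits and 4 digits,
# each part optionally separated by a hyphen, dot, or space.
phone_re='(\(?[0-9]{3}\)?[-. ]?)?[0-9]{3}[-. ]?[0-9]{4}'

numbers=$(printf '%s\n' \
    'phone: 555-123-4567' \
    'phone: (555) 123-4567' \
    'phone: 5551234567' \
    | grep -Eo "$phone_re")

echo "$numbers"
# 555-123-4567
# (555) 123-4567
# 5551234567
```

Because the area-code group is optional, the pattern also accepts bare seven-digit numbers, which may or may not be what you want.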

grep Options for Regex Work

Essential grep Flags

# -E: Extended regex (always use this)
grep -E 'pattern' file

# -i: Case-insensitive
grep -Ei 'error|warning' /var/log/syslog

# -v: Invert match (show lines that do NOT match)
grep -Ev '^#|^$' /etc/ssh/sshd_config
# Show config without comments or blank lines

# -c: Count matching lines
grep -Ec 'ERROR' logfile

# -n: Show line numbers
grep -En 'TODO' *.py

# -l: Show only filenames (not matching lines)
grep -Erl 'password' /etc/

# -o: Show only the matching part (not the whole line)
grep -Eo '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' /tmp/regex-lab.txt

# -w: Match whole words only
echo -e "cat\ncatalog\nconcat" | grep -w 'cat'
# Only "cat" matches, not "catalog" or "concat"

# -A N: Show N lines AFTER match
grep -EA 2 'ERROR' /tmp/regex-lab.txt

# -B N: Show N lines BEFORE match
grep -EB 2 'ERROR' /tmp/regex-lab.txt

# -C N: Show N lines of context (before and after)
grep -EC 2 'ERROR' /tmp/regex-lab.txt

Combining grep with Other Tools

# Count occurrences of each IP address in a log and show the 10 most frequent
grep -Eo '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' access.log \
    | sort | uniq -c | sort -rn | head -10

# Find top-level function definitions in a Python file (indented methods are not matched)
grep -En '^def [a-zA-Z_]+' script.py

# Find TODO/FIXME comments across a project
grep -Ern 'TODO|FIXME|HACK|XXX' /opt/myproject/ --include="*.py"

# Find config files containing a specific setting
grep -Erl 'max_connections' /etc/
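
The commands above assume files such as access.log exist. The following self-contained sketch runs the IP-counting pipeline on four made-up log lines, so the intermediate results are easy to verify by eye.

```shell
log=$(printf '%s\n' \
    '10.0.0.5 GET /api/users' \
    '192.168.1.50 GET /index.html' \
    '10.0.0.5 POST /api/users' \
    '10.0.0.5 GET /health')

ip_re='[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}'

# Most frequent address: extract, group, count, sort by count descending.
top=$(printf '%s\n' "$log" | grep -Eo "$ip_re" \
    | sort | uniq -c | sort -rn | head -1 \
    | awk '{print $2 " (" $1 " requests)"}')

# Number of distinct addresses: sort -u drops duplicates before counting.
distinct=$(printf '%s\n' "$log" | grep -Eo "$ip_re" | sort -u | wc -l)

echo "top: $top"   # top: 10.0.0.5 (3 requests)
echo "distinct addresses: $distinct"
```

uniq -c reports how often each address occurs, while sort -u | wc -l answers the different question of how many distinct addresses there are.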

Hands-On: Regex Practice

Step 1: Setup

# Create a practice log file
cat > /tmp/practice.log << 'LOG'
2025-03-10 08:00:01 INFO  Application started on port 8080
2025-03-10 08:00:02 INFO  Connected to database at 10.0.1.50:5432
2025-03-10 08:15:33 WARN  High memory usage: 82%
2025-03-10 08:30:00 INFO  Processed 1500 requests in 60s
2025-03-10 09:00:01 ERROR Connection refused to 10.0.1.50:5432
2025-03-10 09:00:05 ERROR Retry 1/3: Connection refused
2025-03-10 09:00:10 ERROR Retry 2/3: Connection refused
2025-03-10 09:00:15 ERROR Retry 3/3: Connection refused
2025-03-10 09:00:15 FATAL All retries exhausted, shutting down
2025-03-10 09:01:00 INFO  Application restarted by systemd
2025-03-10 09:01:01 INFO  Connected to database at 10.0.1.50:5432
2025-03-10 10:45:22 WARN  Slow query detected: 2340ms
2025-03-10 11:00:00 INFO  Health check: OK
2025-03-10 12:30:45 ERROR Invalid input from user_id=42: "Robert'); DROP TABLE users;--"
2025-03-10 13:00:00 INFO  Backup completed: /var/backups/db-20250310.sql.gz (2.3GB)
LOG

Step 2: Practice Queries

# 1. Find all ERROR and FATAL lines
grep -E '(ERROR|FATAL)' /tmp/practice.log

# 2. Find all IP addresses
grep -Eo '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' /tmp/practice.log

# 3. Find lines with percentage values
grep -E '[0-9]+%' /tmp/practice.log

# 4. Find entries logged during the 09:00 hour (09:00:00 through 09:59:59)
grep -E '09:[0-9]{2}:[0-9]{2}' /tmp/practice.log

# 5. Find retry messages and extract the attempt number
grep -Eo 'Retry [0-9]+/[0-9]+' /tmp/practice.log

# 6. Find lines that do NOT contain INFO
grep -Ev 'INFO' /tmp/practice.log

# 7. Find the SQL injection attempt
grep -E "DROP TABLE" /tmp/practice.log
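
The queries above can be combined with the counting tools from the previous section. As one possible extension, this sketch summarizes how many entries each log level has, using five lines taken from the practice log:

```shell
log=$(printf '%s\n' \
    '2025-03-10 08:00:01 INFO  Application started on port 8080' \
    '2025-03-10 08:15:33 WARN  High memory usage: 82%' \
    '2025-03-10 09:00:01 ERROR Connection refused to 10.0.1.50:5432' \
    '2025-03-10 09:00:05 ERROR Retry 1/3: Connection refused' \
    '2025-03-10 09:00:15 FATAL All retries exhausted, shutting down')

# -o prints only the level; -w makes sure it is matched as a whole word.
summary=$(printf '%s\n' "$log" \
    | grep -Eow 'INFO|WARN|ERROR|FATAL' \
    | sort | uniq -c | sort -rn)

echo "$summary"

# Pull a single count out of the summary.
errors=$(printf '%s\n' "$summary" | awk '$2 == "ERROR" {print $1}')
echo "ERROR entries: $errors"   # ERROR entries: 2
```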

Debug This: Why Does My Regex Match Too Much?

You write this command to find failed SSH logins, and it prints the expected line:

grep -E "Failed password for .+ from [0-9]+.[0-9]+.[0-9]+.[0-9]+" /tmp/regex-lab.txt

Problem: The unescaped . in the IP pattern matches any character, not just a literal dot. The regex does not fail; it works "too well". It matches the line you wanted, but it would also match input it should reject. For example, a line ending in "from 1234567" also satisfies the pattern, because each . happily matches a digit. Bugs like this stay hidden until such input shows up.

The effect is easier to see in isolation:

# Common mistake: forgetting to escape dots in IP patterns
echo "192x168x1x1" | grep -E '192.168.1.1'
# MATCHES! Because . means "any character"

echo "192x168x1x1" | grep -E '192\.168\.1\.1'
# No match (correct -- dots must be literal)

Lesson: When matching literal dots or other metacharacters, always escape them with \.

Common regex debugging tips:

Start with a simpler pattern and gradually add complexity
Use grep -o to see exactly what is matching
Test on simple input first, then scale to real data
Remember to escape metacharacters when you want their literal form
Check BRE vs ERE -- are you using grep or grep -E?
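
The first two tips can be sketched as follows: build a pattern for "IP address and port" in three steps, using grep -o at each step to see what the pattern matches so far. The log line is taken from the practice log; the step variables are only for illustration.

```shell
line='2025-03-10 09:00:01 ERROR Connection refused to 10.0.1.50:5432'

# Step 1: any run of digits. The first hit is the year, not the address.
step1=$(printf '%s\n' "$line" | grep -Eo '[0-9]+' | head -1)

# Step 2: digits, a literal dot, digits. Now the first hit is inside the address.
step2=$(printf '%s\n' "$line" | grep -Eo '[0-9]+\.[0-9]+' | head -1)

# Step 3: four dot-separated groups followed by a colon and a port.
step3=$(printf '%s\n' "$line" | grep -Eo '([0-9]{1,3}\.){3}[0-9]{1,3}:[0-9]+')

echo "step 1: $step1"   # step 1: 2025
echo "step 2: $step2"   # step 2: 10.0
echo "step 3: $step3"   # step 3: 10.0.1.50:5432
```

Each step narrows the match, and -o makes it obvious when a pattern is picking up the wrong part of the line.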

Quick Reference

+------------------------------------------------------------+
|  REGEX QUICK REFERENCE (ERE)                                |
+------------------------------------------------------------+
|                                                             |
|  .          Any character (except newline)                  |
|  ^          Start of line                                   |
|  $          End of line                                     |
|  *          Zero or more of preceding                       |
|  +          One or more of preceding                        |
|  ?          Zero or one of preceding                        |
|  {n}        Exactly n of preceding                          |
|  {n,m}      Between n and m of preceding                    |
|  [abc]      One character from set                          |
|  [^abc]     One character NOT in set                        |
|  [a-z]      Character range                                 |
|  (abc)      Group                                           |
|  a|b        Alternation (a or b)                            |
|  \1         Backreference to group 1                        |
|  \.         Literal dot (escape metacharacters with \)      |
|                                                             |
|  [:alpha:]  Letters        [:digit:]  Digits                |
|  [:alnum:]  Alphanumeric   [:space:]  Whitespace            |
|  [:upper:]  Uppercase      [:lower:]  Lowercase             |
+------------------------------------------------------------+

What Just Happened?

+------------------------------------------------------------------+
|                     CHAPTER 20 RECAP                              |
+------------------------------------------------------------------+
|                                                                  |
|  - Regular expressions match patterns in text                    |
|  - BRE (basic) vs ERE (extended) -- use grep -E for ERE        |
|  - . matches any character; use \. for literal dot              |
|  - ^ and $ anchor to line start/end                             |
|  - [abc] character classes; [^abc] negated classes              |
|  - *, +, ?, {n,m} are quantifiers                               |
|  - ( ) groups patterns; | provides alternation                  |
|  - grep -o shows only matching text                             |
|  - grep -E 'pattern' is your go-to for searching               |
|  - Always escape metacharacters when matching literally         |
|  - Build patterns incrementally: start simple, add detail       |
|                                                                  |
+------------------------------------------------------------------+

Try This

Exercise 1: Log Analysis

Using /tmp/practice.log, write regex patterns to:

Extract all timestamps (HH:MM:SS format)
Find lines where processing took more than 1000ms
Extract all file paths (starting with /)

Exercise 2: Data Validation

Write regex patterns that validate:

A date in YYYY-MM-DD format
A 24-hour time in HH:MM format
A US ZIP code (5 digits, optionally followed by dash and 4 digits)

Test each with echo "test-string" | grep -E 'pattern'.

Linux Book: From First Boot to Production