Regular Expressions
Why This Matters
You have a 2 GB server log and you need to find every line where someone accessed the
/api/users endpoint from an IP address starting with 10.0.. Or you need to validate
that a configuration file contains properly formatted email addresses. Or you need to
extract all phone numbers from a messy text dump.
You could write a custom program for each of these. Or you could write a regular
expression in 30 seconds and use it with grep, sed, awk, or any programming
language on the planet.
Regular expressions (regex) are a pattern language for matching text. They are one of the most powerful and widely useful tools in computing. Nearly every text editor, programming language, and log analysis tool supports them, though the exact syntax varies slightly from tool to tool. Learn regex once and you can use it almost everywhere.
This chapter teaches you regex from the ground up: what the symbols mean, how to combine
them, and how to use them with grep for real-world text searching.
Try This Right Now
# Create a sample file to work with
cat > /tmp/regex-lab.txt << 'DATA'
john.doe@example.com
jane_smith@company.org
invalid-email@
bob@test.co.uk
192.168.1.1
10.0.0.255
300.400.500.600
127.0.0.1
ERROR: Connection timeout at 14:23:45
WARNING: Disk usage at 85%
INFO: User login successful
ERROR: File not found: /var/data/report.csv
phone: 555-123-4567
phone: (555) 123-4567
phone: 5551234567
2025-03-10 14:22:01 server01 sshd: Failed password for root from 10.0.0.5
2025-03-10 14:23:15 server01 sshd: Accepted password for alice from 192.168.1.50
DATA
# Find lines containing "ERROR"
grep "ERROR" /tmp/regex-lab.txt
# Find lines starting with a number
grep "^[0-9]" /tmp/regex-lab.txt
# Find email-like patterns
grep -E "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}" /tmp/regex-lab.txt
# Find IP addresses (rough pattern)
grep -E "[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}" /tmp/regex-lab.txt
BRE vs ERE: Two Regex Flavors
Linux tools support two flavors of regular expressions:
BRE (Basic Regular Expressions) -- the default for grep and sed:
- Metacharacters ? + { } ( ) | must be escaped with \ to have special meaning
- Without escaping, they are literal characters
ERE (Extended Regular Expressions) -- used by grep -E (or egrep) and sed -E:
- Metacharacters ? + { } ( ) | have special meaning by default
- To match them literally, escape with \
# BRE: must escape + and ( )
grep 'ab\+c' file # One or more 'b'
grep '\(abc\)\{2\}' file # Exactly two "abc"
# ERE: cleaner, no escaping needed
grep -E 'ab+c' file # One or more 'b'
grep -E '(abc){2}' file # Exactly two "abc"
Recommendation: Use grep -E (ERE) for nearly everything. It is cleaner and more
readable. The rest of this chapter uses ERE unless noted.
+-----------------+--------------+-------------+
| Feature         | BRE          | ERE         |
|-----------------|--------------|-------------|
| ?               | literal      | 0 or 1      |
| +               | literal      | 1 or more   |
| {n,m}           | \{n,m\}      | {n,m}       |
| ( )             | \( \)        | ( )         |
| |               | literal      | alternation |
| . * ^ $ [ ]     | same         | same        |
+-----------------+--------------+-------------+
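To see the difference in practice, the sketch below feeds the same pattern text to both flavors. It assumes GNU grep; the sample strings and variable names are made up for illustration.

```shell
# Three sample lines: two ordinary strings and one containing a literal "+".
sample=$(printf '%s\n' 'abc' 'abbc' 'ab+c')

# BRE (plain grep): an unescaped + is an ordinary character.
bre_plain=$(printf '%s\n' "$sample" | grep 'ab+c')

# ERE (grep -E): + means "one or more of the preceding element".
ere_plain=$(printf '%s\n' "$sample" | grep -E 'ab+c')

# ERE with the plus escaped: matches the literal text "ab+c" again.
ere_escaped=$(printf '%s\n' "$sample" | grep -E 'ab\+c')

printf 'grep ab+c matched:\n%s\n\n' "$bre_plain"
printf 'grep -E ab+c matched:\n%s\n\n' "$ere_plain"
printf 'grep -E with escaped plus matched:\n%s\n' "$ere_escaped"
```

The same four characters select different lines depending on the flavor, which is why it pays to decide on BRE or ERE before writing the pattern.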
Metacharacters: The Building Blocks
The Dot: Match Any Character
. matches any single character (except newline):
echo -e "cat\ncar\ncap\ncab\ncan" | grep -E 'ca.'
# cat, car, cap, cab, can -- all match
echo -e "cat\ncoat\nct" | grep -E 'c.t'
# cat matches
# ct does NOT match (. needs exactly one character)
# coat does NOT match (only one . so only one char between c and t)
Anchors: Where to Match
^ matches the start of a line. $ matches the end:
# Lines starting with "ERROR"
grep -E '^ERROR' /tmp/regex-lab.txt
# Lines ending with ".com"
grep -E '\.com$' /tmp/regex-lab.txt
# Lines that are exactly "127.0.0.1"
grep -E '^127\.0\.0\.1$' /tmp/regex-lab.txt
# Empty lines
grep -E '^$' /tmp/regex-lab.txt
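The effect of an anchor is easiest to see by counting matches with and without it. The following is a small sketch on invented sample text (the sample file above contains no empty lines, so this one includes one on purpose):

```shell
# Four lines; the third is empty.
text=$(printf '%s\n' 'ERROR: disk full' 'status: ERROR' '' 'no ERROR here')

# Unanchored: ERROR may appear anywhere on the line.
anywhere=$(printf '%s\n' "$text" | grep -Ec 'ERROR')

# ^ pins the match to the start of the line, $ to the end.
at_start=$(printf '%s\n' "$text" | grep -Ec '^ERROR')
at_end=$(printf '%s\n' "$text" | grep -Ec 'ERROR$')

# ^$ leaves no room for any character, so it matches only empty lines.
empty=$(printf '%s\n' "$text" | grep -Ec '^$')

printf 'anywhere=%s at_start=%s at_end=%s empty=%s\n' \
    "$anywhere" "$at_start" "$at_end" "$empty"
```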
Character Classes: Match One of a Set
[...] matches any single character in the set:
# Match vowels
echo -e "bat\nbet\nbit\nbot\nbut" | grep -E 'b[aeiou]t'
# bat, bet, bit, bot, but
# Match digits
grep -E '[0-9]' /tmp/regex-lab.txt
# Match uppercase letters
grep -E '[A-Z]' /tmp/regex-lab.txt
# Negate: match anything NOT in the set
echo -e "bat\nbet\nbit\nbot\nbut" | grep -E 'b[^aeiou]t'
# (no output -- all have vowels)
POSIX Character Classes
More portable than ranges like [A-Z] (which depend on locale):
| Class | Matches |
|---|---|
| [:alpha:] | Letters (a-z, A-Z) |
| [:digit:] | Digits (0-9) |
| [:alnum:] | Letters and digits |
| [:upper:] | Uppercase letters |
| [:lower:] | Lowercase letters |
| [:space:] | Whitespace (space, tab, newline) |
| [:punct:] | Punctuation characters |
| [:print:] | Printable characters |
# Match lines containing uppercase letters
grep -E '[[:upper:]]' /tmp/regex-lab.txt
# Match lines starting with a digit
grep -E '^[[:digit:]]' /tmp/regex-lab.txt
Note the double brackets: [[:digit:]]. The outer [] is the character class syntax;
the inner [:digit:] is the POSIX class name.
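Because the POSIX class name lives inside the outer brackets, several classes can share one bracket expression, and the whole set can be negated with ^. A short sketch, using invented sample lines:

```shell
lines=$(printf '%s\n' 'hello world' 'Hello' 'abc123' '42')

# Lines made up entirely of digits.
only_digits=$(printf '%s\n' "$lines" | grep -E '^[[:digit:]]+$')

# Two classes in one bracket expression: an uppercase letter OR a digit.
upper_or_digit=$(printf '%s\n' "$lines" | grep -E '[[:upper:][:digit:]]')

# Negated class: lines containing at least one non-alphanumeric character.
non_alnum=$(printf '%s\n' "$lines" | grep -E '[^[:alnum:]]')

printf 'only digits:\n%s\n\n' "$only_digits"
printf 'upper or digit:\n%s\n\n' "$upper_or_digit"
printf 'has non-alphanumeric:\n%s\n' "$non_alnum"
```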
Quantifiers: How Many Times
Quantifiers specify how many times the preceding element must match:
| Quantifier | Meaning |
|---|---|
| * | Zero or more |
| + | One or more (ERE) |
| ? | Zero or one (ERE) |
| {n} | Exactly n times (ERE) |
| {n,} | n or more times (ERE) |
| {n,m} | Between n and m times (ERE) |
# * -- zero or more
echo -e "ac\nabc\nabbc\nabbbc" | grep -E 'ab*c'
# ac, abc, abbc, abbbc (all match -- zero or more 'b')
# + -- one or more
echo -e "ac\nabc\nabbc\nabbbc" | grep -E 'ab+c'
# abc, abbc, abbbc (NOT ac -- needs at least one 'b')
# ? -- zero or one
echo -e "color\ncolour" | grep -E 'colou?r'
# color, colour (the 'u' is optional)
# {n} -- exactly n times
echo -e "ab\naab\naaab\naaaab" | grep -E 'a{3}b'
# aaab (exactly 3 a's before b)
# {n,m} -- between n and m times
echo -e "ab\naab\naaab\naaaab" | grep -E 'a{2,3}b'
# aab, aaab (2 or 3 a's before b)
# {n,} -- n or more
echo -e "ab\naab\naaab\naaaab" | grep -E 'a{2,}b'
# aab, aaab, aaaab (2 or more a's)
Think About It: What is the difference between .* and .+? When would the distinction matter?
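One way to explore the question is a small experiment with key=value lines where a value may be missing. This is only a sketch; the config lines and variable names are invented.

```shell
config=$(printf '%s\n' 'name=alice' 'name=' 'port=8080')

# .* allows zero characters after the equals sign, so "name=" qualifies.
with_star=$(printf '%s\n' "$config" | grep -E '^name=.*$')

# .+ demands at least one character, so the empty value is rejected.
with_plus=$(printf '%s\n' "$config" | grep -E '^name=.+$')

printf 'with .* :\n%s\n\n' "$with_star"
printf 'with .+ :\n%s\n' "$with_plus"
```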
Alternation and Grouping
Alternation: OR
The | operator matches either the left or right pattern:
# Match ERROR or WARNING
grep -E 'ERROR|WARNING' /tmp/regex-lab.txt
# Match cat, dog, or fish
echo -e "I have a cat\nI have a dog\nI have a fish" | grep -E 'cat|dog|fish'
Grouping: Parentheses
Parentheses group parts of a pattern:
# With grouping: matches "gray" or "grey", but not "gruy"
echo -e "gray\ngrey\ngruy" | grep -E 'gr(a|e)y'
# gray, grey
# Group + quantifier
echo -e "ab\nabab\nababab" | grep -E '(ab){2,}'
# abab, ababab
# Match repeated words
echo -e "the the cat\na big big dog" | grep -E '([a-z]+) \1'
# (This uses backreferences -- see below)
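Grouping matters because alternation has the lowest precedence of all regex operators: without parentheses, | splits the entire pattern in two. The sketch below compares the grouped and ungrouped forms on a few invented words.

```shell
words=$(printf '%s\n' 'gray' 'grey' 'grab' 'obey')

# Grouped: "gr", then a or e, then "y".
grouped=$(printf '%s\n' "$words" | grep -E 'gr(a|e)y')

# Ungrouped: reads as "gra" OR "ey", which is a much looser pattern.
ungrouped=$(printf '%s\n' "$words" | grep -E 'gra|ey')

printf 'gr(a|e)y matched:\n%s\n\n' "$grouped"
printf 'gra|ey matched:\n%s\n' "$ungrouped"
```

The same rule applies to anchors: in '^ERROR|WARNING' the ^ belongs only to ERROR, while '^(ERROR|WARNING)' anchors both alternatives.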
Backreferences
Capture groups and refer back to them with \1, \2, etc.:
# Find repeated words (BRE -- backreferences work in BRE with grep)
echo -e "the the cat\na big dog" | grep '\([a-z]\+\) \1'
# the the cat
# Note: backreference support in ERE varies by tool.
# grep -E supports them on GNU grep:
echo -e "the the cat\na big dog" | grep -E '([a-z]+) \1'
Backreferences are most useful in sed for search-and-replace (covered in Chapter 21).
Practical Examples
Example 1: Matching IP Addresses
A rough pattern for IPv4 addresses:
# Basic pattern (matches invalid IPs too, like 999.999.999.999)
grep -E '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' /tmp/regex-lab.txt
A more precise pattern (validates 0-255 for each octet):
# Strict IPv4 validation
grep -E '^((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$' /tmp/regex-lab.txt
Let us break this down:
25[0-5] --> matches 250-255
2[0-4][0-9] --> matches 200-249
[01]?[0-9][0-9]? --> matches 0-199
\. --> literal dot
{3} --> repeat the octet+dot pattern 3 times
# Test it
echo "192.168.1.1" | grep -E '^((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$'
# Match
echo "300.400.500.600" | grep -E '^((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$'
# No match (correct!)
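Typing the strict pattern three times invites typos. One possible tidy-up, sketched below, builds it from a single octet definition and wraps the check in a shell function. The names octet, ipv4_re, and is_valid_ipv4 are hypothetical, chosen only for this example.

```shell
# One octet, 0-255, using the same alternation as the strict pattern.
octet='(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'

# Three "octet + literal dot" groups, then a final octet, anchored at both ends.
ipv4_re="^(${octet}\.){3}${octet}\$"

# Hypothetical helper: exit status 0 if the argument is a valid IPv4 address.
is_valid_ipv4() {
    printf '%s\n' "$1" | grep -Eq "$ipv4_re"
}

for candidate in 192.168.1.1 10.0.0.255 300.400.500.600 1.2.3; do
    if is_valid_ipv4 "$candidate"; then
        printf '%-16s valid\n' "$candidate"
    else
        printf '%-16s invalid\n' "$candidate"
    fi
done
```

Note that the [01]?[0-9][0-9]? branch also accepts leading zeros, so a string such as 010.001.000.004 passes this check.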
Example 2: Matching Email Addresses
# Simplified email pattern
grep -E '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' /tmp/regex-lab.txt
Breaking it down:
[a-zA-Z0-9._%+-]+ --> local part (before @): letters, digits, special chars
@ --> literal @
[a-zA-Z0-9.-]+ --> domain name: letters, digits, dots, hyphens
\. --> literal dot
[a-zA-Z]{2,} --> TLD: at least 2 letters
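Combined with -o, the same simplified pattern can pull addresses out of surrounding text. The sketch below uses invented input lines and is not a full email validator.

```shell
email_re='[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

samples=$(printf '%s\n' \
    'john.doe@example.com' \
    'invalid-email@' \
    'contact: bob@test.co.uk today')

# -o prints only the matched text, one match per output line.
found=$(printf '%s\n' "$samples" | grep -Eo "$email_re")

printf '%s\n' "$found"
```

The line invalid-email@ is skipped because nothing follows the @, and the words around bob@test.co.uk are dropped because they fall outside the match.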
Example 3: Matching Log Lines
# Match timestamps like "14:23:45" or "2025-03-10 14:22:01"
grep -E '[0-9]{2}:[0-9]{2}:[0-9]{2}' /tmp/regex-lab.txt
# Match date-time format "YYYY-MM-DD HH:MM:SS"
grep -E '[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}' /tmp/regex-lab.txt
# Match failed SSH attempts; add -o to print only the matched text, including the username
grep -E 'Failed password for [a-zA-Z0-9_]+' /tmp/regex-lab.txt
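grep on its own prints whole lines. To get just the username, one option is to print only the match with -o and then take its last field with awk. This is a sketch using the two sshd lines from the sample file; the variable names are illustrative.

```shell
auth_log=$(printf '%s\n' \
    '2025-03-10 14:22:01 server01 sshd: Failed password for root from 10.0.0.5' \
    '2025-03-10 14:23:15 server01 sshd: Accepted password for alice from 192.168.1.50')

# Step 1: -o keeps only "Failed password for <user>".
# Step 2: awk prints the last whitespace-separated field, the username.
failed_users=$(printf '%s\n' "$auth_log" \
    | grep -Eo 'Failed password for [a-zA-Z0-9_]+' \
    | awk '{print $NF}')

printf 'failed login attempts for: %s\n' "$failed_users"
```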
Example 4: Matching Phone Numbers (Multiple Formats)
# Match various phone formats
grep -E '(\(?[0-9]{3}\)?[-. ]?)?[0-9]{3}[-. ]?[0-9]{4}' /tmp/regex-lab.txt
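Reading the phone pattern piece by piece: the optional group (\(?[0-9]{3}\)?[-. ]?)? is an area code with optional parentheses and an optional separator, followed by three digits, an optional separator, and four digits. The sketch below checks it against the three formats from the sample file plus one invented non-number.

```shell
phone_re='(\(?[0-9]{3}\)?[-. ]?)?[0-9]{3}[-. ]?[0-9]{4}'

samples=$(printf '%s\n' \
    'phone: 555-123-4567' \
    'phone: (555) 123-4567' \
    'phone: 5551234567' \
    'phone: 12-34')

# -o strips the "phone: " prefix and keeps only the number itself.
numbers=$(printf '%s\n' "$samples" | grep -Eo "$phone_re")

printf '%s\n' "$numbers"
```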
grep Options for Regex Work
Essential grep Flags
# -E: Extended regex (always use this)
grep -E 'pattern' file
# -i: Case-insensitive
grep -Ei 'error|warning' /var/log/syslog
# -v: Invert match (show lines that do NOT match)
grep -Ev '^#|^$' /etc/ssh/sshd_config
# Show config without comments or blank lines
# -c: Count matching lines
grep -Ec 'ERROR' logfile
# -n: Show line numbers
grep -En 'TODO' *.py
# -l: Show only filenames (not matching lines)
grep -Erl 'password' /etc/
# -o: Show only the matching part (not the whole line)
grep -Eo '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' /tmp/regex-lab.txt
# -w: Match whole words only
echo -e "cat\ncatalog\nconcat" | grep -w 'cat'
# Only "cat" matches, not "catalog" or "concat"
# -A N: Show N lines AFTER match
grep -EA 2 'ERROR' /tmp/regex-lab.txt
# -B N: Show N lines BEFORE match
grep -EB 2 'ERROR' /tmp/regex-lab.txt
# -C N: Show N lines of context (before and after)
grep -EC 2 'ERROR' /tmp/regex-lab.txt
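One detail worth knowing when combining these flags: -c counts matching lines, not individual matches. To count every occurrence, a common approach is to print each match with -o and count the output lines. A sketch on invented input:

```shell
log=$(printf '%s\n' 'ERROR one ERROR two' 'INFO ok' 'ERROR three')

# -c: number of LINES that contain at least one match.
matching_lines=$(printf '%s\n' "$log" | grep -Ec 'ERROR')

# -o prints each match on its own line; wc -l then counts the matches.
# tr removes the padding some wc implementations add.
total_matches=$(printf '%s\n' "$log" | grep -Eo 'ERROR' | wc -l | tr -d ' ')

printf 'lines=%s matches=%s\n' "$matching_lines" "$total_matches"
```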
Combining grep with Other Tools
# Count unique IP addresses in a log
grep -Eo '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' access.log \
| sort | uniq -c | sort -rn | head -10
# Find all functions in a Python file
grep -En '^def [a-zA-Z_]+' script.py
# Find TODO/FIXME comments across a project
grep -Ern 'TODO|FIXME|HACK|XXX' /opt/myproject/ --include="*.py"
# Find config files containing a specific setting
grep -Erl 'max_connections' /etc/
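The IP-counting pipeline can be tried without a real access.log. The sketch below runs it on a few invented request lines and uses awk to tidy the uniq -c output; the log contents are assumptions made for the example.

```shell
access_log=$(printf '%s\n' \
    '10.0.0.5 GET /api/users' \
    '192.168.1.50 GET /' \
    '10.0.0.5 GET /api/users/7' \
    '127.0.0.1 GET /health' \
    '10.0.0.5 POST /login' \
    '192.168.1.50 GET /about')

# Extract the leading IP, count duplicates, sort by count (highest first),
# then print "address count" without the padding that uniq -c adds.
top_ips=$(printf '%s\n' "$access_log" \
    | grep -Eo '^[0-9]{1,3}(\.[0-9]{1,3}){3}' \
    | sort | uniq -c | sort -rn \
    | awk '{print $2, $1}')

printf '%s\n' "$top_ips"
```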
Hands-On: Regex Practice
Step 1: Setup
# Create a practice log file
cat > /tmp/practice.log << 'LOG'
2025-03-10 08:00:01 INFO Application started on port 8080
2025-03-10 08:00:02 INFO Connected to database at 10.0.1.50:5432
2025-03-10 08:15:33 WARN High memory usage: 82%
2025-03-10 08:30:00 INFO Processed 1500 requests in 60s
2025-03-10 09:00:01 ERROR Connection refused to 10.0.1.50:5432
2025-03-10 09:00:05 ERROR Retry 1/3: Connection refused
2025-03-10 09:00:10 ERROR Retry 2/3: Connection refused
2025-03-10 09:00:15 ERROR Retry 3/3: Connection refused
2025-03-10 09:00:15 FATAL All retries exhausted, shutting down
2025-03-10 09:01:00 INFO Application restarted by systemd
2025-03-10 09:01:01 INFO Connected to database at 10.0.1.50:5432
2025-03-10 10:45:22 WARN Slow query detected: 2340ms
2025-03-10 11:00:00 INFO Health check: OK
2025-03-10 12:30:45 ERROR Invalid input from user_id=42: "Robert'); DROP TABLE users;--"
2025-03-10 13:00:00 INFO Backup completed: /var/backups/db-20250310.sql.gz (2.3GB)
LOG
Step 2: Practice Queries
# 1. Find all ERROR and FATAL lines
grep -E '(ERROR|FATAL)' /tmp/practice.log
# 2. Find all IP addresses
grep -Eo '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' /tmp/practice.log
# 3. Find lines with percentage values
grep -E '[0-9]+%' /tmp/practice.log
# 4. Find timestamps between 09:00 and 10:00 (the leading space pins 09 to the hour field)
grep -E ' 09:[0-9]{2}:[0-9]{2}' /tmp/practice.log
# 5. Find retry messages and extract the attempt number
grep -Eo 'Retry [0-9]+/[0-9]+' /tmp/practice.log
# 6. Find lines that do NOT contain INFO
grep -Ev 'INFO' /tmp/practice.log
# 7. Find the SQL injection attempt
grep -E "DROP TABLE" /tmp/practice.log
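The same building blocks can summarize a log as well as search it. As a sketch, the following counts lines per log level using a short excerpt of the practice log, inlined here so the example is self-contained.

```shell
log_excerpt=$(printf '%s\n' \
    '2025-03-10 08:00:01 INFO Application started on port 8080' \
    '2025-03-10 08:15:33 WARN High memory usage: 82%' \
    '2025-03-10 09:00:01 ERROR Connection refused to 10.0.1.50:5432' \
    '2025-03-10 09:00:05 ERROR Retry 1/3: Connection refused' \
    '2025-03-10 09:00:15 FATAL All retries exhausted, shutting down')

# Match "date time LEVEL" at the start of each line, keep the third field,
# then count how often each level occurs.
level_counts=$(printf '%s\n' "$log_excerpt" \
    | grep -Eo '^[0-9-]{10} [0-9:]{8} [A-Z]+' \
    | awk '{print $3}' \
    | sort | uniq -c \
    | awk '{print $2 "=" $1}')

printf '%s\n' "$level_counts"
```

Anchoring the pattern to the timestamp keeps an uppercase word inside a message from being counted as a level.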
Debug This: Why Doesn't My Regex Match?
You write this command, and it prints the line you expected:
grep -E "Failed password for .+ from [0-9]+.[0-9]+.[0-9]+.[0-9]+" /tmp/regex-lab.txt
Problem: The . in the IP pattern matches any character, not just a literal dot.
The command finds the right line here, but the pattern is too permissive: it also
accepts text that is not an IP address at all. The bug stays hidden until the input
contains something that only looks similar:
# Common mistake: forgetting to escape dots in IP patterns
echo "192x168x1x1" | grep -E '192.168.1.1'
# MATCHES! Because . means "any character"
echo "192x168x1x1" | grep -E '192\.168\.1\.1'
# No match (correct -- dots must be literal)
Lesson: When matching literal dots, periods, or other metacharacters, always escape
them with \.
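When the whole search string is literal, grep -F (fixed strings) is an alternative to escaping, since it treats every character as itself. The sketch below compares the three variants on the same two-line input.

```shell
input=$(printf '%s\n' '192x168x1x1' '192.168.1.1')

# Unescaped dots: each . matches any character, so both lines match.
unescaped=$(printf '%s\n' "$input" | grep -Ec '192.168.1.1')

# Escaped dots: only the real address matches.
escaped=$(printf '%s\n' "$input" | grep -Ec '192\.168\.1\.1')

# -F: no regex interpretation at all, so no escaping is needed.
fixed=$(printf '%s\n' "$input" | grep -Fc '192.168.1.1')

printf 'unescaped=%s escaped=%s fixed=%s\n' "$unescaped" "$escaped" "$fixed"
```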
Common regex debugging tips:
- Start with a simpler pattern and gradually add complexity
- Use grep -o to see exactly what is matching
- Test on simple input first, then scale to real data
- Remember to escape metacharacters when you want their literal form
- Check BRE vs ERE -- are you using grep or grep -E?
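The first two tips work well together. The sketch below grows an IP pattern in three steps and uses grep -o after each step to show what the pattern currently captures; the input line is a shortened version of the sshd sample.

```shell
line='Failed password for root from 10.0.0.5'

# Step 1: any run of digits. Four separate matches.
step1=$(printf '%s\n' "$line" | grep -Eo '[0-9]+')

# Step 2: two runs joined by a literal dot. Two matches, still not an address.
step2=$(printf '%s\n' "$line" | grep -Eo '[0-9]+\.[0-9]+')

# Step 3: one run followed by three ".run" groups. The whole address.
step3=$(printf '%s\n' "$line" | grep -Eo '[0-9]+(\.[0-9]+){3}')

printf 'step 1:\n%s\n\nstep 2:\n%s\n\nstep 3:\n%s\n' "$step1" "$step2" "$step3"
```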
Quick Reference
+------------------------------------------------------------+
|                REGEX QUICK REFERENCE (ERE)                 |
+------------------------------------------------------------+
|                                                            |
|  .         Any character (except newline)                  |
|  ^         Start of line                                   |
|  $         End of line                                     |
|  *         Zero or more of preceding                       |
|  +         One or more of preceding                        |
|  ?         Zero or one of preceding                        |
|  {n}       Exactly n of preceding                          |
|  {n,m}     Between n and m of preceding                    |
|  [abc]     One character from set                          |
|  [^abc]    One character NOT in set                        |
|  [a-z]     Character range                                 |
|  (abc)     Group                                           |
|  a|b       Alternation (a or b)                            |
|  \1        Backreference to group 1                        |
|  \.        Literal dot (escape metacharacters with \)      |
|                                                            |
|  [:alpha:]  Letters        [:digit:]  Digits               |
|  [:alnum:]  Alphanumeric   [:space:]  Whitespace           |
|  [:upper:]  Uppercase      [:lower:]  Lowercase            |
+------------------------------------------------------------+
What Just Happened?
+------------------------------------------------------------------+
| CHAPTER 20 RECAP |
+------------------------------------------------------------------+
| |
| - Regular expressions match patterns in text |
| - BRE (basic) vs ERE (extended) -- use grep -E for ERE |
| - . matches any character; use \. for literal dot |
| - ^ and $ anchor to line start/end |
| - [abc] character classes; [^abc] negated classes |
| - *, +, ?, {n,m} are quantifiers |
| - ( ) groups patterns; | provides alternation |
| - grep -o shows only matching text |
| - grep -E 'pattern' is your go-to for searching |
| - Always escape metacharacters when matching literally |
| - Build patterns incrementally: start simple, add detail |
| |
+------------------------------------------------------------------+
Try This
Exercise 1: Log Analysis
Using /tmp/practice.log, write regex patterns to:
- Extract all timestamps (HH:MM:SS format)
- Find lines where processing took more than 1000ms
- Extract all file paths (starting with /)
Exercise 2: Data Validation
Write regex patterns that validate:
- A date in YYYY-MM-DD format
- A 24-hour time in HH:MM format
- A US ZIP code (5 digits, optionally followed by dash and 4 digits)
Test each with echo "test-string" | grep -E 'pattern'.
Exercise 3: Config File Cleaning
Take /etc/ssh/sshd_config (or any config file with comments) and use grep -Ev to
remove all comment lines (starting with #) and blank lines in a single command.
Exercise 4: Multi-Pattern Search
Write a single grep -E command that finds all lines in /tmp/regex-lab.txt containing
either an IP address, an email address, or a phone number.
Bonus Challenge
Write a regex that matches a valid MAC address in the format AA:BB:CC:DD:EE:FF (where
each pair is a hexadecimal value). Test it against both valid and invalid MAC addresses.
Then modify it to also accept dashes (AA-BB-CC-DD-EE-FF) as separators.