Text Processing Toolkit
Why This Matters
The power of Linux lies not in any single tool, but in how tools combine. Each utility
in this chapter does one thing well: sort sorts, uniq deduplicates, cut extracts
columns, tr translates characters. Alone, each is simple. Piped together, they become
a data processing pipeline that can rival purpose-built programs.
Need to find the top 10 most active IP addresses in a web server log? That is awk,
sort, uniq -c, and head piped together. Need to compare two configuration files
to see what changed? That is diff. Need to run a command on every file matching a
pattern? That is xargs.
This chapter is your reference and training ground for the essential text processing utilities. Master these tools and their combinations, and you will solve most data problems without ever writing a script.
Try This Right Now
# Create a sample data file
cat > /tmp/toolkit-data.txt << 'DATA'
banana
apple
cherry
banana
date
apple
elderberry
banana
fig
apple
cherry
date
grape
DATA
# Sort, count duplicates, show top 3
sort /tmp/toolkit-data.txt | uniq -c | sort -rn | head -3
# 3 banana
# 3 apple
# 2 cherry
# (order among equal counts depends on sort's tie-breaking and may differ)
# One-liner: find the 5 most common words in a file
tr -s '[:space:]' '\n' < /etc/services | tr '[:upper:]' '[:lower:]' | \
sort | uniq -c | sort -rn | head -5
sort: Ordering Lines
sort arranges lines in order. It is far more powerful than just alphabetical sorting.
# Alphabetical sort (default)
sort /tmp/toolkit-data.txt
# Reverse order
sort -r /tmp/toolkit-data.txt
# Numeric sort (-n)
echo -e "10\n2\n100\n20\n1" | sort -n
# 1 2 10 20 100
# Without -n, "10" comes before "2" (lexicographic)
echo -e "10\n2\n100\n20\n1" | sort
# 1 10 100 2 20
# Human-readable numeric sort (-h): handles K, M, G suffixes
echo -e "1G\n500M\n2G\n100K" | sort -h
# 100K 500M 1G 2G
# Sort by specific field
echo -e "Alice 30\nBob 25\nCharlie 35" | sort -k2 -n
# Bob 25 Alice 30 Charlie 35
# Sort by multiple keys
echo -e "A 3\nB 1\nA 1\nB 3" | sort -k1,1 -k2,2n
# A 1 A 3 B 1 B 3
# Remove duplicates while sorting (-u)
echo -e "banana\napple\nbanana\ncherry" | sort -u
# apple banana cherry
# Case-insensitive sort (-f)
echo -e "Banana\napple\nCherry" | sort -f
# apple Banana Cherry
# Sort CSV by 3rd column (comma-separated)
sort -t, -k3 -n file.csv
Key Specification: -k
The -k flag specifies which field to sort on:
# -k2 Sort on field 2 through end of line
# -k2,2 Sort on field 2 only
# -k2,2n Sort on field 2, numerically
# -k2,2nr Sort on field 2, numerically, reversed
# -k1,1 -k3,3n Sort on field 1 (alpha), then field 3 (numeric)
# Practical: sort /etc/passwd by UID (field 3)
sort -t: -k3,3n /etc/passwd | head -5
uniq: Removing Duplicates
uniq removes adjacent duplicate lines. This means you almost always need to
sort first.
# Remove adjacent duplicates (sort first!)
sort /tmp/toolkit-data.txt | uniq
# Count occurrences (-c)
sort /tmp/toolkit-data.txt | uniq -c
# 3 apple
# 3 banana
# 2 cherry
# 2 date
# 1 elderberry
# 1 fig
# 1 grape
# Show only duplicated lines (-d)
sort /tmp/toolkit-data.txt | uniq -d
# apple banana cherry date
# Show only unique lines (appearing exactly once) (-u)
sort /tmp/toolkit-data.txt | uniq -u
# elderberry fig grape
# Case-insensitive (-i)
echo -e "Apple\napple\nAPPLE" | sort | uniq -i
# (one line per case-insensitive group remains; which spelling survives depends on locale)
The sort | uniq -c | sort -rn pattern is so common it deserves its own shorthand
in your memory:
# "Count and rank" pattern -- you will use this constantly
some_command | sort | uniq -c | sort -rn | head -10
cut: Extracting Columns
cut extracts specific columns or fields from each line.
# Extract by character position
echo "Hello World" | cut -c1-5
# Hello
echo "Hello World" | cut -c7-
# World
# Extract by delimiter and field
echo "root:x:0:0:root:/root:/bin/bash" | cut -d: -f1
# root
echo "root:x:0:0:root:/root:/bin/bash" | cut -d: -f1,7
# root:/bin/bash
echo "root:x:0:0:root:/root:/bin/bash" | cut -d: -f1,3-5
# root:0:0:root
# Extract from CSV
echo "Alice,30,Engineering" | cut -d, -f1,3
# Alice,Engineering
Practical Uses
# Get all usernames
cut -d: -f1 /etc/passwd
# Get all shells in use
cut -d: -f7 /etc/passwd | sort | uniq -c | sort -rn
# Extract columns from space-delimited output
# (cut works poorly with multiple spaces -- use awk instead)
df -h | cut -c1-20,45-
Think About It: When would you choose cut over awk for extracting fields? Hint: think about when the input is cleanly delimited versus when fields are separated by variable whitespace.
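To make the hint concrete, here is a quick comparison (a sketch; the input strings are made up):

```shell
# cut treats EVERY delimiter occurrence as a field boundary, so a run
# of spaces creates empty fields:
echo "alpha   beta" | cut -d' ' -f2
# (prints an empty line -- field 2 is the empty string between spaces)

# awk splits on runs of whitespace by default, so it does what you expect:
echo "alpha   beta" | awk '{print $2}'
# beta

# cut is the simpler choice when the delimiter is a single reliable character:
echo "alpha:beta:gamma" | cut -d: -f2
# beta
```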
paste: Merging Lines Side by Side
paste joins lines from multiple files or merges consecutive lines.
# Merge two files side by side
echo -e "Alice\nBob\nCharlie" > /tmp/names.txt
echo -e "30\n25\n35" > /tmp/ages.txt
paste /tmp/names.txt /tmp/ages.txt
# Alice 30
# Bob 25
# Charlie 35
# Custom delimiter
paste -d, /tmp/names.txt /tmp/ages.txt
# Alice,30
# Bob,25
# Charlie,35
# Merge all lines into one (serial mode)
echo -e "one\ntwo\nthree" | paste -sd,
# one,two,three
# Merge every N lines (using - as stdin placeholder)
echo -e "1\n2\n3\n4\n5\n6" | paste - - -
# 1 2 3
# 4 5 6
rm /tmp/names.txt /tmp/ages.txt
tr: Translating Characters
tr translates (replaces) or deletes characters. It works on characters, not strings.
# Replace lowercase with uppercase
echo "hello world" | tr 'a-z' 'A-Z'
# HELLO WORLD
# Replace uppercase with lowercase
echo "HELLO WORLD" | tr 'A-Z' 'a-z'
# hello world
# Replace spaces with newlines (one word per line)
echo "one two three" | tr ' ' '\n'
# one
# two
# three
# Squeeze repeated characters (-s)
echo "hello    world" | tr -s ' '
# hello world
# Delete characters (-d)
echo "Hello, World! 123" | tr -d '[:digit:]'
# Hello, World!
echo "Hello, World! 123" | tr -d '[:punct:]'
# Hello World 123
# Replace non-alphanumeric with underscores
echo "file name (2).txt" | tr -c '[:alnum:].\n' '_'
# file_name__2_.txt
# Squeeze multiple newlines into one
cat file_with_blanks.txt | tr -s '\n'
# Remove carriage returns (Windows line endings)
tr -d '\r' < windows-file.txt > unix-file.txt
Character Classes for tr
| Class | Characters |
|---|---|
| [:alpha:] | Letters |
| [:digit:] | Digits |
| [:alnum:] | Letters and digits |
| [:upper:] | Uppercase letters |
| [:lower:] | Lowercase letters |
| [:space:] | Whitespace |
| [:punct:] | Punctuation |
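These classes combine well with -d and -c. As a sketch, here is a common trick for generating a quick random token (the byte count 16 is arbitrary):

```shell
# -c complements the set and -d deletes, so together they keep ONLY
# alphanumerics. Feeding from /dev/urandom yields a 16-character token:
tr -dc '[:alnum:]' < /dev/urandom | head -c 16; echo
# (output differs every run)
```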
wc: Counting
wc (word count) counts lines, words, and characters.
# All three counts
wc /etc/hosts
# 12 35 338 /etc/hosts
# lines words bytes
# Lines only (-l)
wc -l /etc/passwd
# 35 /etc/passwd
# Words only (-w)
wc -w /etc/hosts
# Characters only (-c for bytes, -m for characters)
wc -c /etc/hosts
wc -m /etc/hosts
# Count from a pipeline
ps aux | wc -l
# Multiple files
wc -l /etc/passwd /etc/group /etc/hosts
head and tail: Beginning and End
# First 10 lines (default)
head /etc/passwd
# First N lines
head -n 5 /etc/passwd
head -5 /etc/passwd # Shorthand
# All but the last N lines
head -n -5 /etc/passwd # Everything except last 5
# Last 10 lines (default)
tail /etc/passwd
# Last N lines
tail -n 5 /etc/passwd
tail -5 /etc/passwd # Shorthand
# Starting from line N
tail -n +5 /etc/passwd # From line 5 to end
# Follow a file in real time (-f)
tail -f /var/log/syslog
# Follow and retry if file is recreated (-F)
tail -F /var/log/nginx/access.log
Extracting a Range of Lines
# Lines 10-20 of a file
sed -n '10,20p' file
# Or with head and tail
head -20 file | tail -11
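awk offers a third option via its built-in line counter NR, which is handy when you also want to stop reading early on huge files (a sketch):

```shell
# Lines 10-20, selected by awk's line-number variable NR
awk 'NR >= 10 && NR <= 20' file

# Same range, but exit as soon as the range is passed
# (avoids reading the rest of a very large file)
awk 'NR > 20 { exit } NR >= 10' file
```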
tee: Split Output
tee writes output to both stdout and one or more files:
# Save output while also displaying it
ls -la /etc | tee /tmp/etc-listing.txt
# Append instead of overwrite
echo "new entry" | tee -a /tmp/log.txt
# Write to multiple files
echo "data" | tee file1.txt file2.txt file3.txt
# Use in a pipeline (save intermediate results)
ps aux | tee /tmp/all-processes.txt | grep nginx | tee /tmp/nginx-processes.txt | wc -l
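tee also solves a classic permissions puzzle. In sudo command > file, the redirection is performed by your unprivileged shell before sudo ever runs, so writing to a root-owned file fails. Piping into sudo tee works because tee itself opens the file as root (a sketch; the /etc/hosts line is illustrative only):

```shell
# Fails: YOUR shell opens /etc/hosts for writing, without root
# sudo echo "127.0.0.1 myhost" >> /etc/hosts    # Permission denied

# Works: tee runs under sudo and opens the file itself
echo "127.0.0.1 myhost" | sudo tee -a /etc/hosts > /dev/null
```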
diff: Comparing Files
diff shows the differences between two files.
# Create two similar files
echo -e "line 1\nline 2\nline 3" > /tmp/file1.txt
echo -e "line 1\nline TWO\nline 3\nline 4" > /tmp/file2.txt
# Normal diff
diff /tmp/file1.txt /tmp/file2.txt
# 2c2
# < line 2
# ---
# > line TWO
# 3a4
# > line 4
# Unified diff (-u) -- most readable format
diff -u /tmp/file1.txt /tmp/file2.txt
# --- /tmp/file1.txt
# +++ /tmp/file2.txt
# @@ -1,3 +1,4 @@
# line 1
# -line 2
# +line TWO
# line 3
# +line 4
# Side-by-side (-y)
diff -y /tmp/file1.txt /tmp/file2.txt
# line 1 line 1
# line 2 | line TWO
# line 3 line 3
# > line 4
# Just tell me if they differ (exit code)
diff -q /tmp/file1.txt /tmp/file2.txt
# Files /tmp/file1.txt and /tmp/file2.txt differ
# Recursive diff on directories
diff -r /etc/ssh/ /tmp/ssh-backup/
# Color diff (if available)
diff --color /tmp/file1.txt /tmp/file2.txt
rm /tmp/file1.txt /tmp/file2.txt
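Unified diffs are also the input format for the patch tool, so a diff can be saved and replayed later. A minimal sketch (assumes patch is installed; it recreates the files from above):

```shell
echo -e "line 1\nline 2\nline 3" > /tmp/file1.txt
echo -e "line 1\nline TWO\nline 3\nline 4" > /tmp/file2.txt

# Save the difference as a patch, then apply it to file1
diff -u /tmp/file1.txt /tmp/file2.txt > /tmp/changes.patch
patch /tmp/file1.txt < /tmp/changes.patch

# The files are now identical: diff -q prints nothing and exits 0
diff -q /tmp/file1.txt /tmp/file2.txt && echo "files match"

rm /tmp/file1.txt /tmp/file2.txt /tmp/changes.patch
```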
comm: Compare Sorted Files
comm compares two sorted files and shows three columns:
- Lines only in file 1
- Lines only in file 2
- Lines in both files
echo -e "apple\nbanana\ncherry" > /tmp/a.txt
echo -e "banana\ncherry\ndate" > /tmp/b.txt
comm /tmp/a.txt /tmp/b.txt
# apple
# 		banana
# 		cherry
# 	date
# (file-2-only lines are indented one tab, common lines two tabs)
# Show only lines unique to file 1
comm -23 /tmp/a.txt /tmp/b.txt
# apple
# Show only lines unique to file 2
comm -13 /tmp/a.txt /tmp/b.txt
# date
# Show only lines in common
comm -12 /tmp/a.txt /tmp/b.txt
# banana
# cherry
rm /tmp/a.txt /tmp/b.txt
join: Database-Style Joins
join merges two sorted files on a common field, like an SQL JOIN:
echo -e "1 Alice\n2 Bob\n3 Charlie" > /tmp/users.txt
echo -e "1 Engineering\n2 Marketing\n3 Engineering" > /tmp/depts.txt
join /tmp/users.txt /tmp/depts.txt
# 1 Alice Engineering
# 2 Bob Marketing
# 3 Charlie Engineering
# Join on different fields
echo -e "Alice 1\nBob 2\nCharlie 3" > /tmp/users2.txt
join -1 2 -2 1 /tmp/users2.txt /tmp/depts.txt
# 1 Alice Engineering
# 2 Bob Marketing
# 3 Charlie Engineering
rm /tmp/users.txt /tmp/depts.txt /tmp/users2.txt
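join also accepts a field separator with -t, so clean CSV data can be joined directly. A sketch with made-up files (both inputs must be sorted on the join field):

```shell
echo -e "1,Alice\n2,Bob" > /tmp/users.csv
echo -e "1,Engineering\n2,Marketing" > /tmp/depts.csv
join -t, /tmp/users.csv /tmp/depts.csv
# 1,Alice,Engineering
# 2,Bob,Marketing
rm /tmp/users.csv /tmp/depts.csv
```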
xargs: Building Commands from Input
xargs reads items from stdin and passes them as arguments to a command. It is the
bridge between output and execution.
# Basic: pass lines as arguments
echo -e "file1.txt\nfile2.txt\nfile3.txt" | xargs ls -l
# Find and delete (-v makes rm report each file it removes)
find /tmp -name "*.bak" -print | xargs rm -v
# Null-delimited (handles spaces in filenames)
find /tmp -name "*.log" -print0 | xargs -0 ls -l
# Run command for each item individually (-I)
echo -e "alice\nbob\ncharlie" | xargs -I {} echo "Hello, {}!"
# Hello, alice!
# Hello, bob!
# Hello, charlie!
# Limit number of arguments per command (-n)
echo -e "1\n2\n3\n4\n5\n6" | xargs -n 2 echo
# 1 2
# 3 4
# 5 6
# Parallel execution (-P)
echo -e "1\n2\n3\n4" | xargs -P 4 -I {} sh -c 'sleep 1; echo "Done: {}"'
# All four complete in ~1 second instead of ~4
# Prompt before executing (-p)
echo "important-file.txt" | xargs -p rm
# Practical: grep across files found by find
find /etc -name "*.conf" -print0 2>/dev/null | xargs -0 grep -l "port" 2>/dev/null
WARNING: Without -print0 and -0, xargs breaks on filenames with spaces, quotes, or backslashes. Always use the null-delimiter pair for robustness.
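To see the failure mode yourself, this sketch creates a file whose name contains a space:

```shell
mkdir -p /tmp/xargs-demo
touch "/tmp/xargs-demo/my file.log"

# Broken: whitespace splits the single filename into two arguments,
# so ls is asked for ".../my" and "file.log", neither of which exists
find /tmp/xargs-demo -name "*.log" | xargs ls -l

# Robust: -print0/-0 delimit on NUL bytes, which cannot appear in filenames
find /tmp/xargs-demo -name "*.log" -print0 | xargs -0 ls -l

rm -r /tmp/xargs-demo
```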
Hands-On: Combining Tools
The real power emerges when you combine these tools in pipelines.
Setup
cat > /tmp/weblog.txt << 'LOG'
10.0.0.1 GET /index.html 200 1234
10.0.0.2 POST /api/login 200 567
10.0.0.1 GET /api/users 200 8901
10.0.0.3 GET /index.html 200 1234
10.0.0.2 GET /api/users 404 123
10.0.0.1 POST /api/orders 500 234
10.0.0.4 GET /index.html 200 1234
10.0.0.1 GET /api/users 200 8901
10.0.0.2 GET /api/products 200 5678
10.0.0.3 POST /api/orders 201 890
10.0.0.1 GET /api/users 200 8901
10.0.0.5 GET /index.html 200 1234
10.0.0.2 DELETE /api/users/42 403 98
10.0.0.1 GET /favicon.ico 404 0
LOG
Pipeline 1: Top 5 IP Addresses by Request Count
awk '{print $1}' /tmp/weblog.txt | sort | uniq -c | sort -rn | head -5
Output:
6 10.0.0.1
4 10.0.0.2
2 10.0.0.3
1 10.0.0.5
1 10.0.0.4
Pipeline 2: Most Requested Endpoints
awk '{print $3}' /tmp/weblog.txt | sort | uniq -c | sort -rn
Pipeline 3: Error Requests (4xx and 5xx)
awk '$4 >= 400 {print $4, $1, $2, $3}' /tmp/weblog.txt | sort
Pipeline 4: Total Bytes Transferred by Endpoint
awk '{bytes[$3] += $5} END {for(ep in bytes) print bytes[ep], ep}' /tmp/weblog.txt | sort -rn
Pipeline 5: Unique IPs per Endpoint
awk '{print $3, $1}' /tmp/weblog.txt | sort -u | awk '{print $1}' | sort | uniq -c | sort -rn
Pipeline 6: Find Large Files and Their Total Size
find /var/log -type f -name "*.log" 2>/dev/null | xargs du -sh 2>/dev/null | sort -rh | head -10
Think About It: Look at Pipeline 5. Why does it need more than one sort? What would happen if we removed the first sort -u?
Debug This: Pipeline Producing Wrong Results
You try to count how many users use each shell in /etc/passwd:
cut -d: -f7 /etc/passwd | uniq -c
The output shows every shell with a count of 1, which is wrong. You know /bin/bash
appears multiple times.
Problem: uniq only removes adjacent duplicates. Without sorting first, it
compares each line only to the previous line.
Fix:
cut -d: -f7 /etc/passwd | sort | uniq -c | sort -rn
Now you see the correct counts.
Quick Reference
+------------------------------------------------------------------+
| TOOL | PURPOSE | KEY FLAGS |
+------------------------------------------------------------------+
| sort | Order lines | -n, -r, -k, -t, -u, -h |
| uniq | Deduplicate | -c, -d, -u, -i |
| cut | Extract columns | -d, -f, -c |
| paste | Merge lines | -d, -s |
| tr | Translate chars | -d, -s, -c |
| wc | Count | -l, -w, -c, -m |
| head | First N lines | -n |
| tail | Last N lines | -n, -f, -F, -n +N |
| tee | Split output | -a |
| diff | Compare files | -u, -y, -r, -q |
| comm | Compare sorted | -1, -2, -3, -12, -23, -13 |
| join | Merge on field | -1, -2, -t |
| xargs | Build commands | -I, -0, -n, -P, -p |
+------------------------------------------------------------------+
What Just Happened?
+------------------------------------------------------------------+
| CHAPTER 23 RECAP |
+------------------------------------------------------------------+
| |
| - sort | uniq -c | sort -rn is the "count and rank" pattern |
| - cut extracts columns by delimiter; awk handles variable |
| whitespace better |
| - tr translates or deletes characters (not strings) |
| - diff -u shows differences in unified format |
| - xargs converts stdin to command arguments |
| - Always use find -print0 | xargs -0 for safe file handling |
| - tee saves output while passing it through the pipeline |
| - paste merges files or lines side by side |
| - comm compares sorted files (unique to each, common to both) |
| - The real power is in combining tools with pipes |
| |
+------------------------------------------------------------------+
Try This
Exercise 1: Word Frequency
Take any text file (like /usr/share/common-licenses/GPL-3 if available, or download
one) and find the 20 most frequently used words. Use tr, sort, uniq, and head.
Exercise 2: Log Analysis Pipeline
Using /tmp/weblog.txt:
- Find the IP that made the most POST requests
- Find the endpoint with the highest error rate (4xx/5xx)
- Calculate the average bytes per request
Exercise 3: Comparing Configurations
Copy /etc/ssh/sshd_config to /tmp/sshd_config_modified. Make three changes to the
copy (uncomment a line, change a value, add a new line). Use diff -u to create a
patch, then explore comm to see the differences.
Exercise 4: Batch Operations with xargs
Find all .conf files under /etc and use xargs to count the total number of
non-comment, non-empty lines across all of them.
find /etc -name "*.conf" -print0 2>/dev/null | \
xargs -0 grep -v '^#' 2>/dev/null | \
grep -v '^$' | wc -l
Bonus Challenge
Write a single pipeline (no scripts, no temporary files) that reads /etc/passwd and
produces a formatted table showing: shells in the first column, the count of users per
shell in the second column, and the usernames in the third column (comma-separated).
Sort by count, descending. This combines cut, sort, awk, paste, and more.