Text Processing Toolkit

Why This Matters

The power of Linux lies not in any single tool, but in how tools combine. Each utility in this chapter does one thing well: sort sorts, uniq deduplicates, cut extracts columns, tr translates characters. Alone, each is simple. Piped together, they become a data processing pipeline that can rival purpose-built programs.

Need to find the top 10 most active IP addresses in a web server log? That is awk, sort, uniq -c, and head piped together. Need to compare two configuration files to see what changed? That is diff. Need to run a command on every file matching a pattern? That is xargs.

This chapter is your reference and training ground for the essential text processing utilities. Master these tools and their combinations, and you will solve most data problems without ever writing a script.


Try This Right Now

# Create a sample data file
cat > /tmp/toolkit-data.txt << 'DATA'
banana
apple
cherry
banana
date
apple
elderberry
banana
fig
apple
cherry
date
grape
DATA

# Sort, count duplicates, show top 3
sort /tmp/toolkit-data.txt | uniq -c | sort -rn | head -3
#   3 banana
#   3 apple
#   2 cherry

# One-liner: find the 5 most common words in a file
tr -s '[:space:]' '\n' < /etc/services | tr '[:upper:]' '[:lower:]' | \
    sort | uniq -c | sort -rn | head -5

sort: Ordering Lines

sort arranges lines in order. It is far more powerful than just alphabetical sorting.

# Alphabetical sort (default)
sort /tmp/toolkit-data.txt

# Reverse order
sort -r /tmp/toolkit-data.txt

# Numeric sort (-n)
echo -e "10\n2\n100\n20\n1" | sort -n
# 1  2  10  20  100

# Without -n, "10" comes before "2" (lexicographic)
echo -e "10\n2\n100\n20\n1" | sort
# 1  10  100  2  20

# Human-readable numeric sort (-h): handles K, M, G suffixes
echo -e "1G\n500M\n2G\n100K" | sort -h
# 100K  500M  1G  2G

# Sort by specific field
echo -e "Alice 30\nBob 25\nCharlie 35" | sort -k2 -n
# Bob 25  Alice 30  Charlie 35

# Sort by multiple keys
echo -e "A 3\nB 1\nA 1\nB 3" | sort -k1,1 -k2,2n
# A 1  A 3  B 1  B 3

# Remove duplicates while sorting (-u)
echo -e "banana\napple\nbanana\ncherry" | sort -u
# apple  banana  cherry

# Case-insensitive sort (-f)
echo -e "Banana\napple\nCherry" | sort -f
# apple  Banana  Cherry

# Sort CSV by 3rd column (comma-separated)
sort -t, -k3 -n file.csv

Key Specification: -k

The -k flag specifies which field to sort on:

# -k2          Sort on field 2 through end of line
# -k2,2        Sort on field 2 only
# -k2,2n       Sort on field 2, numerically
# -k2,2nr      Sort on field 2, numerically, reversed
# -k1,1 -k3,3n Sort on field 1 (alpha), then field 3 (numeric)
# Practical: sort /etc/passwd by UID (field 3)
sort -t: -k3,3n /etc/passwd | head -5
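
The difference between -k2 and -k2,2 only shows up when field 2 ties. A quick sketch, using GNU sort's -s (stable) flag so the last-resort whole-line comparison does not hide the effect:

# With -k2,2 the key is field 2 alone; -s keeps ties in input order
echo -e "x 1 b\nx 1 a" | sort -s -k2,2
# x 1 b
# x 1 a

# With -k2 the key runs to end of line, so "1 b" vs "1 a" breaks the tie
echo -e "x 1 b\nx 1 a" | sort -s -k2
# x 1 a
# x 1 b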

uniq: Removing Duplicates

uniq removes adjacent duplicate lines. This means you almost always need to sort first.

# Remove adjacent duplicates (sort first!)
sort /tmp/toolkit-data.txt | uniq

# Count occurrences (-c)
sort /tmp/toolkit-data.txt | uniq -c
#   3 apple
#   3 banana
#   2 cherry
#   2 date
#   1 elderberry
#   1 fig
#   1 grape

# Show only duplicated lines (-d)
sort /tmp/toolkit-data.txt | uniq -d
# apple  banana  cherry  date

# Show only unique lines (appearing exactly once) (-u)
sort /tmp/toolkit-data.txt | uniq -u
# elderberry  fig  grape

# Case-insensitive (-i)
echo -e "Apple\napple\nAPPLE" | sort | uniq -i
# One line survives per group; which variant (APPLE, Apple, or apple)
# you see depends on your locale's sort order

The sort | uniq -c | sort -rn pattern is so common it deserves its own shorthand in your memory:

# "Count and rank" pattern -- you will use this constantly
some_command | sort | uniq -c | sort -rn | head -10

cut: Extracting Columns

cut extracts specific columns or fields from each line.

# Extract by character position
echo "Hello World" | cut -c1-5
# Hello

echo "Hello World" | cut -c7-
# World

# Extract by delimiter and field
echo "root:x:0:0:root:/root:/bin/bash" | cut -d: -f1
# root

echo "root:x:0:0:root:/root:/bin/bash" | cut -d: -f1,7
# root:/bin/bash

echo "root:x:0:0:root:/root:/bin/bash" | cut -d: -f1,3-5
# root:0:0:root

# Extract from CSV
echo "Alice,30,Engineering" | cut -d, -f1,3
# Alice,Engineering

Practical Uses

# Get all usernames
cut -d: -f1 /etc/passwd

# Get all shells in use
cut -d: -f7 /etc/passwd | sort | uniq -c | sort -rn

# Extract columns from space-delimited output
# (cut works poorly with multiple spaces -- use awk instead)
df -h | cut -c1-20,45-
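
To see that failure mode concretely: a run of spaces makes cut produce empty fields, while awk splits on any whitespace run.

echo "a   b" | cut -d' ' -f2     # empty: field 2 lies between two spaces
echo "a   b" | awk '{print $2}'  # b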

Think About It: When would you choose cut over awk for extracting fields? Hint: think about when the input is cleanly delimited versus when fields are separated by variable whitespace.


paste: Merging Lines Side by Side

paste joins lines from multiple files or merges consecutive lines.

# Merge two files side by side
echo -e "Alice\nBob\nCharlie" > /tmp/names.txt
echo -e "30\n25\n35" > /tmp/ages.txt
paste /tmp/names.txt /tmp/ages.txt
# Alice	30
# Bob	25
# Charlie	35

# Custom delimiter
paste -d, /tmp/names.txt /tmp/ages.txt
# Alice,30
# Bob,25
# Charlie,35

# Merge all lines into one (serial mode)
echo -e "one\ntwo\nthree" | paste -sd,
# one,two,three

# Merge every N lines (using - as stdin placeholder)
echo -e "1\n2\n3\n4\n5\n6" | paste - - -
# 1	2	3
# 4	5	6
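
A classic serial-mode trick: join numbers with + and hand the expression to bc for evaluation (assuming bc is installed):

# paste builds "1+2+3+4+5"; bc computes it
seq 1 5 | paste -sd+ | bc
# 15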

rm /tmp/names.txt /tmp/ages.txt

tr: Translating Characters

tr translates (replaces) or deletes characters. It works on characters, not strings.

# Replace lowercase with uppercase
echo "hello world" | tr 'a-z' 'A-Z'
# HELLO WORLD

# Replace uppercase with lowercase
echo "HELLO WORLD" | tr 'A-Z' 'a-z'
# hello world

# Replace spaces with newlines (one word per line)
echo "one two three" | tr ' ' '\n'
# one
# two
# three

# Squeeze repeated characters (-s)
echo "hello     world" | tr -s ' '
# hello world

# Delete characters (-d)
echo "Hello, World! 123" | tr -d '[:digit:]'
# Hello, World!

echo "Hello, World! 123" | tr -d '[:punct:]'
# Hello World 123

# Replace non-alphanumeric with underscores
echo "file name (2).txt" | tr -c '[:alnum:].\n' '_'
# file_name__2_.txt

# Squeeze multiple newlines into one
tr -s '\n' < file_with_blanks.txt

# Remove carriage returns (Windows line endings)
tr -d '\r' < windows-file.txt > unix-file.txt

Character Classes for tr

Class       Characters
[:alpha:]   Letters
[:digit:]   Digits
[:alnum:]   Letters and digits
[:upper:]   Uppercase
[:lower:]   Lowercase
[:space:]   Whitespace
[:punct:]   Punctuation

wc: Counting

wc (word count) counts lines, words, and characters.

# All three counts
wc /etc/hosts
#   12   35  338 /etc/hosts
#  lines words bytes

# Lines only (-l)
wc -l /etc/passwd
# 35 /etc/passwd

# Words only (-w)
wc -w /etc/hosts

# Characters only (-c for bytes, -m for characters)
wc -c /etc/hosts
wc -m /etc/hosts

# Count from a pipeline
ps aux | wc -l

# Multiple files
wc -l /etc/passwd /etc/group /etc/hosts
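
In scripts you usually want the bare number. Redirect the input so wc never sees a filename and prints only the count:

# No filename in the output -- just the number
wc -l < /etc/passwd
# 35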

head and tail: Beginning and End

# First 10 lines (default)
head /etc/passwd

# First N lines
head -n 5 /etc/passwd
head -5 /etc/passwd          # Shorthand

# All but the last N lines
head -n -5 /etc/passwd       # Everything except last 5

# Last 10 lines (default)
tail /etc/passwd

# Last N lines
tail -n 5 /etc/passwd
tail -5 /etc/passwd          # Shorthand

# Starting from line N
tail -n +5 /etc/passwd       # From line 5 to end

# Follow a file in real time (-f)
tail -f /var/log/syslog

# Follow and retry if file is recreated (-F)
tail -F /var/log/nginx/access.log

Extracting a Range of Lines

# Lines 10-20 of a file
sed -n '10,20p' file

# Or with head and tail
head -20 file | tail -11
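
An equivalent awk one-liner, if you prefer a single process:

# NR is the current line number
awk 'NR >= 10 && NR <= 20' file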

tee: Split Output

tee writes output to both stdout and one or more files:

# Save output while also displaying it
ls -la /etc | tee /tmp/etc-listing.txt

# Append instead of overwrite
echo "new entry" | tee -a /tmp/log.txt

# Write to multiple files
echo "data" | tee file1.txt file2.txt file3.txt

# Use in a pipeline (save intermediate results)
ps aux | tee /tmp/all-processes.txt | grep nginx | tee /tmp/nginx-processes.txt | wc -l
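
One place tee earns its keep: writing to root-owned files. A redirection like sudo echo x > /etc/hosts fails because your unprivileged shell, not sudo, opens the file; tee runs under sudo and opens it with the right privileges (the host entry here is just an example):

# tee -a appends as root; > /dev/null silences the echo-back
echo "127.0.0.1 myhost" | sudo tee -a /etc/hosts > /dev/null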

diff: Comparing Files

diff shows the differences between two files.

# Create two similar files
echo -e "line 1\nline 2\nline 3" > /tmp/file1.txt
echo -e "line 1\nline TWO\nline 3\nline 4" > /tmp/file2.txt

# Normal diff
diff /tmp/file1.txt /tmp/file2.txt
# 2c2
# < line 2
# ---
# > line TWO
# 3a4
# > line 4

# Unified diff (-u) -- most readable format
diff -u /tmp/file1.txt /tmp/file2.txt
# --- /tmp/file1.txt
# +++ /tmp/file2.txt
# @@ -1,3 +1,4 @@
#  line 1
# -line 2
# +line TWO
#  line 3
# +line 4

# Side-by-side (-y)
diff -y /tmp/file1.txt /tmp/file2.txt
# line 1            line 1
# line 2          | line TWO
# line 3            line 3
#                 > line 4

# Just tell me if they differ (exit code)
diff -q /tmp/file1.txt /tmp/file2.txt
# Files /tmp/file1.txt and /tmp/file2.txt differ

# Recursive diff on directories
diff -r /etc/ssh/ /tmp/ssh-backup/

# Color diff (if available)
diff --color /tmp/file1.txt /tmp/file2.txt
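
Unified diffs double as patches. A sketch of the round trip, assuming the patch utility is installed:

# Capture the differences, then apply them to bring file1 up to date
diff -u /tmp/file1.txt /tmp/file2.txt > /tmp/fix.patch
patch /tmp/file1.txt < /tmp/fix.patch
rm /tmp/fix.patch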

rm /tmp/file1.txt /tmp/file2.txt

comm: Compare Sorted Files

comm compares two sorted files and shows three columns:

  1. Lines only in file 1
  2. Lines only in file 2
  3. Lines in both files

echo -e "apple\nbanana\ncherry" > /tmp/a.txt
echo -e "banana\ncherry\ndate" > /tmp/b.txt

comm /tmp/a.txt /tmp/b.txt
# apple
# 		banana
# 		cherry
# 	date

# Show only lines unique to file 1
comm -23 /tmp/a.txt /tmp/b.txt
# apple

# Show only lines unique to file 2
comm -13 /tmp/a.txt /tmp/b.txt
# date

# Show only lines in common
comm -12 /tmp/a.txt /tmp/b.txt
# banana
# cherry
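
If the inputs are not already sorted, bash process substitution can sort them on the fly (the file names here are placeholders):

# comm sees two sorted streams without any temporary files
comm -12 <(sort unsorted-a.txt) <(sort unsorted-b.txt)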

rm /tmp/a.txt /tmp/b.txt

join: Database-Style Joins

join merges two sorted files on a common field, like an SQL JOIN:

echo -e "1 Alice\n2 Bob\n3 Charlie" > /tmp/users.txt
echo -e "1 Engineering\n2 Marketing\n3 Engineering" > /tmp/depts.txt

join /tmp/users.txt /tmp/depts.txt
# 1 Alice Engineering
# 2 Bob Marketing
# 3 Charlie Engineering

# Join on different fields
echo -e "Alice 1\nBob 2\nCharlie 3" > /tmp/users2.txt
join -1 2 -2 1 /tmp/users2.txt /tmp/depts.txt
# 1 Alice Engineering
# 2 Bob Marketing
# 3 Charlie Engineering
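
Like an SQL inner join, join drops unmatched lines by default. The -a flag keeps them, giving a left outer join -- sketched here against the same department file:

echo -e "1 Alice\n2 Bob\n4 Dave" > /tmp/users3.txt
join -a 1 /tmp/users3.txt /tmp/depts.txt
# 1 Alice Engineering
# 2 Bob Marketing
# 4 Dave
rm /tmp/users3.txt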

rm /tmp/users.txt /tmp/depts.txt /tmp/users2.txt

xargs: Building Commands from Input

xargs reads items from stdin and passes them as arguments to a command. It is the bridge between output and execution.

# Basic: pass lines as arguments
echo -e "file1.txt\nfile2.txt\nfile3.txt" | xargs ls -l

# Find and delete (-v reports each removal; see the warning below)
find /tmp -name "*.bak" -print | xargs rm -v

# Null-delimited (handles spaces in filenames)
find /tmp -name "*.log" -print0 | xargs -0 ls -l

# Run command for each item individually (-I)
echo -e "alice\nbob\ncharlie" | xargs -I {} echo "Hello, {}!"
# Hello, alice!
# Hello, bob!
# Hello, charlie!

# Limit number of arguments per command (-n)
echo -e "1\n2\n3\n4\n5\n6" | xargs -n 2 echo
# 1 2
# 3 4
# 5 6

# Parallel execution (-P)
echo -e "1\n2\n3\n4" | xargs -P 4 -I {} sh -c 'sleep 1; echo "Done: {}"'
# All four complete in ~1 second instead of ~4

# Prompt before executing (-p)
echo "important-file.txt" | xargs -p rm

# Practical: grep across files found by find
find /etc -name "*.conf" -print0 2>/dev/null | xargs -0 grep -l "port" 2>/dev/null

WARNING: Without -print0 and -0, xargs breaks on filenames with spaces, quotes, or backslashes. Always use the null-delimiter pair for robustness.
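
A related habit: preview a destructive xargs command by prefixing it with echo, so it prints what it would run instead of running it:

# Dry run -- shows the rm invocations without executing them
find /tmp -name "*.bak" -print0 | xargs -0 echo rm -v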


Hands-On: Combining Tools

The real power emerges when you combine these tools in pipelines.

Setup

cat > /tmp/weblog.txt << 'LOG'
10.0.0.1 GET /index.html 200 1234
10.0.0.2 POST /api/login 200 567
10.0.0.1 GET /api/users 200 8901
10.0.0.3 GET /index.html 200 1234
10.0.0.2 GET /api/users 404 123
10.0.0.1 POST /api/orders 500 234
10.0.0.4 GET /index.html 200 1234
10.0.0.1 GET /api/users 200 8901
10.0.0.2 GET /api/products 200 5678
10.0.0.3 POST /api/orders 201 890
10.0.0.1 GET /api/users 200 8901
10.0.0.5 GET /index.html 200 1234
10.0.0.2 DELETE /api/users/42 403 98
10.0.0.1 GET /favicon.ico 404 0
LOG

Pipeline 1: Top 5 IP Addresses by Request Count

awk '{print $1}' /tmp/weblog.txt | sort | uniq -c | sort -rn | head -5

Output:

      6 10.0.0.1
      4 10.0.0.2
      2 10.0.0.3
      1 10.0.0.5
      1 10.0.0.4

Pipeline 2: Most Requested Endpoints

awk '{print $3}' /tmp/weblog.txt | sort | uniq -c | sort -rn

Pipeline 3: Error Requests (4xx and 5xx)

awk '$4 >= 400 {print $4, $1, $2, $3}' /tmp/weblog.txt | sort

Pipeline 4: Total Bytes Transferred by Endpoint

awk '{bytes[$3] += $5} END {for(ep in bytes) print bytes[ep], ep}' /tmp/weblog.txt | sort -rn

Pipeline 5: Unique IPs per Endpoint

awk '{print $3, $1}' /tmp/weblog.txt | sort -u | awk '{print $1}' | sort | uniq -c | sort -rn

Pipeline 6: Find Large Files and Their Total Size

find /var/log -type f -name "*.log" 2>/dev/null | xargs du -sh 2>/dev/null | sort -rh | head -10

Think About It: Look at Pipeline 5. It runs sort three times. What is each one for, and what would happen if we removed the first sort -u?


Debug This: Pipeline Producing Wrong Results

You try to count how many users are assigned each login shell in /etc/passwd:

cut -d: -f7 /etc/passwd | uniq -c

The output shows every shell with a count of 1, which is wrong. You know /bin/bash appears multiple times.

Problem: uniq only removes adjacent duplicates. Without sorting first, it compares each line only to the previous line.

Fix:

cut -d: -f7 /etc/passwd | sort | uniq -c | sort -rn

Now you see the correct counts.


Quick Reference

+-----------------------------------------------------------+
|  TOOL   |  PURPOSE          |  KEY FLAGS                  |
+-----------------------------------------------------------+
|  sort   |  Order lines      |  -n, -r, -k, -t, -u, -h     |
|  uniq   |  Deduplicate      |  -c, -d, -u, -i             |
|  cut    |  Extract columns  |  -d, -f, -c                 |
|  paste  |  Merge lines      |  -d, -s                     |
|  tr     |  Translate chars  |  -d, -s, -c                 |
|  wc     |  Count            |  -l, -w, -c, -m             |
|  head   |  First N lines    |  -n                         |
|  tail   |  Last N lines     |  -n, -n +N, -f, -F          |
|  tee    |  Split output     |  -a                         |
|  diff   |  Compare files    |  -u, -y, -r, -q             |
|  comm   |  Compare sorted   |  -1, -2, -3, -12, -23, -13  |
|  join   |  Merge on field   |  -1, -2, -t                 |
|  xargs  |  Build commands   |  -I, -0, -n, -P, -p         |
+-----------------------------------------------------------+

What Just Happened?

+------------------------------------------------------------------+
|                         CHAPTER 23 RECAP                         |
+------------------------------------------------------------------+
|                                                                  |
|  - sort | uniq -c | sort -rn is the "count and rank" pattern     |
|  - cut extracts columns by delimiter; awk handles variable       |
|    whitespace better                                             |
|  - tr translates or deletes characters (not strings)             |
|  - diff -u shows differences in unified format                   |
|  - xargs converts stdin to command arguments                     |
|  - Always use find -print0 | xargs -0 for safe file handling     |
|  - tee saves output while passing it through the pipeline        |
|  - paste merges files or lines side by side                      |
|  - comm compares sorted files (unique to each, common to both)   |
|  - The real power is in combining tools with pipes               |
|                                                                  |
+------------------------------------------------------------------+

Try This

Exercise 1: Word Frequency

Take any text file (like /usr/share/common-licenses/GPL-3 if available, or download one) and find the 20 most frequently used words. Use tr, sort, uniq, and head.

Exercise 2: Log Analysis Pipeline

Using /tmp/weblog.txt:

  • Find the IP that made the most POST requests
  • Find the endpoint with the highest error rate (4xx/5xx)
  • Calculate the average bytes per request

Exercise 3: Comparing Configurations

Copy /etc/ssh/sshd_config to /tmp/sshd_config_modified. Make three changes to the copy (uncomment a line, change a value, add a new line). Use diff -u to create a patch, then explore comm to see the differences.

Exercise 4: Batch Operations with xargs

Find all .conf files under /etc and use xargs to count the total number of non-comment, non-empty lines across all of them.

find /etc -name "*.conf" -print0 2>/dev/null | \
    xargs -0 grep -hv '^#' 2>/dev/null | \
    grep -v '^$' | wc -l

The -h flag matters: without it, grep prefixes each line with its filename when reading multiple files, so originally empty lines come out as "filename:" and the '^$' filter never matches.

Bonus Challenge

Write a single pipeline (no scripts, no temporary files) that reads /etc/passwd and produces a formatted table showing: shells in the first column, the count of users per shell in the second column, and the usernames in the third column (comma-separated). Sort by count, descending. This combines cut, sort, awk, paste, and more.