Cut, Sort, and Uniq
Extract columns with cut, sort data numerically and by key with sort, count duplicates with uniq -c, and compose multi-step pipelines.
- Extract specific columns from delimited files with cut
- Sort lines alphabetically, numerically, by column, and in reverse
- Count and deduplicate lines with uniq and uniq -c
- Compose cut, sort, and uniq into multi-step analysis pipelines
cut, sort, and uniq each do one small job — and in combination, they solve a large class of text analysis problems. They complement grep, sed, and awk by being simpler and faster for the cases they're designed for: extracting fixed columns, ordering lines, and counting duplicates. Mastering the combination is the difference between a script that builds a result line by line and one that expresses the entire computation as a single, readable pipeline.
cut: extract columns
cut extracts columns from structured text. The two most common uses are cutting by character position and cutting by delimiter:
# By delimiter: -d specifies the delimiter, -f the field number(s)
cut -d: -f1 /etc/passwd # first field (username)
cut -d: -f1,7 /etc/passwd # first and seventh fields
cut -d, -f2-4 data.csv # fields 2 through 4 from a CSV
# By character position
cut -c1-10 file.txt # first 10 characters of each line
cut -c5- file.txt # from character 5 to end of line
cut is faster than awk for simple column extraction and its syntax makes the intent instantly clear.
cut only supports single-character delimiters. If your data uses a multi-character delimiter (such as ", " or "::"), you need awk instead, for example awk -F'::'. Also, cut does not reorder fields: it always outputs them left to right, regardless of the order you list them in -f. Use awk for reordering.
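Both limitations are easy to see by running cut and awk side by side. The sketch below uses a made-up record (alice, 1001, /bin/bash are hypothetical values, not taken from any real file) and only the -d, -f, and -F options already discussed.

```shell
# Hypothetical sample record: name:uid:shell
record='alice:1001:/bin/bash'

# cut ignores the order given to -f and emits fields left to right,
# so asking for "3,1" still prints field 1 first.
cut_out=$(printf '%s\n' "$record" | cut -d: -f3,1)
echo "$cut_out"

# awk prints fields in whatever order you ask for.
awk_out=$(printf '%s\n' "$record" | awk -F: '{print $3 ":" $1}')
echo "$awk_out"

# A two-character delimiter such as "::" needs awk,
# because cut -d accepts only a single character.
multi_out=$(printf 'alice::1001::/bin/bash\n' | awk -F'::' '{print $2}')
echo "$multi_out"
```

If the field order in the output matters, or the separator is longer than one character, reaching for awk is usually the simpler choice than post-processing cut's output.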
sort: order lines
sort by default sorts lexicographically (alphabetical, treating everything as text):
sort names.txt # ascending alphabetical
sort -r names.txt # descending
sort -u names.txt # sort and remove duplicates (unique)
For numeric and structured data, the flags matter:
sort -n numbers.txt # numeric sort (9 before 10; plain text order would put "10" before "9")
sort -rn numbers.txt # numeric, descending
# Sort by a specific column: -k col.char,col.char
sort -k2 data.txt # key starts at the second whitespace-delimited field and runs to end of line
sort -k2,2n data.txt # sort by second field numerically
sort -t: -k3,3n /etc/passwd # colon-delimited, sort by uid (field 3)
The -k flag takes a start and end position: -k2,2n means "field 2, start of field to end of field, numeric". Without the ,2 end specifier, sort treats everything from field 2 to end of line as the sort key.
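The effect of the end specifier is easiest to see on a tiny input. The following sketch sorts the same three hypothetical rows (label, count, tag) three ways; LC_ALL=C is set here only as an assumption to make text comparison byte-based and repeatable across systems.

```shell
# Hypothetical three-column sample: label, count, tag
data=$(printf 'a 10 x\nb 9 y\nc 10 a\n')

# -k2: the key runs from field 2 to end of line and is compared as text,
# so the third column decides the order of the two "10" rows.
open_key=$(printf '%s\n' "$data" | LC_ALL=C sort -k2)

# -k2,2: the key is field 2 only, still compared as text ("10" sorts before "9");
# rows with equal keys fall back to a whole-line comparison.
text_key=$(printf '%s\n' "$data" | LC_ALL=C sort -k2,2)

# -k2,2n: the key is field 2 only, compared as a number (9 before 10).
num_key=$(printf '%s\n' "$data" | LC_ALL=C sort -k2,2n)

printf '%s\n---\n%s\n---\n%s\n' "$open_key" "$text_key" "$num_key"
```

The first result starts with the "c" row, the second with the "a" row, and only the third puts the "b" row (count 9) first, which is usually what a "sort by the second column" request means.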
uniq: count and deduplicate
uniq removes or counts consecutive duplicate lines. Because it only looks at adjacent lines, it almost always follows sort:
sort names.txt | uniq # remove duplicates
sort names.txt | uniq -c # count occurrences
sort names.txt | uniq -d # print one copy of each line that appears more than once
sort names.txt | uniq -u # print only unique (non-duplicate) lines
The most common pattern is sort | uniq -c | sort -rn, which counts occurrences and ranks them by frequency:
# Most common HTTP status codes in an access log
awk '{print $9}' access.log | sort | uniq -c | sort -rn | head -10
# Most common words in a file
tr '[:upper:]' '[:lower:]' < essay.txt | tr -cs '[:alpha:]' '\n' | \
sort | uniq -c | sort -rn | head -20
Composing multi-step pipelines
These three tools combine into a standard idiom for data exploration:
# Top 5 users by number of processes
ps aux | awk 'NR > 1 {print $1}' | sort | uniq -c | sort -rn | head -5 # NR > 1 skips the header row
# Files modified in the last 7 days, sorted by allocated size in blocks (largest first)
find . -mtime -7 -type f -exec ls -s {} + 2>/dev/null | sort -rn | head -10
# Unique IP addresses accessing a web server
cut -d' ' -f1 access.log | sort -u
# Distribution of file extensions under src/
find src/ -type f | sed 's/.*\.//' | sort | uniq -c | sort -rn
The standard pipeline structure for analysis: extract field → sort → uniq -c → sort -rn → head -N. Once you internalize this, many data questions become a matter of plugging in the right extraction step at the front.
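As a worked instance of that structure, the sketch below runs the full pipeline on six made-up log lines (the IPs and paths are hypothetical), then repeats it without the first sort to show why that step cannot be dropped.

```shell
# Hypothetical access-log lines: the client IP is the first space-separated field.
log=$(printf '%s\n' \
  '10.0.0.1 GET /index.html' \
  '10.0.0.2 GET /about.html' \
  '10.0.0.1 GET /contact.html' \
  '10.0.0.3 GET /index.html' \
  '10.0.0.1 GET /index.html' \
  '10.0.0.2 GET /index.html')

# extract field -> sort -> uniq -c -> sort -rn -> head -N
# Reports 10.0.0.1 with a count of 3, then 10.0.0.2 with a count of 2.
top=$(printf '%s\n' "$log" | cut -d' ' -f1 | sort | uniq -c | sort -rn | head -2)
echo "$top"

# Same data without the first sort: uniq only merges adjacent lines,
# and no two neighbouring IPs match here, so all six lines survive.
unsorted=$(printf '%s\n' "$log" | cut -d' ' -f1 | uniq -c | wc -l | tr -d ' ')
echo "$unsorted"
```

The exact padding in front of each count differs between uniq implementations, so scripts that parse this output are safer reading it with awk than relying on fixed column positions.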
Check your understanding
- 1. You want the third field from a colon-delimited file. Which command is correct?
- 2. Why does sort | uniq -c produce correct results, but uniq -c alone on an unsorted file may not?
- 3. True or false: sorting a file of numbers with sort (without -n) will correctly order 2, 10, 20 as 2, 10, 20.
Do it yourself
# Extract unique shells from /etc/passwd and count each
cut -d: -f7 /etc/passwd | sort | uniq -c | sort -rn
# Simulate a log and find top "IPs"
printf "10.0.0.1\n10.0.0.2\n10.0.0.1\n10.0.0.3\n10.0.0.1\n10.0.0.2\n" | \
sort | uniq -c | sort -rn
# Sort /etc/passwd by UID numerically (field 3)
sort -t: -k3,3n /etc/passwd | cut -d: -f1,3 | head -10
Where to go next
You've completed the Text processing module — grep, sed, awk, cut, sort, and uniq. The lab is next for hands-on reinforcement, then the Advanced tier opens up: arrays, parameter expansion, heredocs, traps, and automation tools that put everything together in production-grade scripts.