Lab: Text processing
Hands-on quiz challenges covering regex flavours, sed address ranges, awk field splitting, and pipeline composition.
- Identify correct grep regex flavours and flags
- Reason about sed address ranges and in-place editing portability
- Predict awk field-splitting and NR/NF behaviour
- Compose multi-step pipelines for common analysis tasks
This lab consolidates the Text processing module. Work through the questions, then practise the pipeline challenges in a real terminal.
grep and regular expressions
- 1. Which grep command finds lines that contain exactly one digit followed by a letter, using ERE?
- 2. grep -v "ERROR" app.log prints only the lines that do NOT contain "ERROR".
- 3. You run grep -r "secret" . in a large repo. It is slow and prints many binary file matches. What is the best fix?
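To check your answers to the three grep questions, a small scratch experiment can help. The sketch below is only one reading of the questions: the file names (app.log, notes.txt, blob.bin) and the .git directory are made up for the demo, and the -I and --exclude-dir options are assumed to be available (both GNU and BSD grep provide them, but they are not in POSIX).

```shell
# Build a throwaway tree so nothing outside it is touched.
tmp=$(mktemp -d)
mkdir "$tmp/.git"
printf '%s\n' 'a1b' 'abc' '12' 'ERROR 7x' > "$tmp/app.log"
printf 'the secret plan\n' > "$tmp/notes.txt"
printf 'secret\0\n' > "$tmp/blob.bin"        # the NUL byte makes grep treat this file as binary
printf 'secret ref\n' > "$tmp/.git/packed"

# -E selects ERE: a single digit immediately followed by a letter.
digit_letter=$(grep -E '[0-9][a-zA-Z]' "$tmp/app.log")

# -v inverts the match: only lines WITHOUT "ERROR" are printed.
no_error=$(grep -v 'ERROR' "$tmp/app.log")

# -I skips binary files, --exclude-dir prunes a directory, -l lists file names only.
hits=$(cd "$tmp" && grep -rIl --exclude-dir=.git 'secret' .)

printf '%s\n' "$digit_letter" "$no_error" "$hits"
rm -rf "$tmp"
```

Dropping -I or --exclude-dir from the last command and rerunning it shows which files each option was filtering out.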
sed address ranges and in-place editing
- 1. The sed command sed "/START/,/END/d" file.txt does what?
- 2. Which sed -i invocation works on BOTH GNU/Linux and macOS?
- 3. sed -n "10,20p" file.txt prints lines 10 through 20 and also prints all other lines.
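The sed questions can be verified the same way. This is a minimal sketch on made-up files (file.txt and nums.txt are created in a temporary directory). It assumes that attaching the backup suffix directly to -i, as in -i.bak, is the form accepted by both GNU sed and the BSD sed shipped with macOS.

```shell
tmp=$(mktemp -d)
printf '%s\n' keep1 START drop END keep2 > "$tmp/file.txt"
awk 'BEGIN { for (i = 1; i <= 30; i++) print i }' > "$tmp/nums.txt"

# Address range: delete every line from the first /START/ match through the next /END/ match, inclusive.
range_deleted=$(sed '/START/,/END/d' "$tmp/file.txt")

# -n turns off automatic printing, so only the lines selected by p are emitted.
printed=$(sed -n '10,20p' "$tmp/nums.txt" | awk 'END { print NR }')

# In-place edit with a backup suffix attached to -i (no space between them).
sed -i.bak 's/keep/KEPT/' "$tmp/file.txt"
edited_first=$(head -n 1 "$tmp/file.txt")
backup_first=$(head -n 1 "$tmp/file.txt.bak")

printf '%s\n' "$range_deleted" "$printed" "$edited_first" "$backup_first"
rm -rf "$tmp"
```

The backup file keeps the original content, which is why backup_first still holds the unedited first line.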
awk field splitting and logic
- 1. awk -F, '{print $2}' data.csv: if a field value is "John, Jr.", how many fields does awk see on that line?
- 2. What does awk 'END { print NR }' file.txt print?
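A quick experiment makes both awk questions concrete. The sample line below is invented for illustration. Note that the awk programs are single-quoted so the shell does not expand $2 before awk sees it.

```shell
# A CSV-style line whose second value contains a comma inside quotes (hypothetical data).
line='1,"John, Jr.",NY'

# awk -F, splits on every comma; it has no notion of CSV quoting.
nf=$(printf '%s\n' "$line" | awk -F, '{ print NF }')
second=$(printf '%s\n' "$line" | awk -F, '{ print $2 }')

# In an END block, NR holds the number of records read in total.
total=$(printf '%s\n' a b c | awk 'END { print NR }')

printf 'fields=%s second=%s records=%s\n' "$nf" "$second" "$total"
```

The quoted value is split in two, so the second field comes back as a fragment that still carries the opening quote.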
Pipeline composition
- 1. You want to find the 3 most common values in the first column of a space-delimited log file. Which pipeline is correct?
- 2. sort -u is equivalent to sort | uniq in all cases.
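Both pipeline questions can be tried on a tiny invented log. The sketch below assumes one common way to build a frequency count (extract the column, sort, count, sort by count), and the file name access.log is hypothetical. The second half shows one case where sort -u and sort | uniq differ: with a key option, sort -u compares only the key, while uniq compares whole lines.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'c x' 'a x' 'c y' 'b x' 'a y' 'c z' 'd x' 'b y' 'a z' 'c w' > "$tmp/access.log"

# Frequency count: take column 1, group identical values, count them, rank by count.
top3=$(awk '{ print $1 }' "$tmp/access.log" | sort | uniq -c | sort -rn | head -3)
names=$(printf '%s\n' "$top3" | awk '{ print $2 }')

# With -k, sort -u treats lines with equal keys as duplicates...
key_u=$(printf '1 a\n1 b\n' | sort -u -k1,1 | awk 'END { print NR }')
# ...whereas uniq only drops adjacent lines that are identical in full.
key_uniq=$(printf '1 a\n1 b\n' | sort -k1,1 | uniq | awk 'END { print NR }')

printf '%s\n' "$top3"
printf 'sort -u -k1,1 kept %s line(s); sort -k1,1 | uniq kept %s\n' "$key_u" "$key_uniq"
rm -rf "$tmp"
```

Without key or comparison options such as -k, -n, or -f, the two forms produce the same output.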
Do it yourself
Work through these pipeline challenges in your terminal:
# Challenge 1: top 5 shells used in /etc/passwd
cut -d: -f7 /etc/passwd | sort | uniq -c | sort -rn | head -5
# Challenge 2: find all unique words starting with a capital letter in /etc/hosts
grep -oE '[A-Z][a-zA-Z]+' /etc/hosts | sort -u
# Challenge 3: list the 3 directories under /usr (up to two levels deep) that directly contain the most files
find /usr -maxdepth 2 -type d 2>/dev/null |
while IFS= read -r d; do
    # Count only regular files sitting directly inside $d, not in its subdirectories
    count=$(find "$d" -maxdepth 1 -type f 2>/dev/null | wc -l)
    echo "$count $d"
done | sort -rn | head -3
# Challenge 4: extract the port numbers from the first 20 lines of /etc/services (field 2, then take digits before /)
# NF >= 2 skips blank lines and lines without a port/protocol field
head -20 /etc/services | grep -v "^#" | awk 'NF >= 2 {print $2}' | cut -d/ -f1 | sort -nu
Where to go next
You've mastered the core Unix text-processing toolkit. The Advanced tier is next: arrays, parameter expansion, heredocs, traps, and automation topics like cron, Makefiles, CI scripts, and debugging techniques.