Lab: real-world applications
Four exercises on realistic messy text — log error extraction, URL parsing, price normalisation, and pattern decomposition.
- Extract 4xx errors with timestamps and paths from a server log
- Parse href URLs from an HTML snippet using regex (and explain the parser alternative)
- Normalise price strings to floats using a pipeline approach
- Decompose a complex single pattern into two sequential simpler ones
Optional lab. These exercises work with the kind of messy, inconsistent real-world text that automated agents and scripts encounter every day. Each checkpoint includes a solution and explanation. Try to write the solution yourself before reading ahead.
Warm up — anatomy of a log line
Before writing any patterns, explore what the data looks like:
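The lab's sample data is not reproduced here, so the line below is a hypothetical stand-in in Apache Common Log Format, built around the timestamp, method, path and status used in the Checkpoint 1 example. The IP address and byte count are made up for illustration. A minimal sketch of pulling the line apart with small, single-purpose patterns:

```javascript
// Hypothetical sample line (assumed Apache Common Log Format, not the lab's data).
const sampleLine =
  '203.0.113.7 - - [12/Jun/2024:08:00:02 +0000] "POST /login HTTP/1.1" 401 512';

// Timestamp: everything between the square brackets.
const timestamp = sampleLine.match(/\[([^\]]+)\]/)[1];

// Request: everything between the first pair of double quotes.
const request = sampleLine.match(/"([^"]*)"/)[1];

// Status: the three digits that follow the closing quote of the request.
const status = sampleLine.match(/" (\d{3}) /)[1];

console.log({ timestamp, request, status });
// { timestamp: '12/Jun/2024:08:00:02 +0000', request: 'POST /login HTTP/1.1', status: '401' }
```

Looking at each field separately first makes it easier to decide which delimiters the final pattern can rely on.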
Checkpoint 1 — extract 4xx errors
Extract all lines with 4xx status codes. For each, return an object with
{ timestamp, method, path, status }. The timestamp is the content inside […].
Write extractErrors(logs) that takes an array of Apache log line strings and returns an array of objects { timestamp, method, path, status } for every line with a 4xx status code (400-499).
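One possible solution, sketched under the assumption that every line follows Apache Common Log Format (timestamp in brackets, then the quoted request, then the status). The name LOG_LINE is illustrative, and lines that do not fit the assumed shape are skipped rather than reported.

```javascript
// Assumed line shape:
// 203.0.113.7 - - [12/Jun/2024:08:00:02 +0000] "POST /login HTTP/1.1" 401 512
// Groups: 1 = timestamp, 2 = method, 3 = path, 4 = status (4xx only).
const LOG_LINE = /\[([^\]]+)\]\s+"([A-Z]+)\s+(\S+)[^"]*"\s+(4\d{2})\b/;

function extractErrors(logs) {
  const errors = [];
  for (const line of logs) {
    const m = LOG_LINE.exec(line);
    if (m === null) continue; // not a 4xx line, or not in the assumed format
    const [, timestamp, method, path, status] = m;
    errors.push({ timestamp, method, path, status });
  }
  return errors;
}
```

Restricting the status group to 4\d{2} lets the pattern do the filtering, so no separate numeric comparison is needed. The status stays a string, matching the expected output.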
extractErrors([...]) on a 401 line → [{ timestamp: '12/Jun/2024:08:00:02 +0000', method: 'POST', path: '/login', status: '401' }]
Checkpoint 2 — extract href URLs from HTML
Extract all href attribute values from anchor tags in an HTML snippet. Then
explain (in a comment) why you would use a real parser in production.
Write extractHrefs(html) that returns an array of all href attribute values found in anchor tags. Handle both double-quoted and single-quoted values. Return an empty array if none found. (This works for simple inputs — the next step in production is a real HTML parser.)
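One possible solution, a sketch that assumes simple, well-formed input with quoted attribute values. The name HREF is illustrative. A backreference makes sure the closing quote is the same character as the opening one.

```javascript
// Group 1 captures the opening quote; \1 requires the same quote to close the value.
// Group 2 is the URL. \shref avoids matching attributes such as data-href.
const HREF = /<a\b[^>]*?\shref\s*=\s*(["'])(.*?)\1/gi;

function extractHrefs(html) {
  // matchAll yields one match array per anchor tag; an input with no
  // matches produces an empty array.
  return Array.from(html.matchAll(HREF), (m) => m[2]);
}

// In production, prefer a real HTML parser: a regex cannot track comments,
// nesting, or every legal attribute syntax.
```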
extractHrefs('<a href="https://example.com">link</a>') → ['https://example.com']
The regex approach above works for simple, well-formed HTML. In production, it can
mishandle whitespace around =, attributes split across multiple lines,
values containing escaped quotes, and links inside HTML comments (which it
reports even though they are not live). For any real
HTML document, use DOMParser (browser) or cheerio/node-html-parser
(Node.js). The regex version is acceptable for controlled, machine-generated
output where you own the format.
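To make the limits concrete, here is a small sketch that runs a simple href pattern against two inputs it handles badly. The pattern and the helper name hrefs are assumptions for illustration. A different regex would fail on different inputs, which is the underlying problem.

```javascript
// An illustrative simple pattern (assumption), wrapped in a helper.
const HREF = /<a\b[^>]*?\shref\s*=\s*(["'])(.*?)\1/gi;
const hrefs = (html) => Array.from(html.matchAll(HREF), (m) => m[2]);

// A commented-out link is not part of the rendered page, but the regex reports it.
const commented = '<!-- <a href="/old">old</a> --> <a href="/new">new</a>';
console.log(hrefs(commented)); // [ '/old', '/new' ]

// An unquoted attribute value is valid HTML, but this pattern requires quotes.
const unquoted = '<a href=/plain>plain</a>';
console.log(hrefs(unquoted)); // []
```

A parser resolves both cases because it builds the document tree before you query it.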
Checkpoint 3 — price normalisation pipeline
Extract all price-like strings from a product catalogue and normalise them to
floats. Prices appear in formats like $12.99, $7.50, $24, 12.99 USD.
Write extractPrices(text) that returns an array of numbers (as JavaScript floats) for every price-like value in text. Prices are: a dollar sign followed by digits and optional decimal ($12.99, $7, $0.50), OR digits with optional decimal followed by ' USD' (12.99 USD, 7 USD). Return results in order of appearance.
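One possible solution written as a three-step pipeline: find candidates, strip the currency decoration, convert to numbers. This is a sketch under the rules stated above. It does not handle thousands separators or other currencies, and the name PRICE is illustrative.

```javascript
// Step 1 pattern: "$" + digits + optional decimals, OR digits + optional decimals + " USD".
const PRICE = /\$\d+(?:\.\d+)?|\d+(?:\.\d+)? USD\b/g;

function extractPrices(text) {
  // Step 1: find candidate strings in order of appearance.
  const candidates = text.match(PRICE) ?? [];
  return candidates
    // Step 2: drop everything that is not a digit or a decimal point.
    .map((s) => s.replace(/[^\d.]/g, ''))
    // Step 3: convert the cleaned string to a float.
    .map((s) => Number.parseFloat(s));
}
```

Keeping the steps separate means a new input format usually changes only the first pattern, not the conversion logic.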
extractPrices('Widget $12.99 each') → [12.99]
extractPrices('$3.99 and 5 USD') → [3.99, 5]
Checkpoint 4 — decompose a complex pattern
The function below uses one large pattern to extract a user:password pair from a connection string. Rewrite it as two sequential simpler patterns and explain the trade-off in a comment.
Write parseCredentials(connStr) that extracts { user, password } from a connection string like 'postgres://alice:s3cr3t@db.host:5432/mydb'. Use two separate patterns: one to extract the user:password section, then another to split it into user and password. Return null if the format doesn't match.
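One possible solution, sketched with two small patterns. It assumes the credential block sits between "://" and the first "@", so a password containing "@" or "/" is out of scope for this version.

```javascript
function parseCredentials(connStr) {
  // Step 1: find the credential block between "scheme://" and "@".
  const block = /^[a-z][a-z0-9+.-]*:\/\/([^@/]+)@/i.exec(connStr);
  if (block === null) return null;

  // Step 2: split the block at the first colon; the rest is the password,
  // so a password that itself contains ":" is kept intact.
  const parts = /^([^:]+):(.*)$/.exec(block[1]);
  if (parts === null) return null;

  return { user: parts[1], password: parts[2] };
}

// Trade-off: two patterns mean two passes over part of the input, but each
// pattern has a single job and can be tested and changed on its own.
```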
parseCredentials('postgres://alice:s3cr3t@db.host:5432/mydb') → { user: 'alice', password: 's3cr3t' }
parseCredentials('not-a-url') → null
Notice that the two-step approach handles a password containing a : correctly
(Checkpoint 4, second test). A single combined pattern would need to use a
greedy-vs-lazy trick or a more complex character class to handle this edge case.
The sequential approach makes the intent of each step obvious: "find the
credential block" and "split it at the first colon".
Done?
All four green? You have completed the full Regular Expressions advanced tier.
You now have the tools to:
- Diagnose and fix catastrophic backtracking
- Use possessive quantifiers and atomic groups (and emulate them in JavaScript)
- Benchmark patterns and read step counts in regex debuggers
- Navigate engine differences and choose the right flavour for each environment
- Build multi-step extraction pipelines from log files and unstructured text
- Use regex across the developer toolchain — grep, sed, VS Code, git, PostgreSQL
- Recognise when to stop and reach for a parser instead
The next practice ground is your own work: log files, data migrations, search features, and linter configs. Real text is always messier than examples — but now you know how to read it.