Lab: real-world applications

Four exercises on realistic messy text — log error extraction, URL parsing, price normalisation, and pattern decomposition.

Optional lab. These exercises work with the kind of messy, inconsistent real-world text that automated agents and scripts encounter every day. Each checkpoint includes a solution and explanation. Try to write the solution yourself before reading ahead.

Warm up — anatomy of a log line

Before writing any patterns, explore what the data looks like:
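
As a minimal sketch of that exploration, assume the lines follow the Apache Common Log Format. The sample line below is hypothetical (the IP address and byte count are invented; the other fields mirror the Checkpoint 1 example), and dissectLogLine is an illustrative helper name, not part of the lab. It uses plain string methods rather than a pattern, just to show where the delimiters sit.

```javascript
// Hypothetical sample line in Apache Common Log Format.
// The IP address and byte count are invented for illustration.
const sampleLine =
  '203.0.113.7 - - [12/Jun/2024:08:00:02 +0000] "POST /login HTTP/1.1" 401 512';

// Illustrative helper: split one line on its structural delimiters.
function dissectLogLine(line) {
  // The timestamp sits between the first '[' and the first ']'.
  const timestamp = line.slice(line.indexOf('[') + 1, line.indexOf(']'));

  // The request sits between the first and last double quote.
  const request = line.slice(line.indexOf('"') + 1, line.lastIndexOf('"'));
  const [method, path, protocol] = request.split(' ');

  // Status code and byte count follow the closing quote.
  const tail = line.slice(line.lastIndexOf('"') + 1).trim();
  const [status, bytes] = tail.split(' ');

  return { timestamp, method, path, protocol, status, bytes };
}

console.log(dissectLogLine(sampleLine));
```

Once the delimiters are clear (square brackets, double quotes, single spaces), each capture group in a pattern can be tied to one of them.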


Checkpoint 1 — extract 4xx errors

Extract all lines with 4xx status codes. For each, return an object with { timestamp, method, path, status }. The timestamp is the text between the square brackets.

Extract 4xx errors from log (JavaScript)

Write extractErrors(logs) that takes an array of Apache log line strings and returns an array of objects { timestamp, method, path, status } for every line with a 4xx status code (400-499).

extractErrors([...]) on a 401 line → [{ timestamp: '12/Jun/2024:08:00:02 +0000', method: 'POST', path: '/login', status: '401' }]
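
One possible solution is sketched below. It assumes each line follows the Apache Common or Combined Log Format, with the timestamp in square brackets followed by a double-quoted request and then the status code. The constant name LOG_LINE and the sample lines are illustrative assumptions, not part of the exercise.

```javascript
// Assumed layout: host ident user [timestamp] "METHOD path protocol" status bytes
// Groups: 1 = timestamp, 2 = method, 3 = path, 4 = status (4xx only).
const LOG_LINE = /\[([^\]]+)\]\s+"([A-Z]+)\s+(\S+)[^"]*"\s+(4\d{2})\b/;

function extractErrors(logs) {
  const results = [];
  for (const line of logs) {
    // No g flag, so exec() carries no state between lines.
    const m = LOG_LINE.exec(line);
    if (m) {
      const [, timestamp, method, path, status] = m;
      results.push({ timestamp, method, path, status });
    }
  }
  return results;
}

// Hypothetical sample data (IP addresses and sizes are invented).
const sampleLogs = [
  '203.0.113.7 - - [12/Jun/2024:08:00:01 +0000] "GET /index.html HTTP/1.1" 200 4012',
  '203.0.113.7 - - [12/Jun/2024:08:00:02 +0000] "POST /login HTTP/1.1" 401 512',
];
console.log(extractErrors(sampleLogs));
```

The status group is tied to the position directly after the closing quote of the request, so a byte count such as 4012 on a 200 line is not mistaken for a 4xx status.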

Checkpoint 2 — extract href URLs from HTML

Extract all href attribute values from anchor tags in an HTML snippet. Then explain (in a comment) why you would use a real parser in production.

Extract href values from HTML (JavaScript)

Write extractHrefs(html) that returns an array of all href attribute values found in anchor tags. Handle both double-quoted and single-quoted values. Return an empty array if none found. (This works for simple inputs — the next step in production is a real HTML parser.)

extractHrefs('<a href="https://example.com">link</a>') → ['https://example.com']

The exercise above works for simple, well-formed HTML. In production, it will mishandle: href attributes with spaces before =, attributes spread over multiple lines, values containing escaped quotes, and links inside HTML comments (which a regex still matches even though they are not part of the rendered page). For any real HTML document, use DOMParser (browser) or cheerio/node-html-parser (Node.js). The regex version is acceptable for controlled, machine-generated output where you own the format.
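
A minimal sketch of the regex version follows, kept deliberately simple so its limits are visible. The pattern is one possible choice, not the lab's reference solution; which of the cases above it gets wrong depends on how the pattern is written. This one, for example, does not tolerate whitespace around the = sign.

```javascript
function extractHrefs(html) {
  // <a, then any other attributes, then href="..." or href='...'.
  // \shref keeps attributes such as data-href from matching.
  // Production note: this is a text match, not a parse. It knows nothing
  // about comments, entities, or malformed markup, so a real HTML parser
  // is the safer choice for documents you do not control.
  const pattern = /<a\b[^>]*?\shref=(?:"([^"]*)"|'([^']*)')/gi;
  // Group 1 holds a double-quoted value, group 2 a single-quoted one.
  return Array.from(html.matchAll(pattern), (m) => m[1] ?? m[2]);
}

console.log(extractHrefs('<a href="https://example.com">link</a>'));
// → ['https://example.com']

// Whitespace around "=" defeats this pattern:
console.log(extractHrefs('<a href = "https://example.com">link</a>'));
// → []
```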

Checkpoint 3 — price normalisation pipeline

Extract all price-like strings from a product catalogue and normalise them to floats. Prices appear in formats like $12.99, $7.50, $24, 12.99 USD.

Extract and normalise prices (JavaScript)

Write extractPrices(text) that returns an array of numbers (as JavaScript floats) for every price-like value in text. Prices are: a dollar sign followed by digits and optional decimal ($12.99, $7, $0.50), OR digits with optional decimal followed by ' USD' (12.99 USD, 7 USD). Return results in order of appearance.

extractPrices('Widget $12.99 each') → [12.99]
extractPrices('$3.99 and 5 USD') → [3.99, 5]
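
One way to meet this specification is a single alternation with one capture group per format, sketched below. It assumes only the two formats described above; thousands separators (such as 1,299 USD) and other currencies are outside the exercise and are not handled.

```javascript
function extractPrices(text) {
  // Alternative 1: $ followed by digits and an optional decimal part.
  // Alternative 2: digits and an optional decimal part followed by " USD".
  const pattern = /\$(\d+(?:\.\d+)?)|(\d+(?:\.\d+)?) USD\b/g;
  // matchAll yields matches left to right, so order of appearance is kept.
  // Exactly one of the two groups is defined for each match.
  return Array.from(text.matchAll(pattern), (m) => parseFloat(m[1] ?? m[2]));
}

console.log(extractPrices('Widget $12.99 each')); // → [12.99]
console.log(extractPrices('$3.99 and 5 USD'));    // → [3.99, 5]
```

Capturing only the numeric part in each alternative keeps the normalisation step down to a single parseFloat call.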

Checkpoint 4 — decompose a complex pattern

The function below uses one large pattern to extract a user:password pair from a connection string. Rewrite it as two sequential simpler patterns and explain the trade-off in a comment.

Decompose a complex pattern (JavaScript)

Write parseCredentials(connStr) that extracts { user, password } from a connection string like 'postgres://alice:s3cr3t@db.host:5432/mydb'. Use two separate patterns: one to extract the user:password section, then another to split it into user and password. Return null if the format doesn't match.

parseCredentials('postgres://alice:s3cr3t@db.host:5432/mydb') → { user: 'alice', password: 's3cr3t' }
parseCredentials('not-a-url') → null

Notice that the two-step approach handles a password containing a colon correctly (Checkpoint 4, second test). A single combined pattern would need a greedy-versus-lazy trick or a more carefully chosen character class to handle this edge case. The sequential approach makes the intent of each step obvious: "find the credential block" and "split it at the first colon".
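
The two steps can be sketched as follows. The pattern names CREDENTIAL_BLOCK and USER_PASSWORD are illustrative, and the sketch assumes the credential block itself contains no @ or / characters (in a well-formed URL those would be percent-encoded).

```javascript
// Step 1: find the credential block between "scheme://" and the first "@".
const CREDENTIAL_BLOCK = /^[a-z][a-z0-9+.-]*:\/\/([^@\/]+)@/i;
// Step 2: split the block at the FIRST colon; the password may contain more.
const USER_PASSWORD = /^([^:]+):(.*)$/;

function parseCredentials(connStr) {
  const block = CREDENTIAL_BLOCK.exec(connStr);
  if (!block) return null; // no scheme or no credential section

  const parts = USER_PASSWORD.exec(block[1]);
  if (!parts) return null; // credential section has no colon

  // Trade-off: two small patterns cost one extra match, but each can be
  // read, tested, and changed independently of the other.
  return { user: parts[1], password: parts[2] };
}

console.log(parseCredentials('postgres://alice:s3cr3t@db.host:5432/mydb'));
// → { user: 'alice', password: 's3cr3t' }
console.log(parseCredentials('postgres://bob:pa:ss@db.host/mydb'));
// → { user: 'bob', password: 'pa:ss' }
console.log(parseCredentials('not-a-url'));
// → null
```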

Done?

All four green? You have completed the full Regular Expressions advanced tier.

You now have the tools to:

Diagnose and fix catastrophic backtracking
Use possessive quantifiers and atomic groups (and emulate them in JavaScript)
Benchmark patterns and read step counts in regex debuggers
Navigate engine differences and choose the right flavour for each environment
Build multi-step extraction pipelines from log files and unstructured text
Use regex across the developer toolchain — grep, sed, VS Code, git, PostgreSQL
Recognise when to stop and reach for a parser instead

The next practice ground is your own work: log files, data migrations, search features, and linter configs. Real text is always messier than examples — but now you know how to read it.
