Data extraction pipelines
Use regex as part of a larger pipeline — pre-processing text, extracting structured fields, normalising values, and composing sequential patterns.
- Pre-process unstructured text before applying extraction patterns
- Extract prices, units, and product codes from a realistic catalogue blob
- Normalise extracted values (case, whitespace, format) using replace
- Decompose a complex single pattern into a pipeline of simpler ones
A single regex is rarely the whole story. In practice, extraction lives inside a pipeline: clean the input, extract fields, normalise what you extracted, and validate the result. Regex is one stage in that pipeline — not the entire pipeline. This lesson builds that mental model with a worked example.
The product catalogue problem
Suppose a marketing team has produced a raw product description file. Each line is a different product, described in free text with no consistent structure:
Widget Pro 500ml @ $12.99 each | SKU: WP-500-BLU
Gadget Lite 250 mL - Price $7.50 / unit SKU:GL-250-RED
Super Gizmo 1L $24.00 (sku: SG-1000-GRN)
Nano Device 100ml $3.99 each | product code: ND-100-WHT
Mega Tool 2 liters -- $49.99 per unit [SKU: MT-2000-BLK]
You need to produce a structured table: { name, volumeMl, priceDollars, sku }.
Step 1 — pre-process: normalise before matching
Before any extraction, make the text uniform:
function normalise(line) {
  return line
    .toLowerCase()          // case-insensitive matching later
    .replace(/\s+/g, " ")   // collapse runs of whitespace into one space
    .trim();
}
Normalisation should happen before extraction, not inside the extraction pattern. Keeping the pattern simpler reduces backtracking risk and makes it easier to test.
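To see the effect on the catalogue data, here is a quick check on two of the sample lines. The extra spaces around the second input are added here for illustration and are not in the original file.

```javascript
function normalise(line) {
  return line
    .toLowerCase()
    .replace(/\s+/g, " ")
    .trim();
}

const samples = [
  "Gadget Lite 250 mL - Price $7.50 / unit SKU:GL-250-RED",
  "  Super Gizmo   1L $24.00 (sku: SG-1000-GRN)  ", // padded for illustration
];

const cleaned = samples.map(normalise);
console.log(cleaned[0]); // "gadget lite 250 ml - price $7.50 / unit sku:gl-250-red"
console.log(cleaned[1]); // "super gizmo 1l $24.00 (sku: sg-1000-grn)"
```

Note that lowercasing also lowercases the product name and the SKU, so later stages have to restore any casing they care about.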
Step 2 — extract the volume
Volume appears as a number followed by ml, mL, l, liter, or liters.
After lowercasing, you only need to handle the lowercase variants:
function extractVolumeMl(line) {
  const m = line.match(/(\d+(?:\.\d+)?)\s*(?:ml|l(?:iter)?s?)\b/);
  if (!m) return null;
  const num = parseFloat(m[1]);
  // Recover the unit by stripping the number from the full match
  const unit = m[0].replace(m[1], "").trim();
  // Convert litres to ml
  return unit.startsWith("l") ? num * 1000 : num;
}
Wait, that logic is getting complex inside the function: the unit has to be reconstructed from the full match. Better to separate the unit detection by capturing the unit in its own group:
function extractVolumeMl(line) {
  const m = line.match(/(\d+(?:\.\d+)?)\s*(ml|liters?|l)\b/);
  if (!m) return null;
  const num = parseFloat(m[1]);
  const unit = m[2];
  if (unit === "ml") return num;
  return num * 1000; // l, liter, liters → ml
}
Two simple branches, easy to test, easy to extend.
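"Easy to test" can be made concrete with a small table of inputs and expected results. The sketch below runs the function against already-normalised versions of the sample lines; the `volumeCases` table and the last two inputs are made up for illustration.

```javascript
function extractVolumeMl(line) {
  const m = line.match(/(\d+(?:\.\d+)?)\s*(ml|liters?|l)\b/);
  if (!m) return null;
  const num = parseFloat(m[1]);
  return m[2] === "ml" ? num : num * 1000; // l, liter, liters → ml
}

// [normalised input, expected volume in ml]
const volumeCases = [
  ["widget pro 500ml @ $12.99 each | sku: wp-500-blu", 500],
  ["gadget lite 250 ml - price $7.50 / unit sku:gl-250-red", 250],
  ["super gizmo 1l $24.00 (sku: sg-1000-grn)", 1000],
  ["mega tool 2 liters -- $49.99 per unit [sku: mt-2000-blk]", 2000],
  ["half bottle 0.5 l", 500],       // illustrative input
  ["no volume here $5.00", null],   // illustrative input
];

for (const [input, expected] of volumeCases) {
  const actual = extractVolumeMl(input);
  console.log(actual === expected ? "ok  " : "FAIL", input, "→", actual);
}
```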
Step 3 — extract the price
Prices appear as $12.99 or $7.50:
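function extractPrice(line) {
  // The trailing lookahead fails the match if further decimals follow
  const m = line.match(/\$(\d+(?:\.\d{2})?)(?!\.?\d)/);
  return m ? parseFloat(m[1]) : null;
}
The \d+(?:\.\d{2})? part matches an integer or a decimal with exactly two
decimal places. On its own it would still match the $1 in $1.5, so the negative
lookahead (?!\.?\d) fails the match when more digits follow. Together they give
a price-specific constraint that rejects $1.5 (probably a rate or ratio) while
accepting $1.50.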
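The two-decimal rule is worth checking against edge cases, because an optional group alone does not stop a regex from matching a shorter prefix. The sketch below enforces the rule with a trailing negative lookahead; `extractPriceStrict` is a hypothetical name used here so the check is self-contained, and the inputs are illustrative.

```javascript
function extractPriceStrict(line) {
  // (?!\.?\d) : no further digit, and no ".digit", may follow the price
  const m = line.match(/\$(\d+(?:\.\d{2})?)(?!\.?\d)/);
  return m ? parseFloat(m[1]) : null;
}

console.log(extractPriceStrict("widget $12.99 each")); // 12.99
console.log(extractPriceStrict("flat fee $24"));       // 24
console.log(extractPriceStrict("only $7.50."));        // 7.5 (sentence-final dot is fine)
console.log(extractPriceStrict("ratio $1.5 per unit")); // null
console.log(extractPriceStrict("odd $12.999"));         // null
```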
Step 4 — extract the SKU
SKUs appear after various labels: SKU:, product code:, sku :, etc. After
normalising to lowercase:
function extractSku(line) {
  const m = line.match(/(?:sku|product code)\s*:\s*([a-z0-9][a-z0-9-]*)/i);
  return m ? m[1].toUpperCase() : null;
}
The [a-z0-9][a-z0-9-]* matches the SKU format (starts with an alphanumeric
character, may contain hyphens). Calling .toUpperCase() on the result restores
the upper-case form regardless of input case.
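A short check of the label variants mentioned above, including the bracketed and parenthesised forms from the sample data. The `skuCases` table is illustrative; the "sku :" input with a space before the colon is made up to exercise that variant.

```javascript
function extractSku(line) {
  const m = line.match(/(?:sku|product code)\s*:\s*([a-z0-9][a-z0-9-]*)/i);
  return m ? m[1].toUpperCase() : null;
}

// [normalised input, expected SKU]
const skuCases = [
  ["widget pro 500ml @ $12.99 each | sku: wp-500-blu", "WP-500-BLU"],
  ["gadget lite 250 ml - price $7.50 / unit sku:gl-250-red", "GL-250-RED"],
  ["super gizmo 1l $24.00 (sku: sg-1000-grn)", "SG-1000-GRN"],
  ["nano device 100ml $3.99 each | product code: nd-100-wht", "ND-100-WHT"],
  ["mega tool 2 liters -- $49.99 per unit [sku: mt-2000-blk]", "MT-2000-BLK"],
  ["spare part sku : sp-1", "SP-1"],   // illustrative input
  ["no label here", null],             // illustrative input
];

for (const [input, expected] of skuCases) {
  console.log(extractSku(input) === expected ? "ok  " : "FAIL", input);
}
```

The closing `)` and `]` are excluded automatically because they are not in the character class, so no extra clean-up step is needed.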
Composing the pipeline
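With each stage written as a small function, the pipeline is a matter of calling them in order. What follows is a minimal sketch under a few assumptions: `parseProduct` and `extractName` are hypothetical names, the name rule ("everything before the volume") is a guess that happens to fit the five sample lines, and the price pattern uses a trailing lookahead so that one-decimal amounts are rejected.

```javascript
// Stage 1: make the text uniform.
function normalise(line) {
  return line.toLowerCase().replace(/\s+/g, " ").trim();
}

// Stage 2: one small extractor per field.
function extractVolumeMl(line) {
  const m = line.match(/(\d+(?:\.\d+)?)\s*(ml|liters?|l)\b/);
  if (!m) return null;
  const num = parseFloat(m[1]);
  return m[2] === "ml" ? num : num * 1000;
}

function extractPrice(line) {
  const m = line.match(/\$(\d+(?:\.\d{2})?)(?!\.?\d)/);
  return m ? parseFloat(m[1]) : null;
}

function extractSku(line) {
  const m = line.match(/(?:sku|product code)\s*:\s*([a-z0-9][a-z0-9-]*)/i);
  return m ? m[1].toUpperCase() : null;
}

// Hypothetical helper: treat everything before the volume as the name.
function extractName(line) {
  const m = line.match(/^(.+?)\s+\d+(?:\.\d+)?\s*(?:ml|liters?|l)\b/);
  return m ? m[1] : null;
}

// Stage 3: compose. Fields are extracted independently, so a missing
// field becomes null without affecting the others.
function parseProduct(rawLine) {
  const line = normalise(rawLine);
  return {
    name: extractName(line),
    volumeMl: extractVolumeMl(line),
    priceDollars: extractPrice(line),
    sku: extractSku(line),
  };
}

const catalogue = [
  "Widget Pro 500ml @ $12.99 each | SKU: WP-500-BLU",
  "Gadget Lite 250 mL - Price $7.50 / unit SKU:GL-250-RED",
  "Super Gizmo 1L $24.00 (sku: SG-1000-GRN)",
  "Nano Device 100ml $3.99 each | product code: ND-100-WHT",
  "Mega Tool 2 liters -- $49.99 per unit [SKU: MT-2000-BLK]",
];

const rows = catalogue.map(parseProduct);
console.table(rows);
// rows[0] is { name: "widget pro", volumeMl: 500, priceDollars: 12.99, sku: "WP-500-BLU" }
```

Because the name is taken from the normalised line, it comes back in lower case. If the original casing matters, one option is to run the name pattern with the i flag against the whitespace-collapsed original instead.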
When to split a complex pattern
You might be tempted to write a single regex that captures all four fields at once. Here is what that would look like:
// Attempting everything in one pattern
const combined = /^(?<name>[A-Za-z][A-Za-z\s]+?)\s+(?<vol>\d+(?:\.\d+)?)\s*(?<unit>ml|liters?|l)\b.*\$(?<price>\d+(?:\.\d{2})?).*(?:SKU|product code)\s*:\s*(?<sku>[A-Z0-9-]+)/i;
This works for the happy path, but:
- If any field is missing or in a different order, the whole match fails — no partial result.
- Debugging which part failed requires careful inspection.
- Adding a new field means restructuring the entire pattern.
- The .* between fields creates potential backtracking.
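The first point is easy to demonstrate. In this sketch the combined pattern is run against one complete line and one line with the SKU removed; the `noSku` input is made up for illustration.

```javascript
// The combined pattern from above, unchanged.
const combined = /^(?<name>[A-Za-z][A-Za-z\s]+?)\s+(?<vol>\d+(?:\.\d+)?)\s*(?<unit>ml|liters?|l)\b.*\$(?<price>\d+(?:\.\d{2})?).*(?:SKU|product code)\s*:\s*(?<sku>[A-Z0-9-]+)/i;

const complete = "Widget Pro 500ml @ $12.99 each | SKU: WP-500-BLU";
const noSku = "Widget Pro 500ml @ $12.99 each"; // illustrative: SKU missing

const full = complete.match(combined);
console.log(full.groups.sku);        // "WP-500-BLU"
console.log(noSku.match(combined));  // null: name, volume and price are lost too

// A per-field pattern still recovers what is present.
const price = noSku.match(/\$(\d+(?:\.\d{2})?)/);
console.log(price ? price[1] : null); // "12.99"
```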
Sequential patterns win when:
- Fields can be absent (one missing field shouldn't fail extraction of the others)
- The order of fields varies across input sources
- You want to add, remove, or change one field without touching the others
- The pattern would otherwise require .* bridges between sections
A single pattern wins when:
- All fields are guaranteed present and in a fixed order
- You need to validate the structure as a whole (reject partial matches)
- The input is controlled and well-formatted (e.g. your own system's output)
A common hybrid: write a structural pattern that validates the overall format of each line (and rejects lines that don't look like product entries at all), then use simple per-field patterns to extract the values from the lines that pass. The structural pass is a gate; the extraction passes are value retrieval.
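A rough sketch of that hybrid follows. The gate `looksLikeProduct` is a hypothetical pattern invented for this example: it uses three lookaheads to require a volume, a dollar price, and a SKU label somewhere on the line, in any order. A real gate would depend on what your input actually looks like. The middle input line is also made up.

```javascript
// Hypothetical structural gate: does this line look like a product entry?
const looksLikeProduct =
  /^(?=.*\d\s*(?:ml|liters?|l)\b)(?=.*\$\d)(?=.*(?:sku|product code)\s*:)/i;

// Per-field extraction, applied only to lines that pass the gate.
function extractSku(line) {
  const m = line.match(/(?:sku|product code)\s*:\s*([a-z0-9][a-z0-9-]*)/i);
  return m ? m[1].toUpperCase() : null;
}

const lines = [
  "Widget Pro 500ml @ $12.99 each | SKU: WP-500-BLU",
  "Spring catalogue, prices valid until further notice", // not a product line
  "Mega Tool 2 liters -- $49.99 per unit [SKU: MT-2000-BLK]",
];

const skus = lines
  .filter((line) => looksLikeProduct.test(line)) // the gate
  .map(extractSku);                              // value retrieval

console.log(skus); // ["WP-500-BLU", "MT-2000-BLK"]
```

The gate pattern has no g flag, so calling `.test()` repeatedly is stateless. With a g or y flag, `lastIndex` would carry over between calls and could cause lines to be skipped.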
Where to go next
The next lesson, Regex in tooling, moves beyond JavaScript and Python code to show how regex works in grep, sed, VS Code, git, and PostgreSQL — places where you often need regex but not a full program.