Data extraction pipelines

Use regex as part of a larger pipeline — pre-processing text, extracting structured fields, normalising values, and composing sequential patterns.

A single regex is rarely the whole story. In practice, extraction lives inside a pipeline: clean the input, extract fields, normalise what you extracted, and validate the result. Regex is one stage in that pipeline — not the entire pipeline. This lesson builds that mental model with a worked example.

The product catalogue problem

Suppose a marketing team has produced a raw product description file. Each line is a different product, described in free text with no consistent structure:

Widget Pro 500ml @ $12.99 each | SKU: WP-500-BLU
Gadget Lite 250 mL - Price $7.50 / unit SKU:GL-250-RED
Super Gizmo 1L $24.00 (sku: SG-1000-GRN)
Nano Device 100ml $3.99 each | product code: ND-100-WHT
Mega Tool 2 liters -- $49.99 per unit [SKU: MT-2000-BLK]

You need to produce a structured table: { name, volumeMl, priceDollars, sku }.

Step 1 — pre-process: normalise before matching

Before any extraction, make the text uniform:

function normalise(line) {
  return line
    .toLowerCase()           // case-insensitive matching later
    .replace(/\s+/g, " ")   // collapse runs of whitespace (spaces, tabs) into one space
    .trim();
}

Normalisation should happen before extraction, not inside the extraction pattern. Keeping the pattern simpler reduces backtracking risk and makes it easier to test.
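As a quick check, here is a small sketch of what the helper does to untidy input. The two sample strings are invented for illustration; the function body is the one shown above, repeated so the snippet runs on its own.

```javascript
// Same helper as above, repeated so this snippet is self-contained.
function normalise(line) {
  return line.toLowerCase().replace(/\s+/g, " ").trim();
}

// Hypothetical messy inputs: leading spaces, a tab, doubled spaces, mixed case.
const samples = [
  "  Gadget Lite   250 mL\t- Price $7.50 ",
  "Super  Gizmo 1L",
];

const cleaned = samples.map(normalise);
cleaned.forEach(s => console.log(JSON.stringify(s)));
// "gadget lite 250 ml - price $7.50"
// "super gizmo 1l"
```

After this step, every later pattern can assume lowercase text with single spaces.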

Step 2 — extract the volume

Volume appears as a number followed by ml, mL, L, liter, or liters, with or without a space before the unit. After lowercasing, you only need to handle the lowercase variants:

function extractVolumeMl(line) {
  const m = line.match(/(\d+(?:\.\d+)?)\s*(?:ml|l(?:iter)?s?)\b/);
  if (!m) return null;
  const num = parseFloat(m[1]);
  // Recover the unit by stripping the number back out of the full match
  const unit = m[0].replace(m[1], "").trim();
  // "l", "liter" and "liters" are litres; "ml" is already millilitres
  return unit.startsWith("l") ? num * 1000 : num;
}

That works, but the unit has to be recovered by stripping the number back out of the full match, which is roundabout and easy to get wrong. Better to capture the unit in its own group:

function extractVolumeMl(line) {
  const m = line.match(/(\d+(?:\.\d+)?)\s*(ml|liters?|l)\b/);
  if (!m) return null;
  const num = parseFloat(m[1]);
  const unit = m[2];
  if (unit === "ml") return num;
  return num * 1000;   // l, liter, liters → ml
}

Two simple branches, easy to test, easy to extend.
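Because the function is small, a table of input/expected pairs is enough to exercise it. The following is a minimal sketch; the `cases` table and the last two inputs are made up for illustration, and all inputs are assumed to be already lowercased.

```javascript
function extractVolumeMl(line) {
  const m = line.match(/(\d+(?:\.\d+)?)\s*(ml|liters?|l)\b/);
  if (!m) return null;
  const num = parseFloat(m[1]);
  return m[2] === "ml" ? num : num * 1000;
}

// [input, expected] pairs; inputs are assumed to be normalised already.
const cases = [
  ["widget pro 500ml @ $12.99", 500],
  ["gadget lite 250 ml - price $7.50", 250],
  ["super gizmo 1l $24.00", 1000],
  ["mega tool 2 liters -- $49.99", 2000],
  ["half bottle 0.5 l", 500],   // hypothetical decimal volume
  ["no volume here", null],     // hypothetical line with no volume
];

// Collect every case whose actual result differs from the expectation.
const failures = cases.filter(
  ([input, expected]) => extractVolumeMl(input) !== expected
);
console.log(failures.length === 0 ? "all volume cases pass" : failures);
```

Extending the function to a new unit then means adding one alternative to the pattern, one branch, and one row to the table.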

Step 3 — extract the price

Prices appear as $12.99 or $7.50:

function extractPrice(line) {
  const m = line.match(/\$(\d+(?:\.\d{2})?)/);
  return m ? parseFloat(m[1]) : null;
}

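The \d+(?:\.\d{2})? pattern matches an integer or a decimal with exactly two decimal places, which suits the prices in this file. Be aware that it does not reject $1.5 (probably a rate or ratio): the optional decimal group simply fails to match, and the result falls back to the integer part, giving 1. If such values must be rejected, add a negative lookahead such as (?!\.?\d) after the number.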
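Since nothing follows the optional decimal group, the pattern can stop partway through a number. The sketch below compares it with a stricter variant. The lookahead `(?!\.?\d)` and the helper name `extractPriceWith` are additions for illustration, not part of the lesson's pipeline.

```javascript
// Pattern from the lesson: nothing constrains what follows the number.
const loose = /\$(\d+(?:\.\d{2})?)/;
// Stricter variant (an assumption): refuse to stop if more digits follow,
// with or without a decimal point in between.
const strict = /\$(\d+(?:\.\d{2})?)(?!\.?\d)/;

// Hypothetical helper so both patterns share the same extraction code.
function extractPriceWith(pattern, line) {
  const m = line.match(pattern);
  return m ? parseFloat(m[1]) : null;
}

const inputs = ["$12.99 each", "$24", "rate $1.5 per", "$12.999"];
const comparison = inputs.map(s => [
  s,
  extractPriceWith(loose, s),
  extractPriceWith(strict, s),
]);
comparison.forEach(row => console.log(row));
// "$12.99 each"   -> loose 12.99, strict 12.99
// "$24"           -> loose 24,    strict 24
// "rate $1.5 per" -> loose 1,     strict null
// "$12.999"       -> loose 12.99, strict null
```

Whether the loose or strict behaviour is right depends on the data: the strict form trades a silently truncated value for a null that the validation stage can flag.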

Step 4 — extract the SKU

SKUs appear after various labels: SKU:, product code:, sku :, etc. After normalising to lowercase:

function extractSku(line) {
  const m = line.match(/(?:sku|product code)\s*:\s*([a-z0-9][a-z0-9-]*)/i);
  return m ? m[1].toUpperCase() : null;
}

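The [a-z0-9][a-z0-9-]* part matches the SKU format (starts with an alphanumeric character, may contain hyphens). The i flag keeps the function usable on text that has not been lowercased, and calling .toUpperCase() on the result restores the conventional uppercase form regardless of input case.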
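To see how the label alternation and the flexible spacing around the colon behave, here is a short sketch run against the label variants mentioned above. The fragments in `labelled` are trimmed-down, lowercased versions of the catalogue lines plus one invented unlabelled line.

```javascript
function extractSku(line) {
  const m = line.match(/(?:sku|product code)\s*:\s*([a-z0-9][a-z0-9-]*)/i);
  return m ? m[1].toUpperCase() : null;
}

const labelled = [
  "| sku: wp-500-blu",
  "unit sku:gl-250-red",         // no space after the colon
  "(sku : sg-1000-grn)",         // space before the colon, trailing bracket
  "| product code: nd-100-wht",
  "no label, just mt-2000-blk",  // hypothetical: code present but unlabelled
];

const skus = labelled.map(extractSku);
skus.forEach(s => console.log(s));
// WP-500-BLU, GL-250-RED, SG-1000-GRN, ND-100-WHT, null
```

The last case returns null by design: without a label there is nothing to distinguish a SKU from any other hyphenated token.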

Composing the pipeline


const catalogue = [
  "Widget Pro 500ml @ $12.99 each | SKU: WP-500-BLU",
  "Gadget Lite 250 mL - Price $7.50 / unit SKU:GL-250-RED",
  "Super Gizmo 1L $24.00 (sku: SG-1000-GRN)",
  "Nano Device 100ml $3.99 each | product code: ND-100-WHT",
  "Mega Tool 2 liters -- $49.99 per unit [SKU: MT-2000-BLK]",
];

function normalise(line) {
  return line.toLowerCase().replace(/\s+/g, " ").trim();
}

function extractVolumeMl(line) {
  const m = line.match(/(\d+(?:\.\d+)?)\s*(ml|liters?|l)\b/);
  if (!m) return null;
  const num = parseFloat(m[1]);
  return m[2] === "ml" ? num : num * 1000;
}

function extractPrice(line) {
  const m = line.match(/\$(\d+(?:\.\d{2})?)/);
  return m ? parseFloat(m[1]) : null;
}

function extractSku(line) {
  const m = line.match(/(?:sku|product code)\s*:\s*([a-z0-9][a-z0-9-]*)/i);
  return m ? m[1].toUpperCase() : null;
}

function extractName(line) {
  // Name is the leading text before the first price, volume, or SKU marker
  const m = line.match(/^([a-z][a-z\s]+?)(?=\d|@|\$|sku|product)/i);
  return m ? m[1].trim() : null;
}

const results = catalogue.map(raw => {
  const line = normalise(raw);
  return {
    // The name comes from the raw line so its original capitalisation survives
    name:         extractName(raw),
    volumeMl:     extractVolumeMl(line),
    priceDollars: extractPrice(line),
    sku:          extractSku(line),
  };
});

results.forEach(r => console.log(JSON.stringify(r)));
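The pipeline described at the start of this lesson ends with a validation stage, which the code above does not include. One way it could look is sketched below. The function name `validateRecord`, the specific rules, and the SKU shape (two letters, digits, three letters, inferred only from the five sample lines) are all assumptions for illustration.

```javascript
// Hypothetical validator: returns a list of problems, empty if the record is usable.
function validateRecord(record) {
  const problems = [];
  if (!record.name) problems.push("missing name");
  // null > 0 is false, so a missing value is reported here too
  if (!(record.volumeMl > 0)) problems.push("missing or non-positive volume");
  if (!(record.priceDollars > 0)) problems.push("missing or non-positive price");
  // Assumed SKU shape, e.g. WP-500-BLU; adjust to the real scheme
  if (!/^[A-Z]{2}-\d+-[A-Z]{3}$/.test(record.sku ?? "")) {
    problems.push("unexpected SKU format");
  }
  return problems;
}

// Two hand-written records in the shape the pipeline produces.
const good = { name: "Widget Pro", volumeMl: 500, priceDollars: 12.99, sku: "WP-500-BLU" };
const bad = { name: "Mystery Item", volumeMl: null, priceDollars: 3.99, sku: null };

console.log(validateRecord(good)); // []
console.log(validateRecord(bad));  // two problems: volume and SKU
```

Returning a list of problems rather than a boolean keeps the partial result available, so a record with a missing volume can still be routed to manual review instead of being discarded.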

When to split a complex pattern

You might be tempted to write a single regex that captures all four fields at once. Here is what that would look like:

// Attempting everything in one pattern
const combined = /^(?<name>[A-Za-z][A-Za-z\s]+?)\s+(?<vol>\d+(?:\.\d+)?)\s*(?<unit>ml|liters?|l)\b.*\$(?<price>\d+(?:\.\d{2})?).*(?:SKU|product code)\s*:\s*(?<sku>[A-Z0-9-]+)/i;

This works for the happy path, but:

If any field is missing or in a different order, the whole match fails — no partial result.
Debugging which part failed requires careful inspection.
Adding a new field means restructuring the entire pattern.
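The .* between fields is greedy: it first consumes the rest of the line and then backtracks to find the next field. That is wasted work on long lines, and it means the last $ on the line is captured, not the first.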
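The drawbacks listed above can be observed directly. In this sketch, `noSku` and `twoPrices` are invented variants of the first catalogue line; the combined pattern is the one shown above.

```javascript
const combined = /^(?<name>[A-Za-z][A-Za-z\s]+?)\s+(?<vol>\d+(?:\.\d+)?)\s*(?<unit>ml|liters?|l)\b.*\$(?<price>\d+(?:\.\d{2})?).*(?:SKU|product code)\s*:\s*(?<sku>[A-Z0-9-]+)/i;

const complete = "Widget Pro 500ml @ $12.99 each | SKU: WP-500-BLU";
// Hypothetical variants: one with the SKU dropped, one with a second price.
const noSku = "Widget Pro 500ml @ $12.99 each";
const twoPrices = "Widget Pro 500ml $12.99 (was $15.99) | SKU: WP-500-BLU";

const completePrice = combined.exec(complete).groups.price;
const noSkuMatch = combined.exec(noSku);
const twoPricesPrice = combined.exec(twoPrices).groups.price;

console.log(completePrice);   // "12.99"
console.log(noSkuMatch);      // null: one missing field loses everything
console.log(twoPricesPrice);  // "15.99": the greedy .* finds the last $

// A per-field pattern still recovers the price from the SKU-less line.
const priceOnly = /\$(\d+(?:\.\d{2})?)/;
const recovered = noSku.match(priceOnly)[1];
console.log(recovered);       // "12.99"
```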

Sequential patterns win when:

Fields can be absent (one missing field shouldn't fail extraction of the others)
The order of fields varies across input sources
You want to add, remove, or change one field without touching the others
The pattern would otherwise require .* bridges between sections

A single pattern wins when:

All fields are guaranteed present and in a fixed order
You need to validate the structure as a whole (reject partial matches)
The input is controlled and well-formatted (e.g. your own system's output)

A common hybrid: write a structural pattern that validates the overall format of each line (and rejects lines that don't look like product entries at all), then use simple per-field patterns to extract the values from the lines that pass. The structural pass is a gate; the extraction passes are value retrieval.
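A minimal sketch of that hybrid follows. The gate here is built from lookaheads so that it checks for the presence of a volume, a price, and a labelled code without fixing their order; that choice, the name `looksLikeProduct`, and the second and third sample lines are assumptions for illustration, not the only way to write a structural pass.

```javascript
// Structural gate (assumed shape): each lookahead checks that one field
// exists somewhere on the line. No g flag, so .test() keeps no state.
const looksLikeProduct =
  /^(?=.*\d\s*(?:ml|liters?|l)\b)(?=.*\$\d)(?=.*(?:sku|product code)\s*:)/i;

// Per-field extractor, same pattern as in the lesson.
function extractPrice(line) {
  const m = line.match(/\$(\d+(?:\.\d{2})?)/);
  return m ? parseFloat(m[1]) : null;
}

const lines = [
  "Widget Pro 500ml @ $12.99 each | SKU: WP-500-BLU",
  "--- spring catalogue, page 2 ---",             // hypothetical non-product line
  "SKU: MT-2000-BLK Mega Tool 2 liters $49.99",   // hypothetical reordered line
];

// Pass 1: the gate rejects lines that are not product entries at all.
const accepted = lines.filter(l => looksLikeProduct.test(l));
// Pass 2: simple per-field patterns retrieve the values.
const prices = accepted.map(extractPrice);

console.log(accepted.length); // 2
console.log(prices);          // [12.99, 49.99]
```

The gate answers only "is this a product line?", so it can stay coarse; precision lives in the per-field patterns, which can be changed one at a time.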

Where to go next

The next lesson, Regex in tooling, moves beyond JavaScript and Python code to show how regex works in grep, sed, VS Code, git, and PostgreSQL — places where you often need regex but not a full program.
