Code of the Day

When not to use regex

Recognise when regex is the wrong tool — HTML parsing, recursive structures, write-only patterns — and learn to use verbose mode for the patterns you do write.

Regular Expressions · Advanced · 10 min read
By the end of this lesson you will be able to:
  • Identify three categories of problem where regex fails or produces unmaintainable code
  • Explain why HTML and XML must not be parsed with regex
  • Recognise when a string method is clearer than a pattern
  • Document a complex pattern using verbose mode with inline comments

You have spent this entire track learning to write regex. Now comes the most important lesson: when to stop. A well-placed "this is not a job for regex" saves more debugging hours than any clever pattern ever could. Regex is a precision instrument — not a universal solvent.

The HTML problem

There is a famous exchange on Stack Overflow where a developer asks how to parse HTML with regex. The accepted answer, partly serious and partly comic, concludes that regular expressions cannot fully process HTML because HTML is not a regular language. It has nested, recursive structure: tags inside tags inside tags, with optional attributes, self-closing variants, CDATA sections, and comments. A regex can match a specific, known fragment of HTML, but it cannot correctly handle the full structure.

The real risk is not that the pattern is hard to write — it is that it appears to work on test input but silently breaks on real-world HTML:

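// This seems to work on simple cases...
// (matchAll keeps the capture group; match() with the g flag returns only whole matches)
const links = [...html.matchAll(/href="([^"]+)"/g)].map(m => m[1]);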

// ...but breaks on:
//   href='single-quotes'
//   href = "space-around-equals"
//   <!-- href="inside-a-comment" -->
//   <a data-href="not-a-link" href="actual-link">

The correct tool for HTML is an HTML parser. In JavaScript, DOMParser is built into the browser. In Node.js, use cheerio or node-html-parser. In Python, use html.parser (standard library) or BeautifulSoup.

// Correct: let the parser handle structure
const doc = new DOMParser().parseFromString(html, 'text/html');
const links = [...doc.querySelectorAll('a[href]')].map(a => a.href);
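
For the Python route, the same comparison can be sketched with the standard library's html.parser. This is a minimal illustration, not a hardened scraper: LinkCollector, extract_links, and SAMPLE are hypothetical names invented here, and the sample markup simply mirrors the failure cases listed above.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: records the href of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute
        # names arrive lower-cased, and quotes are already stripped.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value is not None:
                self.links.append(value)


def extract_links(html):
    """Return the href values of all anchors, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Assumed sample input covering the cases that trip up the naive pattern.
SAMPLE = """
<!-- <a href="inside-a-comment">old</a> -->
<a href='single-quotes'>one</a>
<a href = "space-around-equals">two</a>
<a data-href="not-a-link" href="actual-link">three</a>
"""

naive_links = re.findall(r'href="([^"]+)"', SAMPLE)
parsed_links = extract_links(SAMPLE)

print("regex: ", naive_links)
print("parser:", parsed_links)
```

On this sample the regex picks up the commented-out link and the data-href value while missing the single-quoted and spaced forms; the parser reports only the three real anchors.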

The same principle applies to XML, JSON, YAML, and any other format with a defined grammar. Use the grammar's parser.

"I just need to grab one attribute" is the most common path to a regex-based HTML bug. The one-attribute pattern works until someone adds a comment above the tag, wraps the value in single quotes, or uses a self-closing form. Budget five minutes to add the right parser library and save hours of incident debugging later.

Recursive and nested structures

Regex (in most flavours) cannot match arbitrarily nested structures. To match balanced parentheses, you would need to know the maximum nesting depth in advance and write a pattern for each level:

Depth 1: \([^()]*\)
Depth 2: \((?:[^()]|\([^()]*\))*\)
Depth 3: \((?:[^()]|\((?:[^()]|\([^()]*\))*\))*\)

This becomes unreadable immediately, and the depth-3 pattern still fails on input nested four levels deep. PCRE's (?R) recursive patterns can match balanced nesting, but they are engine-specific and return only a flat match, not a tree you can walk or validate the way a parser's output can.
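
By contrast, a few lines of ordinary code handle any depth. The sketch below is one possible approach (a plain depth counter, with the hypothetical name max_nesting_depth); it only checks balance and reports depth, it does not build a structure.

```python
def max_nesting_depth(text):
    """Return the deepest parenthesis nesting in text, or None if unbalanced."""
    depth = 0
    deepest = 0
    for ch in text:
        if ch == "(":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # A ")" appeared before its matching "(".
                return None
    # Any leftover open parenthesis also means the input is unbalanced.
    return deepest if depth == 0 else None


print(max_nesting_depth("(1 + (2 * (3 - 4)))"))  # 3
print(max_nesting_depth("((((x))))"))            # 4
print(max_nesting_depth("(()"))                  # None
```

The counter needs no advance knowledge of the maximum depth, which is exactly what the per-level patterns above cannot avoid.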

Examples of inherently recursive structures:

  • Mathematical expressions: (1 + (2 * (3 - 4)))
  • Nested JSON or XML
  • Template literals with nested interpolation
  • SQL queries with subqueries
  • Programming language syntax (any production-quality use)

For these, use a parser combinator (e.g. parsimmon in JS, pyparsing in Python), a PEG parser (e.g. peggy, Python's parsimonious), or a full parser generator like ANTLR or tree-sitter.
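
To give a feel for what those tools do internally, here is a tiny hand-written recursive-descent sketch. It is not one of the libraries named above, and parse_nested is a hypothetical name; it only turns balanced parentheses into nested Python lists, which is the kind of tree a regex match cannot hand back.

```python
def parse_nested(text):
    """Parse balanced parentheses into nested lists of stripped text chunks."""
    pos = 0

    def parse_group():
        nonlocal pos
        items = []
        while pos < len(text):
            ch = text[pos]
            if ch == "(":
                pos += 1                      # consume "("
                items.append(parse_group())   # recurse for the inner group
                if pos >= len(text) or text[pos] != ")":
                    raise ValueError("missing ')'")
                pos += 1                      # consume ")"
            elif ch == ")":
                return items                  # the caller consumes the ")"
            else:
                start = pos
                while pos < len(text) and text[pos] not in "()":
                    pos += 1
                chunk = text[start:pos].strip()
                if chunk:
                    items.append(chunk)
        return items

    result = parse_group()
    if pos != len(text):
        # We stopped early on a ")" that nothing opened.
        raise ValueError("unexpected ')'")
    return result


print(parse_nested("(1 + (2 * (3 - 4)))"))
# [['1 +', ['2 *', ['3 - 4']]]]
```

Each level of nesting in the input becomes one level of list nesting in the result, so later code can walk the structure instead of re-scanning the string.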

When string methods are clearer

A short fixed string at a known position does not need a pattern:

// Checking a fixed prefix
// Regex version — works but communicates nothing about intent
if (/^https:\/\//.test(url)) { ... }

// String method — immediately readable
if (url.startsWith("https://")) { ... }

// Checking a fixed suffix
if (filename.endsWith(".min.js")) { ... }

// Finding a fixed substring
if (message.includes("ERROR")) { ... }

A good rule of thumb: if you can express the condition in plain English without using the word "pattern", a string method is likely clearer. "Starts with HTTPS" → startsWith. "Contains the word ERROR" → includes. "Is exactly 'admin'" → === 'admin'.

The regex version is not wrong, but it imposes a mental parsing step on every reader. Reserve patterns for genuine pattern matching.
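
The same holds in Python, where the built-in string methods map directly onto the plain-English conditions. A short sketch (the function names are illustrative only):

```python
def is_https(url):
    # "Starts with https://"
    return url.startswith("https://")


def is_minified_js(filename):
    # "Ends with .min.js"
    return filename.endswith(".min.js")


def mentions_error(message):
    # "Contains the word ERROR"
    return "ERROR" in message


def is_script(filename):
    # endswith also accepts a tuple of suffixes, covering "one of several endings".
    return filename.endswith((".js", ".mjs"))


print(is_https("https://example.com"))      # True
print(is_minified_js("app.min.js"))         # True
print(mentions_error("2024 ERROR: disk"))   # True
print(is_script("module.mjs"))              # True
```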

The write-only regex problem

A complex pattern is easy to write and hard to read six months later:

// What does this do?
/^(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)$/

(It validates an IPv4 address.) Without context, this is opaque to most readers. The more complex your pattern, the higher the maintenance cost — future developers (including you) will be reluctant to modify it.

Solutions:

  1. Break it into named pieces (when using it in code):
const OCTET = "(?:25[0-5]|2[0-4]\\d|[01]?\\d\\d?)";
const IPv4_RE = new RegExp(`^(?:${OCTET}\\.){3}${OCTET}$`);
  2. Use verbose mode ((?x) in PCRE/Python, re.VERBOSE flag) to add inline whitespace and comments:
import re

ipv4 = re.compile(r"""
    ^                       # start of string
    (?:
        (?:25[0-5]          # 250-255
        | 2[0-4]\d          # 200-249
        | [01]?\d\d?        # 0-199
        )
        \.                  # dot separator
    ){3}                    # first three octets
    (?:25[0-5] | 2[0-4]\d | [01]?\d\d?)   # fourth octet
    $                       # end of string
""", re.VERBOSE)

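In verbose mode, whitespace in the pattern is ignored (except inside a character class or when escaped with a backslash), and an unescaped # outside a character class starts a comment that runs to the end of the line. The pattern compiles to the same matcher; only the source representation changes.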
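
One way to convince yourself of that equivalence is to run the compact and verbose forms side by side. The sketch below assumes a small, hand-picked list of sample strings (SAMPLES is made up for illustration) and simply checks that both patterns agree on each one.

```python
import re

COMPACT = re.compile(
    r"^(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)$"
)

VERBOSE = re.compile(r"""
    ^                       # start of string
    (?:
        (?:25[0-5]          # 250-255
        | 2[0-4]\d          # 200-249
        | [01]?\d\d?        # 0-199
        )
        \.                  # dot separator
    ){3}                    # first three octets
    (?:25[0-5] | 2[0-4]\d | [01]?\d\d?)   # fourth octet
    $                       # end of string
""", re.VERBOSE)

# Assumed sample inputs: three valid addresses, four invalid ones.
SAMPLES = [
    "192.168.0.1", "255.255.255.255", "0.0.0.0",
    "256.1.1.1", "1.2.3", "1.2.3.4.5", "a.b.c.d",
]

results = {}
for text in SAMPLES:
    compact_ok = COMPACT.match(text) is not None
    verbose_ok = VERBOSE.match(text) is not None
    assert compact_ok == verbose_ok, text   # the two forms must agree
    results[text] = verbose_ok

for text, ok in results.items():
    print(f"{text:18} {'valid' if ok else 'invalid'}")
```

A check like this is cheap to keep as a unit test whenever you convert an existing one-liner to verbose form.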

JavaScript does not have a verbose mode flag, but you can use String.raw and variable composition to achieve a similar effect:

const OCTET   = String.raw`(?:25[0-5]|2[0-4]\d|[01]?\d\d?)`;
const IPv4_RE = new RegExp(
  `^(?:${OCTET}\\.){3}${OCTET}$`
  // ^ start    three octets + dots    fourth octet  end $
);

A decision checklist

Before reaching for regex, ask:

Question | If yes →
Does the format have a grammar (HTML, JSON, SQL, …)? | Use the format's parser
Can the structure be arbitrarily nested? | Use a parser combinator or generator
Is it a fixed prefix, suffix, or exact string? | Use startsWith, endsWith, includes, ===
Will the pattern be longer than ~40 characters? | Use named variables or verbose mode
Is the pattern running against untrusted user input? | Check for backtracking risk first
Is it a standard format (email, URL, phone, date)? | Use a battle-tested library, not a custom pattern

Regex is excellent for pattern detection in text that doesn't have its own parser: log lines, product codes, free-form addresses, search filters. It is the wrong tool for anything with a recursive or context-sensitive grammar.
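
As an illustration of the "standard format" row, Python's standard library already validates two formats that often attract hand-rolled patterns. This is a sketch under the assumption that the stdlib's notion of validity is what you want; is_valid_ipv4 and parse_iso_date are hypothetical wrapper names.

```python
import ipaddress
from datetime import date


def is_valid_ipv4(text):
    """True if the standard library accepts text as an IPv4 address."""
    try:
        ipaddress.IPv4Address(text)
    except ValueError:  # AddressValueError is a subclass of ValueError
        return False
    return True


def parse_iso_date(text):
    """Return a date for YYYY-MM-DD input, or None if it is not a real date."""
    try:
        return date.fromisoformat(text)
    except ValueError:
        return None


print(is_valid_ipv4("192.168.0.1"))   # True
print(is_valid_ipv4("256.1.1.1"))     # False
print(parse_iso_date("2024-02-29"))   # 2024-02-29
print(parse_iso_date("2024-02-30"))   # None
```

The date case shows why a library beats a pattern here: a regex can check the shape of 2024-02-30, but knowing that February has no 30th day takes calendar logic.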

Where to go next

The Real-world applications lab gives you four exercises on realistic, messy text: log parsing for errors, URL extraction from HTML snippets, price normalisation, and pattern decomposition. Apply the full toolkit — and the restraint you just learned.
