Parsing log files

Build a complete Apache/Nginx access log parser step by step using named groups, then extract multiple fields with matchAll.

Log files are one of the most common places developers reach for regex. A single day of web traffic can produce millions of lines, each in a well-known but not-quite-CSV format. Parsing them with split is fragile; a dedicated log parser is heavy. A well-crafted regex with named groups threads the needle: precise, readable, and fast enough for batch processing.

The access log format

A standard Apache/Nginx combined log line looks like this:

192.168.1.42 - frank [12/Jun/2024:13:45:22 +0000] "GET /api/users HTTP/1.1" 200 1523 "https://example.com/dashboard" "Mozilla/5.0"

The fields in order:

Position	Field	Example
1	Client IP	`192.168.1.42`
2	Ident (almost always `-`)	`-`
3	Auth user (`-` if none)	`frank`
4	Timestamp in brackets	`[12/Jun/2024:13:45:22 +0000]`
5	Request line in quotes	`"GET /api/users HTTP/1.1"`
6	Status code	`200`
7	Response bytes	`1523`
8	Referrer in quotes	`"https://example.com/dashboard"`
9	User agent in quotes	`"Mozilla/5.0"`

Building the pattern piece by piece

Step 1 — IP address

An IPv4 address is four groups of 1–3 digits separated by dots:

const ipPattern = /(?<ip>\d{1,3}(?:\.\d{1,3}){3})/;

The named group (?<ip>…) lets you access match.groups.ip later.
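
As a quick check, the sketch below applies the pattern to the sample line and reads the capture by name. The variable names sampleLine, ipMatch, and shapeOnly are illustrative only. It also shows that the pattern checks the shape of an address, not the range of each octet.

```javascript
const ipPattern = /(?<ip>\d{1,3}(?:\.\d{1,3}){3})/;

// Sample line from the format section (illustrative variable name).
const sampleLine =
  '192.168.1.42 - frank [12/Jun/2024:13:45:22 +0000] "GET /api/users HTTP/1.1" 200 1523 "https://example.com/dashboard" "Mozilla/5.0"';

const ipMatch = sampleLine.match(ipPattern);
// match() returns null when nothing matches, so guard before reading groups.
const ip = ipMatch ? ipMatch.groups.ip : null;
console.log(ip); // 192.168.1.42

// Shape only: out-of-range octets still match.
const shapeOnly = ipPattern.test("999.999.999.999");
console.log(shapeOnly); // true
```

If strict validation matters, it is usually simpler to check the octet ranges in code after the match than to encode them in the regex.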

Step 2 — HTTP method and path

The request line is quoted: "METHOD /path HTTP/version". We want the method and the path:

const requestPattern = /"(?<method>[A-Z]+)\s+(?<path>[^\s"]+)\s+HTTP\/[\d.]+"/;

[A-Z]+ matches the verb (GET, POST, etc.). [^\s"]+ matches the path — "anything that isn't whitespace or a quote".
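
A small sketch of the request pattern in isolation. The input strings are made up for illustration. Because the path class only excludes whitespace and quotes, a query string stays part of the path group, and a request field that lacks the METHOD /path HTTP/version shape (here a bare "-") does not match at all.

```javascript
const requestPattern = /"(?<method>[A-Z]+)\s+(?<path>[^\s"]+)\s+HTTP\/[\d.]+"/;

// Illustrative input: the query string is captured as part of the path.
const requestMatch = '"GET /api/users?page=2 HTTP/1.1"'.match(requestPattern);
const { method, path } = requestMatch.groups;
console.log(method, path); // GET /api/users?page=2

// A request field without the expected shape yields null, not a partial match.
const malformed = '"-" 408 -'.match(requestPattern);
console.log(malformed); // null
```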

Step 3 — status code and response bytes

The status code is always three digits. The byte count follows after whitespace and is either an integer or a -:

const statusPattern = /(?<status>\d{3})\s+(?<bytes>\d+|-)/;

The |- handles the - that some servers write when no bytes were transferred.
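
Captured groups are always strings, so downstream code usually converts them. The sketch below assumes that treating - as 0 bytes is acceptable for your use case; parseStatusAndBytes is a hypothetical helper name, and the inputs are isolated fragments rather than full log lines.

```javascript
const statusPattern = /(?<status>\d{3})\s+(?<bytes>\d+|-)/;

// Hypothetical helper: convert both captures to numbers, mapping "-" to 0.
function parseStatusAndBytes(fragment) {
  const m = fragment.match(statusPattern);
  if (!m) return null;
  return {
    status: Number(m.groups.status),
    bytes: m.groups.bytes === "-" ? 0 : Number(m.groups.bytes),
  };
}

console.log(parseStatusAndBytes("200 1523")); // { status: 200, bytes: 1523 }
console.log(parseStatusAndBytes("304 -"));    // { status: 304, bytes: 0 }
```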

Step 4 — referrer

The referrer is quoted (and may be - inside the quotes if there is none):

const referrerPattern = /"(?<referrer>[^"]*)"/;

[^"]* matches anything that isn't a quote — no risk of the pattern crossing into the user agent field.
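
One caveat is worth seeing in code: applied to a whole line on its own, this pattern finds the first quoted field, which is the request line, not the referrer. The sketch below shows that, then anchors the pattern after the status and bytes fields. The anchored variant is an illustration, not part of the lesson's final pattern.

```javascript
const referrerPattern = /"(?<referrer>[^"]*)"/;

const sampleLine =
  '10.0.0.5 - - [12/Jun/2024:13:45:23 +0000] "POST /api/login HTTP/1.1" 401 89 "-" "curl/7.81.0"';

// Used alone, the pattern captures the FIRST quoted field: the request line.
const first = sampleLine.match(referrerPattern).groups.referrer;
console.log(first); // POST /api/login HTTP/1.1

// Anchoring it after status and bytes selects the referrer instead.
const anchored = /\d{3}\s+(?:\d+|-)\s+"(?<referrer>[^"]*)"/;
const referrer = sampleLine.match(anchored).groups.referrer;
console.log(referrer); // -
```

This is why the referrer piece only becomes reliable once it sits at its fixed position in the composed pattern.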

Step 5 — composing the full pattern

Combine the pieces, accounting for the fixed-format fields in between:

const LOG_PATTERN = new RegExp(
  "(?<ip>\\d{1,3}(?:\\.\\d{1,3}){3})" +  // IP
  "\\s+\\S+\\s+\\S+\\s+" +               // ident, auth (skip)
  "\\[[^\\]]+\\]\\s+" +                   // timestamp (skip)
  '"(?<method>[A-Z]+)\\s+' +             // method
  '(?<path>[^\\s"]+)\\s+HTTP\\/[\\d.]+"\\s+' + // path
  "(?<status>\\d{3})\\s+" +              // status
  '(?<bytes>\\d+|-)\\s+' +              // bytes
  '"(?<referrer>[^"]*)"',               // referrer
  "g"
);
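
When only one line is parsed at a time, the same pattern can be written as a regex literal, which avoids the doubled backslashes. The sketch below drops the g flag for that reason; LINE_PATTERN and parseLine are hypothetical names introduced here.

```javascript
// Same pattern as a regex literal: single backslashes, and no g flag
// because each call inspects exactly one line.
const LINE_PATTERN =
  /(?<ip>\d{1,3}(?:\.\d{1,3}){3})\s+\S+\s+\S+\s+\[[^\]]+\]\s+"(?<method>[A-Z]+)\s+(?<path>[^\s"]+)\s+HTTP\/[\d.]+"\s+(?<status>\d{3})\s+(?<bytes>\d+|-)\s+"(?<referrer>[^"]*)"/;

// Hypothetical helper: returns a plain object of the named groups,
// or null for a line that does not match.
function parseLine(line) {
  const m = LINE_PATTERN.exec(line);
  return m ? { ...m.groups } : null;
}

const entry = parseLine(
  '192.168.1.42 - frank [12/Jun/2024:13:45:22 +0000] "GET /api/users HTTP/1.1" 200 1523 "https://example.com/dashboard" "Mozilla/5.0"'
);
console.log(entry.method, entry.path, entry.status); // GET /api/users 200
console.log(parseLine("not a log line")); // null
```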

Putting it together with matchAll

The String.prototype.matchAll method returns an iterator of all matches, each with a groups object, which suits processing a batch of lines. The regex must carry the g flag; matchAll throws a TypeError without it:


// Build the combined log pattern by string concatenation
const LOG_PATTERN = new RegExp(
  "(?<ip>\\d{1,3}(?:\\.\\d{1,3}){3})" +
  "\\s+\\S+\\s+\\S+\\s+" +
  "\\[[^\\]]+\\]\\s+" +
  '"(?<method>[A-Z]+)\\s+' +
  '(?<path>[^\\s"]+)\\s+HTTP\\/[\\d.]+"\\s+' +
  "(?<status>\\d{3})\\s+" +
  "(?<bytes>\\d+|-)\\s+" +
  '"(?<referrer>[^"]*)"',
  "g"
);

const logs = [
  '192.168.1.42 - frank [12/Jun/2024:13:45:22 +0000] "GET /api/users HTTP/1.1" 200 1523 "https://example.com/dashboard" "Mozilla/5.0"',
  '10.0.0.5 - - [12/Jun/2024:13:45:23 +0000] "POST /api/login HTTP/1.1" 401 89 "-" "curl/7.81.0"',
  '203.0.113.7 - - [12/Jun/2024:13:45:24 +0000] "GET /static/app.js HTTP/1.1" 304 0 "https://example.com/" "Mozilla/5.0"',
  '192.168.1.42 - frank [12/Jun/2024:13:45:25 +0000] "DELETE /api/users/42 HTTP/1.1" 403 55 "https://example.com/admin" "Mozilla/5.0"',
  '10.0.0.99 - - [12/Jun/2024:13:45:26 +0000] "GET /api/health HTTP/1.1" 200 18 "-" "healthcheck/1.0"',
].join("\n");

for (const m of logs.matchAll(LOG_PATTERN)) {
  const { ip, method, path, status, bytes } = m.groups;
  const line = status + " " + method.padEnd(6) + " " + path.padEnd(25) + " [" + ip + "] " + bytes + "b";
  console.log(line);
}

Filtering for errors

With structured matches, filtering for 4xx and 5xx errors is straightforward:
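
One way to do it, as a minimal sketch that reuses LOG_PATTERN and four of the sample lines from the previous example: convert each status to a number and keep the entries at 400 or above. The entry shape and the output format are illustrative choices.

```javascript
const LOG_PATTERN = new RegExp(
  "(?<ip>\\d{1,3}(?:\\.\\d{1,3}){3})" +
  "\\s+\\S+\\s+\\S+\\s+" +
  "\\[[^\\]]+\\]\\s+" +
  '"(?<method>[A-Z]+)\\s+' +
  '(?<path>[^\\s"]+)\\s+HTTP\\/[\\d.]+"\\s+' +
  "(?<status>\\d{3})\\s+" +
  "(?<bytes>\\d+|-)\\s+" +
  '"(?<referrer>[^"]*)"',
  "g"
);

const logs = [
  '192.168.1.42 - frank [12/Jun/2024:13:45:22 +0000] "GET /api/users HTTP/1.1" 200 1523 "https://example.com/dashboard" "Mozilla/5.0"',
  '10.0.0.5 - - [12/Jun/2024:13:45:23 +0000] "POST /api/login HTTP/1.1" 401 89 "-" "curl/7.81.0"',
  '192.168.1.42 - frank [12/Jun/2024:13:45:25 +0000] "DELETE /api/users/42 HTTP/1.1" 403 55 "https://example.com/admin" "Mozilla/5.0"',
  '10.0.0.99 - - [12/Jun/2024:13:45:26 +0000] "GET /api/health HTTP/1.1" 200 18 "-" "healthcheck/1.0"',
].join("\n");

// Copy the named groups into plain objects, convert status to a number,
// then keep only 4xx and 5xx responses.
const errors = [...logs.matchAll(LOG_PATTERN)]
  .map((m) => ({ ...m.groups, status: Number(m.groups.status) }))
  .filter((entry) => entry.status >= 400);

for (const { status, method, path, ip } of errors) {
  console.log(status + " " + method + " " + path + " from " + ip);
}
// 401 POST /api/login from 10.0.0.5
// 403 DELETE /api/users/42 from 192.168.1.42
```

Converting the status to a number first keeps the comparison numeric; comparing the raw strings would also work for three-digit codes, but it is easier to get wrong.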


Tips for production log parsing

Compile once — if you process many lines in a loop, compile the regex with new RegExp(…) or a literal outside the loop. Re-compilation on every iteration is a hidden performance cost.
Validate field widths — \d{3} for a status code is safer than \d+ because it rejects malformed lines early rather than producing a surprising match.
Handle the - placeholder — log formats use - for missing optional fields (no referrer, no auth user). Build the |- alternative into fields that can be absent.
Test against real samples — copy 10–20 real lines from your actual log into regex101 before deploying a parser. Edge cases (unusual user agents, paths with spaces encoded as %20, IPv6 addresses) can break a pattern that looks complete on toy examples.
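
Several of these tips can be combined in a small harness. The sketch below compiles the pattern once, parses line by line, and collects every line that fails to match so it can be reviewed. The leading ^ anchor, the missing g flag, and the parseAll helper are additions made for this illustration, and the IPv6 line is a made-up sample.

```javascript
// Compiled once, outside the loop. No g flag, so exec() keeps no lastIndex
// state between lines. The ^ anchor (an addition here) forces the match to
// start at the beginning of the line.
const LINE_PATTERN =
  /^(?<ip>\d{1,3}(?:\.\d{1,3}){3})\s+\S+\s+\S+\s+\[[^\]]+\]\s+"(?<method>[A-Z]+)\s+(?<path>[^\s"]+)\s+HTTP\/[\d.]+"\s+(?<status>\d{3})\s+(?<bytes>\d+|-)\s+"(?<referrer>[^"]*)"/;

// Hypothetical helper: separate parsed entries from lines needing review.
function parseAll(lines) {
  const parsed = [];
  const rejected = [];
  for (const line of lines) {
    const m = LINE_PATTERN.exec(line);
    if (m) parsed.push({ ...m.groups });
    else rejected.push(line);
  }
  return { parsed, rejected };
}

const result = parseAll([
  '10.0.0.99 - - [12/Jun/2024:13:45:26 +0000] "GET /api/health HTTP/1.1" 200 18 "-" "healthcheck/1.0"',
  // Illustrative IPv6 client: the IPv4-only ip group rejects this line.
  '2001:db8::1 - - [12/Jun/2024:13:45:27 +0000] "GET /api/health HTTP/1.1" 200 18 "-" "healthcheck/1.0"',
]);
console.log(result.parsed.length, result.rejected.length); // 1 1
```

Counting rejected lines on a real sample gives a quick measure of how much traffic the pattern silently drops.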

Where to go next

The next lesson, Data extraction pipelines, applies similar techniques to unstructured product data and addresses, showing how to compose patterns into a multi-step extraction workflow.
