Encoding categoricals

Convert string category columns to numbers with pd.get_dummies() for nominal data and manual ordinal maps for ordered categories.

Two kinds of categorical data require different encoding strategies. Nominal categories have no natural order: shirt colours, country names, product types. Ordinal categories have a meaningful order: small/medium/large, low/mid/high satisfaction, education level. The wrong encoding can lead a model to assume an order where there is none, or throw away an order that matters.

One-hot encoding — for nominal categories

One-hot encoding creates one binary column per category value. If colour has three values — red, blue, green — one-hot produces three columns: colour_red, colour_blue, colour_green. Each row has a 1 in the column for its colour and 0 in the others.
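
As a minimal sketch of the idea, the snippet below one-hot encodes a small made-up table. The DataFrame contents are illustrative assumptions; dtype=int is passed so the output shows 0 and 1 rather than the booleans that recent pandas versions return by default.

```python
import pandas as pd

# Hypothetical example data: one nominal column with three distinct values.
df = pd.DataFrame({"colour": ["red", "blue", "green", "blue"]})

# One binary column per distinct value, each named with the "colour" prefix.
encoded = pd.get_dummies(df, columns=["colour"], prefix="colour", dtype=int)

print(encoded)
```

The new columns come out in sorted order of the category values (colour_blue, colour_green, colour_red), and each row contains exactly one 1.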

pd.get_dummies() handles any number of distinct values automatically. prefix= gives the new columns a consistent name prefix, which makes them easy to identify later.

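One practical concern: if the column has k distinct values, you get k columns, and they are linearly dependent: for every row the one-hot columns sum to 1, so any one column can be computed from the others. Tree-based and regularised models generally handle this fine. An unregularised linear model with an intercept needs only k-1 columns, because the redundant column makes its coefficients non-unique. You can drop one with drop_first=True if needed.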
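
To make the redundancy concrete, this sketch compares the full encoding with the drop_first=True variant. The data is made up for illustration.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"colour": ["red", "blue", "green", "blue"]})

# Full encoding: k = 3 columns whose row sums are always 1.
full = pd.get_dummies(df, columns=["colour"], dtype=int)

# Reduced encoding: k - 1 = 2 columns; the first category is dropped.
reduced = pd.get_dummies(df, columns=["colour"], dtype=int, drop_first=True)

print(full.sum(axis=1).tolist())
print(list(reduced.columns))
```

With drop_first=True, the dropped category (here blue, the first in sorted order) is represented by a row of all zeros in the remaining columns.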

Ordinal encoding — for ordered categories

When the categories have a natural order, encode them as integers that preserve that order. A dict map is the cleanest approach:
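
A minimal sketch of such a map, assuming a hypothetical size column and made-up weights:

```python
import pandas as pd

# Hypothetical example data: an ordered category and a numeric feature.
df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],
    "weight_g": [120, 310, 200, 115],
})

# Explicit mapping from category to rank; the integers preserve the order.
size_order = {"small": 1, "medium": 2, "large": 3}

# Series.map looks each value up in the dict.
df["size_code"] = df["size"].map(size_order)

print(df)
```

Any value missing from the dict becomes NaN after map(), so it is worth checking df["size_code"].isna().sum() after encoding real data.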

A model that sees size_code as 1, 2, 3 can learn that larger sizes tend to weigh more — because the numeric order matches the real-world order. One-hot encoding for size would have discarded that ordering.

A third technique, target encoding, replaces each category with the mean of the target variable for that category. It can be powerful for high-cardinality columns (hundreds of unique values), but it requires care: compute the per-category means on the training set only, then apply those same means to the test set. Otherwise information about the test targets leaks into the features (data leakage).
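
The train-only rule can be sketched as follows. The column names, the data, and the fallback to the global training mean for categories unseen in training are all assumptions made for illustration, not part of the technique's definition.

```python
import pandas as pd

# Hypothetical training and test sets.
train = pd.DataFrame({
    "city": ["A", "A", "B", "B", "C"],
    "target": [1, 0, 1, 1, 0],
})
test = pd.DataFrame({"city": ["A", "B", "D"]})

# Per-category target means, computed from the training set only.
category_means = train.groupby("city")["target"].mean()

# Assumed fallback for categories that never appear in training (here "D").
global_mean = train["target"].mean()

# Apply the same training-set means to both sets.
train["city_te"] = train["city"].map(category_means)
test["city_te"] = test["city"].map(category_means).fillna(global_mean)

print(test)
```

This plain version still lets each training row see its own target value, so in practice it is often combined with smoothing or out-of-fold means.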

Choosing the encoding

Category type	Encoding	Why
Nominal (unordered)	One-hot	No numeric order should be implied
Ordinal (ordered)	Integer map	Preserves the rank relationship
High-cardinality nominal	Target encoding (advanced)	One-hot with 500 columns is impractical

Where to go next

Next: scaling and normalisation. Why numeric features need to be on comparable scales for some models, and the difference between min-max scaling and z-score standardisation.
