Encoding categoricals
Convert string category columns to numbers with pd.get_dummies() for nominal data and manual ordinal maps for ordered categories.
- Apply one-hot encoding with pd.get_dummies()
- Apply ordinal encoding using a Python dict map
- Explain when each encoding is appropriate
Two kinds of categorical data require different encoding strategies. Nominal categories have no natural order: shirt colours, country names, product types. Ordinal categories have a meaningful order: small/medium/large, low/mid/high satisfaction, education level. Using the wrong encoding can mislead a model into assuming order where there is none, or discarding order where it matters.
One-hot encoding — for nominal categories
One-hot encoding creates one binary column per category value. If colour has
three values — red, blue, green — one-hot produces three columns:
colour_red, colour_blue, colour_green. Each row has a 1 in the column
for its colour and 0 in the others.
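pd.get_dummies() handles any number of distinct values automatically.
prefix= gives the new columns a consistent name prefix, which makes them easy
to identify later. In pandas 2.0 and later the new columns hold True/False by
default; pass dtype=int if you want 1/0 integers.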
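A minimal sketch of the colour example, assuming a small made-up DataFrame (the data and the variable names `df` and `encoded` are illustrative, not from a real dataset):

```python
import pandas as pd

# Hypothetical toy data; the column name "colour" follows the example above.
df = pd.DataFrame({"colour": ["red", "blue", "green", "blue"]})

# Replace the "colour" column with one binary column per distinct value.
# prefix="colour" names the new columns colour_<value>;
# dtype=int requests 0/1 integers instead of True/False.
encoded = pd.get_dummies(df, columns=["colour"], prefix="colour", dtype=int)

print(encoded)
```

The new columns come out in sorted order of the category values (blue, green, red), and each row contains exactly one 1.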
One practical concern: a column with k distinct values produces k columns, and
those columns are perfectly collinear: for every row they sum to 1, so any one
of them can be computed from the others. Tree-based and regularised models
usually handle this fine; an unregularised linear model with an intercept only
needs k-1 columns, and the redundant column makes its coefficients non-unique.
You can drop one with drop_first=True if needed.
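To make the collinearity concrete, here is a small sketch on the same made-up colour data (the names `full`, `reduced` and `row_sums` are illustrative). It compares the full encoding with the `drop_first=True` version:

```python
import pandas as pd

# Hypothetical toy data, as in the colour example above.
df = pd.DataFrame({"colour": ["red", "blue", "green", "blue"]})

# Full encoding: k = 3 columns.
full = pd.get_dummies(df["colour"], prefix="colour", dtype=int)

# Reduced encoding: k - 1 = 2 columns.
reduced = pd.get_dummies(df["colour"], prefix="colour", dtype=int, drop_first=True)

# Every row of the full encoding sums to 1, which is the collinearity in question.
row_sums = full.sum(axis=1)

print(list(full.columns))
print(list(reduced.columns))
```

With `drop_first=True`, pandas drops the first category in sorted order, here `blue`. A row of all zeros in `reduced` therefore means blue, so no information is lost.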
Ordinal encoding — for ordered categories
When the categories have a natural order, encode them as integers that preserve that order. A dict map is the cleanest approach:
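A minimal sketch of such a map, assuming a hypothetical `size` column with the values small, medium and large. The names `size_order` and `size_code` and the codes 1, 2, 3 are illustrative choices:

```python
import pandas as pd

# Hypothetical toy data with an ordered category.
df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# The dict spells out the order explicitly: small < medium < large.
size_order = {"small": 1, "medium": 2, "large": 3}

# Series.map looks each value up in the dict.
df["size_code"] = df["size"].map(size_order)

# Values missing from the dict (typos, new categories) become NaN,
# so it is worth checking for them before training.
unmapped = df["size_code"].isna().sum()

print(df)
```

Writing the order out by hand matters: an automatic encoder that sorts the labels alphabetically would order them large, medium, small.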
A model that sees size_code as 1, 2, 3 can learn that larger sizes tend to
weigh more — because the numeric order matches the real-world order. One-hot
encoding for size would have discarded that ordering.
A third technique, target encoding, replaces each category with the mean of the target variable for that category. It can be powerful for high-cardinality columns (hundreds of unique values), but it requires care: you must compute the means on the training set only and then apply those same means to the test set. Computing them on the full dataset leaks target information from the test rows into the features.
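The sketch below shows the train-only rule on made-up data. The column names `city` and `price` and the variables `city_means` and `global_mean` are hypothetical. Falling back to the overall training mean for categories that never appeared in training is an assumption of this sketch, not the only option:

```python
import pandas as pd

# Hypothetical training data: "city" is the category, "price" is the target.
train = pd.DataFrame({
    "city": ["A", "A", "B", "B", "C"],
    "price": [10.0, 20.0, 30.0, 50.0, 40.0],
})

# Hypothetical test data; city "D" never appears in training.
test = pd.DataFrame({"city": ["A", "B", "D"]})

# Per-category target means, computed from the training rows only.
city_means = train.groupby("city")["price"].mean()

# Assumed fallback for unseen categories: the overall training mean.
global_mean = train["price"].mean()

# Apply the same training-derived means to both sets.
train["city_te"] = train["city"].map(city_means)
test["city_te"] = test["city"].map(city_means).fillna(global_mean)

print(test)
```

This plain version still lets each training row see its own target value through the category mean, which can overfit for rare categories. More careful variants add smoothing or compute the means out-of-fold.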
Choosing the encoding
| Category type | Encoding | Why |
|---|---|---|
| Nominal (unordered) | One-hot | No numeric order should be implied |
| Ordinal (ordered) | Integer map | Preserves the rank relationship |
| High-cardinality nominal | Target encoding (advanced) | One-hot with 500 columns is impractical |
Where to go next
Next: scaling and normalisation, which covers why numeric features need to be on comparable scales for some models, and the difference between min-max scaling and z-score standardisation.