Code of the Day

Scaling and normalisation

Min-max scaling and z-score standardisation — what each does, which models need scaling, and why the scaler must be fit on training data only.

Data Science | Intermediate | 6 min read
By the end of this lesson you will be able to:
  • Explain what min-max scaling and z-score standardisation each produce
  • Identify which types of model are sensitive to feature scale
  • State why the scaler must be fit on training data and not on the full dataset

A DataFrame with columns age (range 18–80) and income (range 20,000–200,000) has a problem. The income values are thousands of times larger than the age values: the income range (180,000) is roughly 3,000 times wider than the age range (62). That does not mean income is more important; it only means the two columns sit on very different scales. Some models handle this fine. Others are misled: they treat income as dominant purely because its numbers are bigger.

Min-max scaling

Min-max scaling maps every value to the range [0, 1]:

x_scaled = (x - x_min) / (x_max - x_min)

The smallest value becomes 0, the largest becomes 1, and everything else is proportional in between. The distribution's shape is preserved — if the original data was right-skewed, the scaled version is still right-skewed.
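The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the function name min_max_scale and the sample ages are made up for the example, and mapping a constant column to 0.0 is an assumption chosen only to avoid dividing by zero.

```python
def min_max_scale(values):
    """Map values to [0, 1] using (x - x_min) / (x_max - x_min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column has no spread, so map it to 0.0
        # rather than dividing by zero.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


ages = [18, 25, 40, 62, 80]  # made-up sample values
scaled_ages = min_max_scale(ages)

print(scaled_ages)  # first value is 0.0, last value is 1.0
```

The order of the values is unchanged, and the gaps between them keep the same proportions, which is why the shape of the distribution survives.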

Min-max scaling is sensitive to outliers. A single extreme value shifts the minimum or maximum, which compresses every other value into a narrow band. If your data has outliers, z-score standardisation (less affected, although the mean and standard deviation still shift) or robust scaling (using the median and percentiles instead of min/max) may be more appropriate.
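To make the compression effect concrete, here is a small sketch that scales the same made-up incomes with and without one extreme value. The numbers and the helper name min_max_scale are illustrative assumptions.

```python
def min_max_scale(values):
    """Map values to [0, 1] using (x - x_min) / (x_max - x_min)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]


incomes = [20_000, 35_000, 50_000, 65_000, 80_000]  # made-up values
with_outlier = incomes + [2_000_000]                # one extreme earner added

clean = min_max_scale(incomes)
squeezed = min_max_scale(with_outlier)

print(clean)         # [0.0, 0.25, 0.5, 0.75, 1.0]
print(squeezed[:5])  # the same five incomes now all sit below 0.05
```

The five original incomes were spread evenly across [0, 1]; after the outlier is added they occupy only a few percent of the range, so differences between them become hard for a model to use.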

Z-score standardisation

Standardisation shifts the data so it has mean 0 and standard deviation 1:

x_std = (x - mean) / std

There is no fixed output range — values can be negative, and outliers still appear as large positive or negative numbers. But the units are now standard deviations from the mean, which is a meaningful scale regardless of the original units. A score of 2.5 means "2.5 standard deviations above the mean" for any feature.
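A minimal sketch of the same idea in plain Python, using the standard library's statistics module. The function name standardise and the sample ages are made up; using the population standard deviation (pstdev) rather than the sample version is an assumption, since the formula above does not say which one is meant.

```python
from statistics import fmean, pstdev


def standardise(values):
    """Return (x - mean) / std for each value."""
    mean = fmean(values)
    std = pstdev(values)  # assumption: population standard deviation
    if std == 0:
        # Assumption: a constant column is mapped to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(x - mean) / std for x in values]


ages = [18, 25, 40, 62, 80]  # made-up sample values
z_scores = standardise(ages)

print(z_scores)                          # negative below the mean, positive above
print(fmean(z_scores), pstdev(z_scores)) # approximately 0 and 1
```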

Which models need scaling?

Distance-based models measure similarity using Euclidean distance or similar metrics. k-Nearest Neighbours compares points by distance directly, and Support Vector Machines rely on distances or dot products between points, depending on the kernel. If one feature spans 0–100,000 and another spans 0–1, the large one dominates every such calculation. These models require scaling.
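The effect on distances can be illustrated with three made-up people described by (age, income). The sketch below assumes the age and income ranges from the introduction (18–80 and 20,000–200,000) as the min-max bounds; the rows and the helper scale_row are hypothetical.

```python
import math

# Assumed bounds, taken from the ranges quoted in the introduction.
AGE_MIN, AGE_MAX = 18, 80
INCOME_MIN, INCOME_MAX = 20_000, 200_000


def scale_row(row):
    """Min-max scale one (age, income) row using the assumed bounds."""
    age, income = row
    return (
        (age - AGE_MIN) / (AGE_MAX - AGE_MIN),
        (income - INCOME_MIN) / (INCOME_MAX - INCOME_MIN),
    )


query = (25, 50_000)
much_older = (60, 51_000)   # 35 years older, almost the same income
same_age = (26, 58_000)     # 1 year older, somewhat higher income

raw_to_older = math.dist(query, much_older)
raw_to_same_age = math.dist(query, same_age)

scaled_to_older = math.dist(scale_row(query), scale_row(much_older))
scaled_to_same_age = math.dist(scale_row(query), scale_row(same_age))

print(raw_to_older < raw_to_same_age)        # True: raw distance is driven by income
print(scaled_to_older > scaled_to_same_age)  # True: after scaling, age counts too
```

On the raw values, the 35-year age gap barely registers next to an 8,000 income gap, so the nearest neighbour is decided almost entirely by income. After scaling, both features contribute and the nearest neighbour changes.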

Gradient-based models (linear regression fit by gradient descent, logistic regression, neural networks) use an optimiser that takes steps along a gradient. Unscaled features produce an elongated loss surface where the optimiser zig-zags rather than converging cleanly. Scaling makes training faster and more stable.
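One rough way to see the elongated loss surface is to compare how steep the loss is along each weight at the starting point. The sketch below assumes a linear model with no intercept, a mean squared error loss, all weights starting at zero, and made-up data; grad_at_zero is a hypothetical helper, not a library function.

```python
def min_max_scale(values):
    """Map values to [0, 1] using (x - x_min) / (x_max - x_min)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]


def grad_at_zero(feature, target):
    """Partial derivative of mean((prediction - y) ** 2) with respect to this
    feature's weight when all weights are 0, which is -2 * mean(x * y)."""
    return -2.0 * sum(x * y for x, y in zip(feature, target)) / len(feature)


ages = [20, 30, 40, 50, 60]                            # made-up values
incomes = [20_000, 50_000, 80_000, 120_000, 200_000]   # made-up values
target = [1.0, 2.0, 3.0, 4.0, 5.0]                     # made-up target

raw_ratio = grad_at_zero(incomes, target) / grad_at_zero(ages, target)
scaled_ratio = (
    grad_at_zero(min_max_scale(incomes), target)
    / grad_at_zero(min_max_scale(ages), target)
)

print(raw_ratio)     # in the thousands: the loss is far steeper along income
print(scaled_ratio)  # close to 1: both directions are comparably steep
```

With raw features, a learning rate small enough to be stable along the income weight makes almost no progress along the age weight. After scaling, one learning rate suits both directions.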

Tree-based models (decision trees, random forests, gradient boosting) split on thresholds. The split age > 35 sends exactly the same rows left and right as the equivalent threshold on age scaled to [0, 1]; only the threshold value changes, not the partition. Trees are not sensitive to feature scale.
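A quick check of that claim, using made-up ages: min-max scaling preserves the order of the values, so a threshold on the raw column and the correspondingly scaled threshold select the same rows.

```python
ages = [18, 22, 35, 36, 47, 80]  # made-up values
lo, hi = min(ages), max(ages)

scaled_ages = [(x - lo) / (hi - lo) for x in ages]

threshold = 35
scaled_threshold = (threshold - lo) / (hi - lo)  # the same cut point, rescaled

raw_split = [x > threshold for x in ages]
scaled_split = [x > scaled_threshold for x in scaled_ages]

print(raw_split)                  # [False, False, False, True, True, True]
print(raw_split == scaled_split)  # True: identical partition of the rows
```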

Model type                                 Needs scaling?
k-NN, SVM                                  Yes
Linear/logistic regression, neural nets    Yes
Decision tree, random forest, XGBoost      No

The data leakage rule

Fitting the scaler means computing the min, max, mean, or standard deviation of the training data. If you fit the scaler on the full dataset (train + test), test data statistics leak into the training process — the model indirectly "sees" information from the test set before evaluation. Always:

  1. Split the data into train and test first.
  2. Fit the scaler on the training set only.
  3. Apply (transform) the fitted scaler to both train and test.

Fitting a scaler on the full dataset before splitting is one of the most common sources of data leakage. It inflates test performance and gives a falsely optimistic picture of how the model will perform on new data.
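The three steps above can be sketched in plain Python. This is an illustration under stated assumptions: the incomes are made up, the split is a simple slice rather than a random split, and fit_standardiser and transform are hypothetical helper names, not a library API.

```python
from statistics import fmean, pstdev


def fit_standardiser(train_values):
    """Learn the scaling statistics (mean and population std) from one set of values."""
    return fmean(train_values), pstdev(train_values)


def transform(values, params):
    """Apply previously learned statistics to any set of values."""
    mean, std = params
    return [(x - mean) / std for x in values]


incomes = [20_000, 30_000, 40_000, 50_000, 60_000, 70_000, 90_000, 200_000]

# 1. Split first (a plain slice here, for simplicity).
train, test = incomes[:6], incomes[6:]

# 2. Fit on the training set only.
params = fit_standardiser(train)

# 3. Transform both splits with the training statistics.
train_z = transform(train, params)
test_z = transform(test, params)

# For contrast only: statistics computed on the full dataset differ,
# because they include information from the test rows.
leaky_params = fit_standardiser(incomes)

print(params)        # mean 45000.0 from the training rows
print(leaky_params)  # mean 70000.0 once the test rows are included
```

The test values do not end up with mean 0 and standard deviation 1, and that is expected: they are measured against the training distribution, which is what the model will face with genuinely new data.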

Where to go next

Next: scaling in practice — applying MinMaxScaler and StandardScaler from scikit-learn, fitting on training data, and transforming both splits.
