Lab: prepare a dataset
Take a raw mixed-type dataset end-to-end through cleaning, encoding, train/test splitting, and scaling — producing a model-ready train/test pair.
- Clean missing values and fix types in a raw dataset
- Apply one-hot and ordinal encoding to categorical columns
- Split into 80/20 train and test sets
- Scale numeric features on train and apply to test
- Inspect final shapes and verify there is no data leakage
This is an optional lab. It introduces no new concepts; it is practice running the full feature engineering pipeline on a dataset you have not seen before. Work through each step, verify the shapes, and confirm your output is model-ready before moving on.
The dataset is a small property listing table with mixed types, missing values,
and two categorical columns that need encoding. Your goal is to produce a clean,
scaled train/test pair suitable for a regression model predicting price.
Step 1 — inspect the raw data
Before cleaning, note that bedrooms has 2 missing values and that condition and
location are stored as strings. The id column is an identifier, not a feature, so it should be dropped.
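The observations above can be checked with a few pandas calls. The sketch below builds a hypothetical stand-in table, since the lab's actual data is not reproduced here: the rows, the area_m2 and price values, and the variable names raw, missing_per_column and string_columns are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the property listing table (values are made up).
raw = pd.DataFrame({
    "id": range(1, 11),
    "bedrooms": [2, 3, np.nan, 4, 1, 3, np.nan, 2, 5, 3],
    "area_m2": [55.0, 80.0, 72.0, 120.0, 38.0, 95.0, 60.0, 50.0, 160.0, 85.0],
    "condition": ["good", "fair", "excellent", "good", "fair",
                  "good", "excellent", "fair", "excellent", "good"],
    "location": ["north", "south", "east", "north", "south",
                 "east", "north", "south", "east", "north"],
    "price": [210000, 250000, 265000, 410000, 150000,
              330000, 240000, 180000, 560000, 300000],
})

# Count missing values in each column.
missing_per_column = raw.isna().sum()

# List the columns that are not numeric and will therefore need encoding.
string_columns = [
    col for col in raw.columns
    if not pd.api.types.is_numeric_dtype(raw[col])
]

print(raw.dtypes)
print(missing_per_column)
print(string_columns)
```

On a real file, raw would come from something like pd.read_csv rather than an inline dictionary.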
Step 2 — clean missing values and drop non-features
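One possible way to do this step is sketched below. It assumes median imputation for bedrooms, which is a choice made here for illustration and not necessarily the lab's own method; dropping the two incomplete rows would also work. The table is a hypothetical stand-in with made-up values.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the raw table (values are made up).
raw = pd.DataFrame({
    "id": range(1, 11),
    "bedrooms": [2, 3, np.nan, 4, 1, 3, np.nan, 2, 5, 3],
    "area_m2": [55.0, 80.0, 72.0, 120.0, 38.0, 95.0, 60.0, 50.0, 160.0, 85.0],
    "price": [210000, 250000, 265000, 410000, 150000,
              330000, 240000, 180000, 560000, 300000],
})

# Drop the identifier column: it carries no predictive information.
clean = raw.drop(columns=["id"])

# Assumption: fill missing bedroom counts with the column median.
median_bedrooms = clean["bedrooms"].median()
clean["bedrooms"] = clean["bedrooms"].fillna(median_bedrooms)

# Fix the type: bedroom counts are whole numbers once no NaN remains.
clean["bedrooms"] = clean["bedrooms"].astype(int)

print(clean.isna().sum())
print(clean.dtypes)
```

Computing the median over the full table lets test rows influence the fill value. A stricter pipeline would split first and compute the median on the training rows only.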
Step 3 — encode categorical columns
condition is ordinal (fair < good < excellent). location is nominal.
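A minimal sketch of both encodings follows, assuming the ordinal codes 1 to 3 and 0/1 indicator columns. The table is a hypothetical stand-in with made-up values, and the names listings, condition_order and encoded are illustrative.

```python
import pandas as pd

# Hypothetical cleaned table (values are made up).
listings = pd.DataFrame({
    "bedrooms": [2, 3, 3, 4, 1, 3],
    "area_m2": [55.0, 80.0, 72.0, 120.0, 38.0, 95.0],
    "condition": ["good", "fair", "excellent", "good", "fair", "excellent"],
    "location": ["north", "south", "east", "north", "south", "east"],
    "price": [210000, 250000, 265000, 410000, 150000, 330000],
})

# Ordinal encoding: the integer codes preserve fair < good < excellent.
condition_order = {"fair": 1, "good": 2, "excellent": 3}
encoded = listings.copy()
encoded["condition"] = encoded["condition"].map(condition_order)

# One-hot encoding: location has no natural order, so each category
# gets its own 0/1 indicator column.
encoded = pd.get_dummies(encoded, columns=["location"], dtype=int)

print(encoded.head())
```

An explicit mapping is used for condition because an automatic encoder would typically order the categories alphabetically (excellent, fair, good), which is not the intended ranking.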
Step 4 — split into train and test
Split 80/20 before fitting any scaler. The id column was already dropped.
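The split can be sketched with scikit-learn's train_test_split. The encoded table below is a hypothetical stand-in with made-up values, and random_state=42 is an arbitrary seed chosen only to make the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical encoded table (values are made up).
encoded = pd.DataFrame({
    "bedrooms": [2, 3, 3, 4, 1, 3, 3, 2, 5, 3],
    "area_m2": [55.0, 80.0, 72.0, 120.0, 38.0, 95.0, 60.0, 50.0, 160.0, 85.0],
    "condition": [2, 1, 3, 2, 1, 2, 3, 1, 3, 2],
    "location_east": [0, 0, 1, 0, 0, 1, 0, 0, 1, 0],
    "location_north": [1, 0, 0, 1, 0, 0, 1, 0, 0, 1],
    "location_south": [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
    "price": [210000, 250000, 265000, 410000, 150000,
              330000, 240000, 180000, 560000, 300000],
})

# Separate the features from the regression target.
X = encoded.drop(columns=["price"])
y = encoded["price"]

# Hold out 20% of the rows as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

With 10 rows this yields 8 training rows and 2 test rows, and no row appears in both sets.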
Step 5 — scale numeric features
Scale bedrooms and area_m2. The encoded columns (0/1 binary and ordinal
1–3) are on reasonable scales already, but numeric columns with large ranges
should be standardised.
The scaler is fitted on X_train[numeric_cols] only. Applying
scaler.transform(X_test[numeric_cols]) reuses the mean and standard deviation
learned from the training set, so no information from the test set influences
the scaling. The train/test boundary has not been crossed.
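The fit-on-train, transform-on-test pattern might look like the sketch below, assuming scikit-learn's StandardScaler. The two frames are hypothetical stand-ins with made-up values, and the names X_train_scaled and X_test_scaled are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical train/test feature frames (values are made up).
X_train = pd.DataFrame({
    "bedrooms": [2, 3, 3, 4, 1, 3, 2, 5],
    "area_m2": [55.0, 80.0, 72.0, 120.0, 38.0, 95.0, 50.0, 160.0],
    "condition": [2, 1, 3, 2, 1, 2, 1, 3],
    "location_east": [0, 0, 1, 0, 0, 1, 0, 1],
    "location_north": [1, 0, 0, 1, 0, 0, 0, 0],
    "location_south": [0, 1, 0, 0, 1, 0, 1, 0],
})
X_test = pd.DataFrame({
    "bedrooms": [3, 3],
    "area_m2": [60.0, 85.0],
    "condition": [3, 2],
    "location_east": [0, 0],
    "location_north": [1, 1],
    "location_south": [0, 0],
})

numeric_cols = ["bedrooms", "area_m2"]
float_types = {col: "float64" for col in numeric_cols}

# Work on copies so the unscaled frames stay available for comparison.
X_train_scaled = X_train.astype(float_types)
X_test_scaled = X_test.astype(float_types)

scaler = StandardScaler()
# Learn the mean and standard deviation from the training rows only.
X_train_scaled[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
# Reuse the training statistics on the test rows; nothing is refitted.
X_test_scaled[numeric_cols] = scaler.transform(X_test[numeric_cols])

# Leakage check: the scaler's means match the training means exactly.
print(np.allclose(scaler.mean_, X_train[numeric_cols].mean().to_numpy()))
print(X_train_scaled.shape, X_test_scaled.shape)
```

The scaled test columns will generally not have mean 0 and standard deviation 1. That is expected, because they are standardised with the training statistics and not their own.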
Done?
You have taken a raw dataset with missing values, string categories, and mixed scales through a complete feature engineering pipeline: clean missing values, encode categoricals (ordinal and one-hot), split 80/20, and scale numerics on the training set only. The result is a pair of DataFrames ready to pass into a scikit-learn estimator. The next tier — Advanced — builds on this foundation to cover model training, evaluation, and iteration.