Lab: prepare a dataset

Take a raw mixed-type dataset end-to-end through cleaning, encoding, train/test splitting, and scaling — producing a model-ready train/test pair.

This is an optional lab. No new concepts — just practice running the full feature engineering pipeline on a dataset you have not seen before. Work through each step, verify the shapes, and confirm your output is model-ready before moving on.

The dataset is a small property listing table with mixed types, missing values, and a categorical column that needs encoding. Your goal is to produce a clean, scaled train/test pair suitable for a regression model predicting price.

Step 1 — inspect the raw data

import pandas as pd

raw = pd.DataFrame({
  "id":        list(range(1, 21)),
  "bedrooms":  [2, 3, None, 4, 2, 3, 1, 3, 2, 4,
                3, 2, 4, 1, 3, 2, 3, 4, None, 2],
  "area_m2":   [65, 90, 78, 120, 55, 85, 42, 95, 70, 130,
                88, 60, 115, 45, 80, 62, 92, 125, 75, 58],
  "condition": ["good","excellent","good","excellent","fair",
                "good","fair","excellent","good","excellent",
                "good","fair","excellent","fair","good",
                "excellent","good","excellent","good","fair"],
  "location":  ["urban","suburban","urban","rural","urban",
                "suburban","urban","suburban","urban","rural",
                "suburban","urban","rural","urban","suburban",
                "urban","suburban","rural","urban","suburban"],
  "price":     [280000, 420000, 330000, 550000, 240000,
                390000, 195000, 445000, 310000, 580000,
                405000, 265000, 520000, 205000, 375000,
                295000, 415000, 560000, 320000, 260000],
})

print("Shape:", raw.shape)
print("\ndtypes:")
print(raw.dtypes)
print("\nMissing values:")
print(raw.isnull().sum())
print("\nFirst 5 rows:")
print(raw.head())

Before cleaning, note that bedrooms has 2 missing values (so pandas stores it as float64) and that condition and location are strings. The id column is an identifier, not a feature, and should be dropped.

Step 2 — clean missing values and drop non-features

import pandas as pd

# Uses the `raw` table built in Step 1
# Drop the id column: it is an identifier, not a feature
df = raw.drop(columns=["id"])

# Fill missing bedrooms with the median (a simple imputation strategy)
median_bedrooms = df["bedrooms"].median()
df["bedrooms"] = df["bedrooms"].fillna(median_bedrooms)

print("Missing values after cleaning:", df.isnull().sum().sum())
print("Bedrooms dtype:", df["bedrooms"].dtype)
print("Shape:", df.shape)

Step 3 — encode categorical columns

condition is ordinal (fair < good < excellent). location is nominal.

import pandas as pd

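# The cleaned table from Step 2, written out in full so this cell runs on its own:
# id is dropped and the two missing bedrooms values are filled with the median (3)
raw = pd.DataFrame({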
  "bedrooms":  [2, 3, 3, 4, 2, 3, 1, 3, 2, 4,
                3, 2, 4, 1, 3, 2, 3, 4, 3, 2],
  "area_m2":   [65, 90, 78, 120, 55, 85, 42, 95, 70, 130,
                88, 60, 115, 45, 80, 62, 92, 125, 75, 58],
  "condition": ["good","excellent","good","excellent","fair",
                "good","fair","excellent","good","excellent",
                "good","fair","excellent","fair","good",
                "excellent","good","excellent","good","fair"],
  "location":  ["urban","suburban","urban","rural","urban",
                "suburban","urban","suburban","urban","rural",
                "suburban","urban","rural","urban","suburban",
                "urban","suburban","rural","urban","suburban"],
  "price":     [280000, 420000, 330000, 550000, 240000,
                390000, 195000, 445000, 310000, 580000,
                405000, 265000, 520000, 205000, 375000,
                295000, 415000, 560000, 320000, 260000],
})

# Ordinal encoding for condition
condition_map = {"fair": 1, "good": 2, "excellent": 3}
df = raw.copy()
df["condition_code"] = df["condition"].map(condition_map)
df = df.drop(columns=["condition"])

# One-hot encoding for location
# dtype=int gives 0/1 columns; newer pandas versions default to True/False
df = pd.get_dummies(df, columns=["location"], prefix="loc", dtype=int)

print("Columns after encoding:")
print(df.columns.tolist())
print("\nShape:", df.shape)
print("\nFirst 3 rows:")
print(df.head(3))
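
One thing worth checking after the ordinal mapping: Series.map returns NaN for any label that is not a key in the dictionary, so a typo or an unexpected category would quietly reintroduce missing values. The sketch below shows that failure mode; the labels "Good" and "poor" are made-up values that do not appear in the lab data.

```python
import pandas as pd

condition_map = {"fair": 1, "good": 2, "excellent": 3}

# "Good" (capitalised) and "poor" are invented labels, added only to show the problem
condition = pd.Series(["good", "fair", "Good", "excellent", "poor"])

codes = condition.map(condition_map)

# Any label missing from condition_map comes back as NaN
unmapped = sorted(condition[codes.isnull()].unique())
print("Unmapped labels:", unmapped)
print("Missing codes:", int(codes.isnull().sum()))
```

On the lab data every label is covered by condition_map, so the same check should report no unmapped labels.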

Step 4 — split into train and test

Split 80/20 before fitting any scaler. The id column was already dropped.

import pandas as pd
from sklearn.model_selection import train_test_split

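# Uses the cleaned `raw` table defined in Step 3 (no id column, bedrooms already imputed)
condition_map = {"fair": 1, "good": 2, "excellent": 3}
df = raw.copy()
df["condition_code"] = df["condition"].map(condition_map)
df = df.drop(columns=["condition"])
# dtype=int gives 0/1 columns; newer pandas versions default to True/False
df = pd.get_dummies(df, columns=["location"], prefix="loc", dtype=int)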

X = df.drop(columns=["price"])
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.2, random_state=42
)

print("Train shape:", X_train.shape)  # expect (16, 6)
print("Test shape: ", X_test.shape)   # expect (4, 6)
print("Feature columns:", X_train.columns.tolist())

Step 5 — scale numeric features

Standardise bedrooms and area_m2. The encoded columns (0/1 indicators and the ordinal codes 1–3) already sit on small, comparable scales, whereas area_m2 ranges from 42 to 130, so the raw numeric columns are the ones to standardise.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Uses the encoded `df` built in Step 4
X = df.drop(columns=["price"])
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.2, random_state=42
)

numeric_cols = ["bedrooms", "area_m2"]
scaler = StandardScaler()

X_train = X_train.copy()
X_test  = X_test.copy()

X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_test[numeric_cols]  = scaler.transform(X_test[numeric_cols])

print("Final train shape:", X_train.shape)
print("Final test shape: ", X_test.shape)
print("\nTrain numeric stats after scaling:")
print(X_train[numeric_cols].describe().round(3))
print("\nSample of final train features:")
print(X_train.head(3).round(3).to_string())

The scaler is fitted on X_train[numeric_cols] only. Applying scaler.transform(X_test[numeric_cols]) reuses the mean and std from training — no information from the test set was used. The train/test boundary has not been crossed.
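
The claim above can be checked directly. The sketch below is a minimal, self-contained version that uses only the two numeric columns from the lab (with the missing bedrooms values already filled with 3) and compares the fitted scaler against statistics computed by hand from the training rows. It relies on StandardScaler using the population standard deviation (ddof=0).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Numeric columns from the lab data, missing bedrooms already filled with 3
X = pd.DataFrame({
    "bedrooms": [2, 3, 3, 4, 2, 3, 1, 3, 2, 4,
                 3, 2, 4, 1, 3, 2, 3, 4, 3, 2],
    "area_m2":  [65, 90, 78, 120, 55, 85, 42, 95, 70, 130,
                 88, 60, 115, 45, 80, 62, 92, 125, 75, 58],
})

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_train)

# Statistics computed by hand from the training rows only
train_mean = X_train.mean().to_numpy()
train_std = X_train.std(ddof=0).to_numpy()

print("mean_ matches train mean:", np.allclose(scaler.mean_, train_mean))
print("scale_ matches train std:", np.allclose(scaler.scale_, train_std))

# transform() on the test rows is (x - train_mean) / train_std
manual = (X_test.to_numpy() - train_mean) / train_std
print("test transform uses train stats:",
      np.allclose(scaler.transform(X_test), manual))
```

The scaled test columns will generally not have a mean of exactly 0 or a standard deviation of exactly 1. That is expected, because the statistics come from the training rows.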

Done?

You have taken a raw dataset with missing values, string categories, and mixed scales through a complete feature engineering pipeline: clean missing values, encode categoricals (ordinal and one-hot), split 80/20, and scale numerics on the training set only. The result is a pair of DataFrames ready to pass into a scikit-learn estimator. The next tier — Advanced — builds on this foundation to cover model training, evaluation, and iteration.
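
One way to confirm that a train/test pair is model-ready is sketched below. The helper check_model_ready is hypothetical (it is not part of pandas or scikit-learn), and the small frames are made-up values used only to exercise it. It checks three things: the columns match, nothing is missing, and every column is numeric.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def check_model_ready(X_train, X_test):
    """Hypothetical helper: return a list of problems (empty if none were found)."""
    problems = []
    if list(X_train.columns) != list(X_test.columns):
        problems.append("train and test columns differ")
    for name, frame in [("train", X_train), ("test", X_test)]:
        if frame.isnull().any().any():
            problems.append(f"{name} contains missing values")
        non_numeric = [c for c in frame.columns if not is_numeric_dtype(frame[c])]
        if non_numeric:
            problems.append(f"{name} has non-numeric columns: {non_numeric}")
    return problems

# Tiny illustrative frames (made-up values, not the lab's output)
good_train = pd.DataFrame({"bedrooms": [0.5, -0.5], "condition_code": [2, 3]})
good_test = pd.DataFrame({"bedrooms": [0.1], "condition_code": [1]})
bad_test = pd.DataFrame({"bedrooms": [float("nan")], "condition_code": ["fair"]})

print(check_model_ready(good_train, good_test))
print(check_model_ready(good_train, bad_test))
```

Calling check_model_ready(X_train, X_test) on the frames from Step 5 should return an empty list if every step ran as intended.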
