Decision tree practice
Fit DecisionTreeClassifier at three depths, compare train vs test accuracy, and make overfitting visible in the output.
- Fit DecisionTreeClassifier at depths 1, 3, and 10 on the same split
- Compare training and test accuracy at each depth
- Identify the depth at which overfitting becomes visible
The concept lesson described overfitting as a large gap between training and test accuracy. This lesson makes that gap visible in practice by fitting the same model at three different depths and watching the numbers change.
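The experiment behind the output discussed in this lesson can be sketched roughly as follows. This is a minimal sketch, not the lesson's exact code. The synthetic dataset from make_classification, its sample count and label-noise level, and the 75/25 split are all assumptions chosen to produce a 10-feature problem. Only DecisionTreeClassifier, the depths 1, 3, and 10, and the shared split come from the text.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: a synthetic, slightly noisy 10-feature dataset stands in
# for whatever data the lesson actually uses.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, n_redundant=2,
    flip_y=0.1, random_state=1,
)

# One split, reused for every depth so the scores are comparable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)

results = {}  # depth -> (train accuracy, test accuracy)
for depth in (1, 3, 10):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=1)
    clf.fit(X_train, y_train)
    train_acc = clf.score(X_train, y_train)
    test_acc = clf.score(X_test, y_test)
    results[depth] = (train_acc, test_acc)
    gap = train_acc - test_acc
    print(f"depth={depth:>2}  train={train_acc:.3f}  test={test_acc:.3f}  gap={gap:.3f}")
```

The exact numbers depend on the dataset and the seed, so read the output for the trend in the gap column rather than for specific values.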
Reading the output
Depth 1 is severely underfitted. A single split (a decision stump) divides the data on one threshold of one feature, which cannot capture the structure of a 10-feature dataset. Training and test accuracy are both low relative to the deeper trees, the hallmark of high bias. The gap between them is small because the model is not complex enough to overfit.
Depth 3 is often in the productive zone for this kind of dataset. Training accuracy has risen; test accuracy has also risen — the model has enough complexity to find real signal. The gap remains modest.
Depth 10 is where overfitting becomes explicit. Training accuracy approaches 1.0 (the tree has memorised the training set), while test accuracy may be similar to depth 3 or even lower. The gap is the variance penalty.
The practical lesson: more depth does not always mean better test performance. There is a depth at which test accuracy peaks; beyond it, you are fitting noise.
Extending the experiment
To find the optimal depth empirically, loop over a range and store both scores:
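
    from sklearn.tree import DecisionTreeClassifier

    train_scores, test_scores = [], []
    for d in range(1, 20):
        clf = DecisionTreeClassifier(max_depth=d, random_state=1)
        clf.fit(X_train, y_train)  # same train/test split as the three-depth fit
        train_scores.append(clf.score(X_train, y_train))
        test_scores.append(clf.score(X_test, y_test))

Plotting those two curves against depth produces a validation curve (accuracy against a hyperparameter value, unlike a learning curve, which plots accuracy against training-set size) that makes the optimal-depth region visually obvious. This is exactly the kind of sweep that GridSearchCV automates in the hyperparameter tuning lesson.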
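One way to draw the two curves is sketched below with matplotlib. The dataset, the split, and the output filename depth_sweep.png are assumptions made so the example runs on its own; they are not taken from the lesson.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render to a file, no display needed
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic 10-feature data in place of the lesson's dataset.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, n_redundant=2,
    flip_y=0.1, random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)

# Sweep max_depth and record both accuracies at each value.
depths = list(range(1, 20))
train_scores, test_scores = [], []
for d in depths:
    clf = DecisionTreeClassifier(max_depth=d, random_state=1)
    clf.fit(X_train, y_train)
    train_scores.append(clf.score(X_train, y_train))
    test_scores.append(clf.score(X_test, y_test))

# Plot both curves on shared axes so the gap between them is visible.
fig, ax = plt.subplots()
ax.plot(depths, train_scores, marker="o", label="train accuracy")
ax.plot(depths, test_scores, marker="o", label="test accuracy")
ax.set_xlabel("max_depth")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy against tree depth")
ax.legend()
fig.savefig("depth_sweep.png")  # hypothetical output filename
```

On data like this, the training curve typically keeps climbing while the test curve flattens or dips. The region where the two separate is where extra depth starts fitting noise.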
Using the test set to pick the best depth is a form of data leakage: you have effectively tuned to the test set, so its score is no longer an unbiased estimate of performance on new data. The correct approach is to run the sweep with a validation split or cross-validation on the training data, then evaluate the chosen model on the test set exactly once. GridSearchCV does this when it is fitted on the training data only.
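A leakage-free version of the sweep might look like the sketch below. GridSearchCV cross-validates each depth using only the training data, and the test set is scored once at the end. The dataset, the split, and the choice of 5 folds are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic 10-feature data in place of the lesson's dataset.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, n_redundant=2,
    flip_y=0.1, random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)

# Cross-validate every candidate depth on the training data only.
search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=1),
    param_grid={"max_depth": list(range(1, 20))},
    cv=5,  # assumption: 5 folds
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_depth = search.best_params_["max_depth"]
cv_accuracy = search.best_score_  # mean cross-validated accuracy of the best depth

# The test set is touched exactly once, after the depth has been chosen.
# With the default refit=True, score() uses the best model refitted on all training data.
test_accuracy = search.score(X_test, y_test)

print(f"best max_depth={best_depth}  cv accuracy={cv_accuracy:.3f}  test accuracy={test_accuracy:.3f}")
```

Because the depth is chosen from cross-validation scores alone, the final test accuracy remains an honest estimate of how the selected tree generalises.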
Where to go next
The next lesson assembles these pieces into a sklearn pipeline: a single object that chains a scaler, an encoder, and a model, so that preprocessing is fitted on the training data only and cannot leak information from held-out data.