
Decision tree practice

Fit DecisionTreeClassifier at three depths, compare train vs test accuracy, and make overfitting visible in the output.

Data Science · Advanced · 10 min read
By the end of this lesson you will be able to:
  • Fit DecisionTreeClassifier at depths 1, 3, and 10 on the same split
  • Compare training and test accuracy at each depth
  • Identify the depth at which overfitting becomes visible

The concept lesson described overfitting as a large gap between training and test accuracy. This lesson makes that gap visible in practice by fitting the same model at three different depths and watching the numbers change.
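
The experiment can be sketched as follows. This is a minimal version under stated assumptions, not the lesson's exact code: the synthetic 10-feature dataset from make_classification, the 70/30 split, the label-noise level, and the names X_train, X_test, and results are all illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: a synthetic 10-feature binary problem stands in for the lesson's data.
# flip_y injects label noise, so a deep tree has something to overfit.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, n_redundant=2,
    flip_y=0.1, random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

results = {}  # max_depth -> (train accuracy, test accuracy)
for depth in (1, 3, 10):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=1)
    clf.fit(X_train, y_train)
    results[depth] = (clf.score(X_train, y_train), clf.score(X_test, y_test))

print("depth  train   test    gap")
for depth, (train_acc, test_acc) in results.items():
    print(f"{depth:>5}  {train_acc:.3f}  {test_acc:.3f}  {train_acc - test_acc:+.3f}")
```

The same split is reused for every depth, so any change in the gap column comes from model complexity alone.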

Python — editable, runs in your browser

Reading the output

Depth 1 is severely underfitted. A single split cannot capture the structure of a 10-feature dataset. Both training and test accuracy are low — the hallmark of high bias. The gap between them is small because the model is not complex enough to overfit.

Depth 3 is often in the productive zone for this kind of dataset. Training accuracy has risen; test accuracy has also risen — the model has enough complexity to find real signal. The gap remains modest.

Depth 10 is where overfitting becomes explicit. Training accuracy approaches 1.0 (the tree has memorised the training set), while test accuracy may be similar to depth 3 or even lower. The gap is the variance penalty.

The practical lesson: more depth does not always mean better test performance. There is a depth at which test accuracy peaks; beyond it, you are fitting noise.
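
One way to make "memorised the training set" concrete is to count how many leaves each tree ends up with. The sketch below is illustrative only: the synthetic dataset and the leaf_stats name are assumptions, while get_n_leaves is the fitted tree's own method.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic 10-feature data with label noise, standing in for the lesson's dataset.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, n_redundant=2,
    flip_y=0.1, random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

leaf_stats = {}  # max_depth -> (number of leaves, mean training samples per leaf)
for depth in (1, 3, 10):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=1).fit(X_train, y_train)
    n_leaves = clf.get_n_leaves()
    leaf_stats[depth] = (n_leaves, len(X_train) / n_leaves)
    print(f"max_depth={depth:>2}  leaves={n_leaves:>4}  "
          f"samples/leaf={len(X_train) / n_leaves:7.1f}")
```

As the leaf count grows, each leaf covers fewer training rows, which is how a deep tree can fit individual noisy labels instead of general structure.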

Extending the experiment

To find the optimal depth empirically, loop over a range and store both scores:

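scores = []  # one (depth, train accuracy, test accuracy) tuple per depth
for d in range(1, 20):  # depths 1 to 19
    clf = DecisionTreeClassifier(max_depth=d, random_state=1)
    clf.fit(X_train, y_train)
    scores.append((d, clf.score(X_train, y_train), clf.score(X_test, y_test)))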

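Plotting those two curves against depth produces a validation curve (score against a hyperparameter, as opposed to a learning curve, which plots score against training-set size), and it makes the optimal-depth region visually obvious. GridSearchCV, covered in the hyperparameter tuning lesson, automates this kind of sweep using cross-validation.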

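Using the test set to pick the best depth is a form of data leakage: you have effectively tuned to the test set, so its accuracy is no longer an unbiased estimate of performance on new data. The correct approach is to use a validation split or cross-validation for the sweep, then evaluate the chosen model on the test set exactly once. GridSearchCV runs the cross-validated sweep for you, provided you fit it on the training data only and keep the test set held out.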
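
A leakage-free version of the sweep might look like the sketch below. The synthetic dataset, the 5-fold setting, and the variable names are assumptions made for illustration; the depth grid mirrors range(1, 20) from the loop above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic 10-feature data with label noise, standing in for the lesson's dataset.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, n_redundant=2,
    flip_y=0.1, random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid={"max_depth": list(range(1, 20))},
    cv=5,                # 5-fold cross-validation inside the training set
    scoring="accuracy",
)
search.fit(X_train, y_train)  # the test set plays no part in choosing the depth

best_depth = search.best_params_["max_depth"]
cv_accuracy = search.best_score_              # mean accuracy across the validation folds
test_accuracy = search.score(X_test, y_test)  # single final check on the held-out test set

print(f"best max_depth={best_depth}  cv={cv_accuracy:.3f}  test={test_accuracy:.3f}")
```

With the default refit=True, GridSearchCV retrains the best configuration on the full training set, so search.score evaluates that refitted tree.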

Where to go next

The next lesson assembles these pieces into a scikit-learn Pipeline: a single object that chains a scaler, an encoder, and a model. Because every preprocessing step is fitted only on the training data, the pipeline removes a common source of data leakage by design.
