Hyperparameter tuning
Use GridSearchCV to sweep max_depth values, read best_params_ and best_score_, and understand why the test set must stay out of the search loop.
- Run GridSearchCV over a single hyperparameter and read the results
- Explain why the test set must not be used during hyperparameter selection
- Interpret best_params_ and best_score_ correctly
Cross-validation estimates how well a model generalises with one specific set of hyperparameters. GridSearchCV automates the obvious next step: try every combination in a grid, score each with CV, and return the best one. The critical constraint is that the test set must never appear in this loop.
Why the test set must stay out
Suppose you tune max_depth by evaluating on the test set directly. You pick
the depth that scores highest on those specific 60 examples. Now those 60
examples have shaped your choice — you have effectively trained on them.
When you report test accuracy, you are reporting a metric contaminated by the
search itself.
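The effect can be seen without any real model. The sketch below is a toy simulation, not part of the lesson's pipeline: it assumes 20 hypothetical candidates that all guess at random (true accuracy 50%) and scores each on the same 60 labels. The candidate count and the random-guessing setup are illustrative assumptions.

```python
import random

random.seed(0)

N_TEST = 60        # test-set size used in the text
N_CANDIDATES = 20  # hypothetical number of settings tried

# True labels for a made-up binary test set.
labels = [random.randint(0, 1) for _ in range(N_TEST)]

def accuracy_of_random_guesser():
    """Accuracy of a 'model' that guesses at random (true skill: 50%)."""
    guesses = [random.randint(0, 1) for _ in range(N_TEST)]
    return sum(g == y for g, y in zip(guesses, labels)) / N_TEST

scores = [accuracy_of_random_guesser() for _ in range(N_CANDIDATES)]
mean_score = sum(scores) / len(scores)
best_score = max(scores)

print(f"mean accuracy over candidates: {mean_score:.3f}")
print(f"best accuracy (the one you would report): {best_score:.3f}")
```

None of these candidates is better than a coin flip, yet the best of them will typically score above 50% on these 60 examples. Reporting that maximum as "test accuracy" is what selecting on the test set amounts to.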
GridSearchCV avoids this by using cross-validation on the training set only. The test set remains completely untouched until a single final evaluation after the best model is selected.
Running GridSearchCV
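A minimal sketch of the search follows. The lesson's own dataset and estimator are not shown here, so this assumes a synthetic dataset from make_classification (300 samples, 60 held out for testing), a DecisionTreeClassifier, and an illustrative grid of depths. All three are assumptions, not the lesson's exact setup.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumed synthetic data: 300 samples, of which 60 are held out as the test set.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=60, stratify=y, random_state=0)

# Illustrative grid over a single hyperparameter.
param_grid = {"max_depth": [1, 2, 3, 4, 5, 6, 8, 10]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,                     # 5-fold CV inside the training set
    scoring="accuracy",
    return_train_score=True,  # needed later for the train-score column
)

# Only the training split is passed in; X_test and y_test are not touched.
search.fit(X_train, y_train)
print(search.best_params_)
```

With the default refit=True, GridSearchCV retrains the winning configuration on the whole training split after the search, so the fitted search object can be used directly for prediction.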
Reading the output
best_params_ tells you which combination GridSearchCV selected. best_score_
is the mean cross-validation score for that combination — computed entirely on
the training set, using 5-fold CV.
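As a sketch of reading those two attributes, the snippet below repeats an assumed setup (synthetic data, a decision tree, an illustrative depth grid) and checks that best_score_ is simply the mean CV score stored at best_index_.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumed setup: synthetic data with a 60-example test set held out.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=60, stratify=y, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [1, 2, 3, 4, 5, 6, 8, 10]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("best_params_:", search.best_params_)
print(f"best_score_:  {search.best_score_:.3f}")

# best_score_ is the mean of the 5 validation-fold scores for the winner.
mean_cv_of_winner = search.cv_results_["mean_test_score"][search.best_index_]
print(f"mean CV score at best_index_: {mean_cv_of_winner:.3f}")
```

Note that scikit-learn names the validation-fold column mean_test_score. Despite the name, it has nothing to do with the held-out test set.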
The full sweep table is revealing. Watch the mean_train column: it rises
as depth increases (more depth = better fit to training data) and levels off
once the tree fits the training data perfectly. The mean_cv column typically
peaks somewhere in the middle. Beyond that depth, variance grows faster than
bias shrinks, and CV performance degrades.
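One way to print such a table from cv_results_ is sketched below. The mean_train and mean_cv labels follow the text; in scikit-learn the underlying keys are mean_train_score (only present when return_train_score=True) and mean_test_score. The dataset, estimator, and grid are again assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumed setup: synthetic data with a 60-example test set held out.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=60, stratify=y, random_state=0)

param_grid = {"max_depth": [1, 2, 3, 4, 5, 6, 8, 10]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
    return_train_score=True,  # without this, mean_train_score is absent
)
search.fit(X_train, y_train)

results = search.cv_results_
print(f"{'max_depth':>9}  {'mean_train':>10}  {'mean_cv':>8}")
for params, train, cv in zip(results["params"],
                             results["mean_train_score"],
                             results["mean_test_score"]):
    print(f"{params['max_depth']:>9}  {train:>10.3f}  {cv:>8.3f}")
```

On real data the exact shape of the mean_cv column depends on the dataset, so the peak may be sharp, flat, or sit at one end of the grid.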
Final evaluation on the test set
The test accuracy is the number you report and trust. It was computed exactly once, on data that played no role in training or hyperparameter selection.
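The single final evaluation might look like the sketch below. It relies on GridSearchCV's default refit=True, which makes predict use the best configuration retrained on the full training split. The data, estimator, and grid are assumed for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumed setup: synthetic data with a 60-example test set held out.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=60, stratify=y, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [1, 2, 3, 4, 5, 6, 8, 10]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)  # the search sees training data only

# The one and only use of the test set.
test_accuracy = accuracy_score(y_test, search.predict(X_test))

print(f"CV score of selected model: {search.best_score_:.3f}")
print(f"Test accuracy:              {test_accuracy:.3f}")
```

If this number prompts another round of tuning, the test set has re-entered the loop and a fresh held-out set is needed for an honest report.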
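If the test accuracy is lower than the CV score, that is normal: best_score_ is slightly optimistic because it is the maximum over every candidate in the grid, so part of the winner's lead is luck on those particular folds. (The smaller training folds used in CV push the other way, since each fold model sees less data than the final refit.) A gap of 1–3 percentage points is expected. If the test accuracy is substantially higher than the CV score, treat it as a warning sign. On a test set of only 60 examples a few points can be chance, but a large gap suggests a problem with the split, such as the test set having been seen during the search.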
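To judge whether a gap is large, it helps to know how noisy an accuracy measured on 60 examples is. The sketch below uses the binomial standard error sqrt(p(1 - p) / n); the accuracy values plugged in are hypothetical, chosen only to show the scale.

```python
import math

n_test = 60  # test-set size used in the text

# Hypothetical true accuracies, for illustration only.
standard_errors = {}
for acc in (0.80, 0.90):
    se = math.sqrt(acc * (1 - acc) / n_test)
    standard_errors[acc] = se
    print(f"accuracy {acc:.2f}: standard error ~ {100 * se:.1f} percentage points")
```

At this test-set size the standard error is roughly 4 to 5 percentage points, so a gap of a few points between CV and test accuracy is well within sampling noise.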
Where to go next
The lab puts all of this together: build a classification pipeline, tune it with GridSearchCV, and report the full suite of metrics on the held-out test set — the honest evaluation protocol end to end.