What is a feature?
A feature is a numeric input to a model — and most raw data isn't numeric yet. Feature engineering bridges that gap.
- Define "feature" in the machine learning sense
- Distinguish raw data from an engineered feature
- Give examples of features a model can and cannot use directly
Machine learning models are mathematical functions. They receive numbers as input, perform arithmetic on them, and return a number, such as a predicted value or a probability. That single fact has a large consequence: every piece of information you want a model to use must eventually become a number. A feature is one such numeric input: one column in the input matrix, holding one value for each row.
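To make this concrete, here is a minimal sketch of a toy linear model in plain Python. The function name predict, the weights, and the input values are all hypothetical and chosen only for illustration; the point is simply that the model does arithmetic on numbers.

```python
def predict(features, weights, bias):
    """Toy linear model: a weighted sum of numeric features plus a bias."""
    return sum(w * x for w, x in zip(weights, features)) + bias


# One row of inputs: [bedrooms, floor_area]. Values are made up.
row = [3, 85.0]
weights = [10_000.0, 2_000.0]  # hypothetical "learned" weights
bias = 50_000.0

prediction = predict(row, weights, bias)
print(prediction)  # 250000.0
```

Each entry in row is one feature value. If any entry were a string or a date object, the multiplication inside predict would fail or produce something meaningless.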
Raw data vs features
Raw data rarely arrives in a model-ready form. Consider a housing dataset with these columns:
| Column | Type | Model-ready? |
|---|---|---|
| price | float | Yes, already numeric |
| bedrooms | int | Yes, already numeric |
| neighbourhood | string | No, categories need encoding |
| sale_date | date string | No, dates need extraction |
| description | free text | No, text needs numerical representation |
The raw neighbourhood column is a string like "Hackney" or "Islington". A
model cannot compute "Hackney" * 0.3. You must convert it. The raw
sale_date is "2023-04-15". A model can use the year, month, or day-of-week
as numbers — but not the date string directly.
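The sketch below illustrates the same point with pandas, assuming a small made-up DataFrame that reuses the column names from the table. The row values and the new column names (sale_year, sale_month, sale_day_of_week) are illustrative assumptions, not part of any real dataset.

```python
import pandas as pd

# Made-up rows using the column names from the table above.
df = pd.DataFrame({
    "price": [450000.0, 620000.0],
    "bedrooms": [2, 3],
    "neighbourhood": ["Hackney", "Islington"],
    "sale_date": ["2023-04-15", "2023-11-02"],
    "description": ["Bright flat near the park", "Terraced house with garden"],
})

# Only the numeric columns are usable as they stand.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
print(numeric_cols)  # ['price', 'bedrooms']

# A string cannot take part in the model's arithmetic.
try:
    "Hackney" * 0.3
    string_arithmetic_failed = False
except TypeError:
    string_arithmetic_failed = True

# Parse the date strings, then pull out numeric parts.
sale_date = pd.to_datetime(df["sale_date"])
df["sale_year"] = sale_date.dt.year
df["sale_month"] = sale_date.dt.month
df["sale_day_of_week"] = sale_date.dt.dayofweek  # Monday=0, Sunday=6
```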
Feature engineering
Feature engineering is the process of constructing model-ready numeric columns from raw data. It is often the step that makes the biggest difference to model performance — more so than the choice of algorithm.
Some common transformations:
- From a date column: extract year, month, day_of_week, is_weekend, or days_since_reference. Each becomes a separate feature.
- From a string column: compute word_count, char_length, or whether a keyword appears (has_keyword). For category columns, apply one-hot encoding or ordinal encoding (covered in the next lesson).
- From numeric columns: compute interactions (age * income), polynomial terms (age^2), bins (pd.cut(age, bins=5)), or ratios (price / floor_area).
- From a birth year: compute age = current_year - birth_year. Age carries the same information as the raw birth year, since each can be computed from the other, but it is easier to interpret.
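The transformations listed above could be sketched in pandas roughly as follows. The data is made up, the reference date and the fixed current_year of 2023 are arbitrary assumptions chosen to keep the output reproducible, and the keyword "garden" is only an example.

```python
import pandas as pd

# Made-up data; one frame is used for brevity even though the columns
# would not normally live together.
df = pd.DataFrame({
    "sale_date": pd.to_datetime(["2023-04-15", "2023-11-02", "2024-01-08"]),
    "description": ["Bright flat near the park",
                    "Terraced house with garden",
                    "Garden flat"],
    "price": [450000.0, 620000.0, 380000.0],
    "floor_area": [60.0, 100.0, 50.0],
    "income": [40000, 65000, 32000],
    "birth_year": [1989, 1972, 1996],
})

# Date features
df["is_weekend"] = (df["sale_date"].dt.dayofweek >= 5).astype(int)
reference = pd.Timestamp("2023-01-01")  # assumed reference date
df["days_since_reference"] = (df["sale_date"] - reference).dt.days

# String features
df["word_count"] = df["description"].str.split().str.len()
df["char_length"] = df["description"].str.len()
df["has_keyword"] = df["description"].str.contains("garden", case=False).astype(int)

# Birth year to age, using a fixed year instead of today's date
current_year = 2023
df["age"] = current_year - df["birth_year"]

# Numeric features
df["age_x_income"] = df["age"] * df["income"]           # interaction
df["age_squared"] = df["age"] ** 2                      # polynomial term
df["age_bin"] = pd.cut(df["age"], bins=5, labels=False)  # equal-width bins as integer codes
df["price_per_area"] = df["price"] / df["floor_area"]   # ratio

print(df[["is_weekend", "word_count", "has_keyword", "age", "price_per_area"]])
```

Passing labels=False to pd.cut returns integer bin codes rather than interval labels, which keeps the new column numeric.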
Domain knowledge drives feature engineering. A model does not know that "day_of_week" matters for a retail sales dataset — you have to hypothesise it, create the feature, and then evaluate whether it helps. This is what makes feature engineering a creative as well as a technical process.
What models cannot use directly
- Strings (including category labels)
- Dates and datetimes (as objects)
- Lists or nested structures
- Columns with inconsistent types (e.g. some rows are None, some are "N/A", some are a float; the model needs one consistent type)
Any of these must be cleaned and transformed before a model sees them.
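As one possible way to handle the inconsistent-type case, the sketch below uses pd.to_numeric with errors="coerce" to force a mixed column into a single float type. The column name and values are made up for illustration.

```python
import pandas as pd

# Made-up column mixing None, the string "N/A", and floats.
raw = pd.Series([None, "N/A", 72.5, 60.0], name="floor_area")
print(raw.dtype)  # object

# Anything that cannot be parsed as a number becomes NaN.
clean = pd.to_numeric(raw, errors="coerce")
print(clean.dtype)  # float64

n_missing = int(clean.isna().sum())
print(n_missing)  # 2
```

The column now has one consistent type, but the NaN values still need a decision, such as dropping or filling them, before most models will accept the column.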
Where to go next
Next: encoding categoricals — converting string category columns into numbers using one-hot encoding and ordinal encoding in pandas.