Data quality
Missing values, duplicates, wrong types, and outliers contaminate almost every real dataset — learn to recognise them before they corrupt your analysis.
- Recognise the four main data quality problems in a dataset
- Explain why "garbage in, garbage out" makes data cleaning non-optional
- Describe a concrete example of each problem
Real data is rarely clean. It comes from humans clicking checkboxes carelessly, sensors that lose signal, databases that allow optional fields, and systems that were never designed to talk to each other. A data scientist's first job, before any modelling or visualisation, is to find and fix the problems. Skip this step and every result downstream is suspect.
The principle is blunt: garbage in, garbage out. A perfectly written analysis applied to bad data produces confidently wrong answers.
There are four problems you will encounter in almost every dataset.
1. Missing values
A field that should have a value does not. pandas represents these as NaN
(Not a Number) in numeric columns and as None or NaN in object columns. They appear for many
reasons: a survey respondent skipped a question, a sensor failed for one reading,
a join to another table found no match.
Why they matter: arithmetic on NaN propagates, so 5 + NaN is NaN. By default, a
pandas mean over a column with missing values silently skips them (which may or
may not be what you want). Many machine learning models will refuse to run if
any input contains NaN.
Example: a customer table where age is blank for users who signed up via a
third-party OAuth and never completed their profile.
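The behaviour described above can be checked directly. The following is a minimal sketch, assuming a small hypothetical customers table (the column names and values are invented for illustration); it shows NaN propagating through arithmetic, how to count missing values, and how the default mean differs from one that does not skip NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical customer table: age is missing for two users.
customers = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "age": [34.0, np.nan, 27.0, np.nan],
})

# Arithmetic with NaN propagates: the result is NaN.
propagated = 5 + np.nan

# Count the missing values in each column.
missing_per_column = customers.isna().sum()

# By default mean() skips NaN, so only the two known ages are averaged.
mean_skipping_nan = customers["age"].mean()

# With skipna=False the NaN propagates into the result instead.
mean_with_nan = customers["age"].mean(skipna=False)

print(missing_per_column)
print(mean_skipping_nan, mean_with_nan)
```

Counting missing values per column first makes it clear how many rows a "silent" mean is actually based on.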
2. Duplicates
The same observation appears more than once. This happens when data is merged from multiple sources, when a bug causes events to be logged twice, or when a user submits a form twice.
Why they matter: aggregates such as counts and sums are inflated, and averages are pulled towards the duplicated rows. If you count orders and five orders appear twice, your count is off by five. The error is often invisible: the numbers look plausible but are wrong.
Example: a sales log where a network timeout caused the payment system to retry, inserting the same transaction twice.
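To make the retry example concrete, here is a small sketch using an invented sales log (the transaction IDs and amounts are hypothetical). It assumes a duplicate means a row that is identical in every column; real logs may need a narrower key, such as the transaction ID alone.

```python
import pandas as pd

# Hypothetical sales log: transaction T2 was inserted twice by a retry.
sales = pd.DataFrame({
    "transaction_id": ["T1", "T2", "T2", "T3"],
    "amount": [20.0, 50.0, 50.0, 30.0],
})

# duplicated() marks each row that repeats an earlier, identical row.
n_duplicates = int(sales.duplicated().sum())

# The naive total counts the retried payment twice.
naive_total = sales["amount"].sum()

# drop_duplicates() keeps the first occurrence of each row.
deduplicated = sales.drop_duplicates()
true_total = deduplicated["amount"].sum()

print(n_duplicates, naive_total, true_total)
```

Both totals look plausible on their own, which is why checking for duplicates explicitly is safer than eyeballing the aggregate.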
3. Wrong types
A column that should be numeric contains strings. A date column contains the text
"N/A" for unknown dates, forcing the whole column to be stored as strings.
Why they matter: arithmetic on strings does not behave like arithmetic on
numbers. pandas will happily let you store "120" in an object column, but in
recent pandas versions df["amount"].mean() raises a TypeError rather than giving
you the average. The column looks fine until you try to use it.
Example: a CSV exported from a legacy system where the price column occasionally
contains "-" for items that were free, making pandas read the whole column as
text.
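The legacy CSV example can be reproduced in a few lines. This is a sketch under stated assumptions: the CSV content is invented, and treating "-" as a price of zero follows the example's premise that those items were free. In other datasets "-" might mean "unknown", in which case NaN would be the safer choice.

```python
import io

import pandas as pd

# Hypothetical export: "-" marks an item that was free.
csv_text = "item,price\npen,1.50\nsticker,-\nnotebook,4.00\n"
products = pd.read_csv(io.StringIO(csv_text))

# Because of the "-", the price column is read as text, not numbers.
price_is_numeric = pd.api.types.is_numeric_dtype(products["price"])

# Option 1: coerce anything unparseable to NaN (treats "-" as unknown).
prices_unknown = pd.to_numeric(products["price"], errors="coerce")

# Option 2: map "-" to zero first, since here it means "free".
prices_free = pd.to_numeric(products["price"].replace("-", "0"))

print(products["price"].dtype, price_is_numeric)
print(prices_unknown.tolist(), prices_free.tolist())
```

The two options give different averages, so the choice between them is an analytical decision rather than a formatting detail.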
4. Outliers
Values that are technically present and correctly typed, but implausibly extreme. An age of 300. A transaction amount of -50 000. A temperature reading of 9999 (a common sentinel value for sensor error).
Why they matter: outliers distort statistics, especially the mean and standard deviation. A single data entry error of 10x the correct value can shift a mean substantially. More dangerously, outliers can look like interesting findings rather than errors — and you may not notice without checking.
Not every outlier is an error. A genuine best-seller may have 100x the sales of a typical product. The discipline is to flag extreme values, investigate them, and make a deliberate decision — not to automatically delete anything that looks unusual.
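Flagging rather than deleting can be sketched as follows. The ages are invented, the 0 to 120 plausibility range is an assumed domain rule, and the 1.5 x IQR fence is one common rule of thumb rather than the only valid method. Both checks only mark values for investigation; nothing is removed.

```python
import pandas as pd

# Hypothetical ages with one data entry error.
ages = pd.Series([23, 31, 35, 29, 300], name="age")

# A single extreme value drags the mean far from the typical age,
# while the median barely notices it.
mean_with_outlier = ages.mean()
median_with_outlier = ages.median()

# Domain rule (an assumption): ages outside 0 to 120 are implausible.
implausible = ages[(ages < 0) | (ages > 120)]

# Statistical rule of thumb: flag values beyond 1.5 * IQR from the quartiles.
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
flagged = ages[(ages < q1 - 1.5 * iqr) | (ages > q3 + 1.5 * iqr)]

print(mean_with_outlier, median_with_outlier)
print(implausible.tolist(), flagged.tolist())
```

A statistical rule would also flag a genuine best-seller, so the flagged rows are a list to investigate, not a list to drop.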
Check your understanding
- 1. Which of these are data quality problems you should check for?
- 2. True or false: every outlier in a dataset is a data entry error and should be deleted.
- 3. What does 5 + NaN evaluate to in pandas?
Where to go next
Next: cleaning data — the hands-on counterpart to this lesson. You will use pandas to drop missing values, remove duplicates, and fix column types in a real (if small) dirty DataFrame.