Statistics concepts
Mean, median, mode, and standard deviation — what each one measures and when to reach for each.
- Explain mean, median, and mode with a concrete number example
- Describe when median is more informative than mean
- Explain what standard deviation tells you about a dataset
Before you write a single line of pandas, you need the concepts behind the numbers it produces. Mean, median, mode, and standard deviation are not arcane statistics — they are answers to very practical questions about your data. Reaching for the right one depends on knowing what question each one is actually answering.
Mean — the balance point
The mean (arithmetic average) is the sum of all values divided by the count:
values: 3, 7, 7, 8, 10
sum: 35 | count: 5 | mean: 7.0

The mean is the "balance point" of the distribution: if you placed the values on a seesaw, the mean is where it balances. It uses every value, which is both its strength and its weakness, because one extreme value shifts the balance significantly.
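As a minimal sketch of the arithmetic above, the snippet below computes the mean by hand and with Python's standard-library `statistics` module. The extra value of 1000 is a hypothetical outlier added only to show how far a single extreme value pulls the result.

```python
import statistics

values = [3, 7, 7, 8, 10]

# Mean by definition: the sum of the values divided by how many there are.
mean_by_hand = sum(values) / len(values)

# statistics.fmean computes the same thing and always returns a float.
mean_stdlib = statistics.fmean(values)

# Hypothetical outlier, added for illustration only.
with_outlier = values + [1000]
mean_with_outlier = statistics.fmean(with_outlier)

print(mean_by_hand)       # 7.0
print(mean_stdlib)        # 7.0
print(mean_with_outlier)  # 172.5
```

Five of the six values are 10 or below, yet the mean lands at 172.5, which describes none of them well.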
Median — the middle value
The median is the value that falls in the middle when you sort the data. Half the values are below it, half above.
sorted: 3, 7, 7, 8, 10
median: 7 (the middle of 5 values)

For an even count, take the mean of the two middle values. The median is resistant to outliers: adding a value of 1,000 to the list above moves the median only from 7 to 7.5, but it drags the mean from 7 to 172.5.
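The sort-and-pick-the-middle rule can be sketched in a few lines. `median_by_hand` is a hypothetical helper written for this illustration; `statistics.median` from the standard library is used as a cross-check, and the value 1000 is an assumed outlier.

```python
import statistics


def median_by_hand(data):
    """Return the middle value of the sorted data (mean of the two middle values if the count is even)."""
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


values = [3, 7, 7, 8, 10]
with_outlier = values + [1000]  # hypothetical outlier, for illustration only

print(median_by_hand(values))            # 7
print(median_by_hand(with_outlier))      # 7.5
print(statistics.median(with_outlier))   # 7.5
print(statistics.fmean(with_outlier))    # 172.5
```

The outlier nudges the median by half a unit while the mean moves by more than 165, which is the resistance to outliers described above.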
This is why income statistics usually report the median rather than the mean. A handful of billionaires raises the mean household income dramatically while the median barely moves, so the median more accurately represents the "typical" household.
Mode — the most common value
The mode is the value that appears most often.
values: 3, 7, 7, 8, 10
mode: 7 (appears twice; all others appear once)

Mode is most useful for categorical data: the most common product category, the most frequent country in a customer table. For continuous numeric data it is often less meaningful: if every value is unique, there is no mode at all.
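A short sketch of finding the mode with the standard library follows. The `categories` list and the `unique_values` list are made-up data for illustration. Note that `statistics.multimode` reports every value as tied when nothing repeats, which amounts to the same thing as having no useful mode.

```python
import statistics
from collections import Counter

values = [3, 7, 7, 8, 10]
numeric_mode = statistics.mode(values)
print(numeric_mode)  # 7

# Hypothetical categorical column, e.g. product categories in an orders table.
categories = ["books", "toys", "books", "garden", "books", "toys"]
category_counts = Counter(categories)
top_category, top_count = category_counts.most_common(1)[0]
print(top_category, top_count)  # books 3

# When every value is unique, all values tie for "most common".
unique_values = [1.2, 3.4, 5.6]
tied_modes = statistics.multimode(unique_values)
print(tied_modes)  # [1.2, 3.4, 5.6]
```

Counting occurrences with `Counter` also gives the full frequency table, which is often more informative than the single top value.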
Standard deviation — how spread out are the values?
The standard deviation measures how far values typically stray from the mean. A low standard deviation means values cluster closely around the mean; a high one means they are spread out.
tight cluster: 7, 7, 8, 7, 8 → std ≈ 0.5
spread out: 1, 3, 7, 11, 13 → std ≈ 4.6
(These figures use the population formula, which divides by the count n. The sample formula, which divides by n - 1 and is the default for pandas .std(), gives ≈ 0.5 and ≈ 5.1.)

Standard deviation matters for spotting outliers: a value more than two or three standard deviations from the mean is worth investigating. It also lets you compare variability across different datasets. A consistent manufacturing process has a lower standard deviation than an inconsistent one, even if the means are the same.
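The two example lists can be checked with the standard-library `statistics` module, which offers both the population (`pstdev`) and sample (`stdev`) versions. The `flag_outliers` helper, its threshold of two standard deviations, and the `readings` list are assumptions made for this sketch, not a fixed rule.

```python
import statistics

tight = [7, 7, 8, 7, 8]
spread = [1, 3, 7, 11, 13]

for name, data in (("tight", tight), ("spread", spread)):
    population_std = statistics.pstdev(data)  # divides by n
    sample_std = statistics.stdev(data)       # divides by n - 1
    print(f"{name}: population {population_std:.2f}, sample {sample_std:.2f}")
# tight: population 0.49, sample 0.55
# spread: population 4.56, sample 5.10


def flag_outliers(data, threshold=2.0):
    """Return values more than `threshold` sample standard deviations from the mean."""
    mean = statistics.fmean(data)
    std = statistics.stdev(data)
    return [x for x in data if abs(x - mean) > threshold * std]


# Hypothetical sensor readings with one suspicious value.
readings = [10, 11, 9, 10, 12, 10, 11, 30]
print(flag_outliers(readings))  # [30]
```

A flagged value is a prompt to investigate, not proof of an error. Because the outlier itself inflates the standard deviation, a stricter threshold of 3 would not flag 30 in this small sample.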
A useful heuristic: if your data is roughly symmetric (similar numbers of small and large values), use the mean. If it is skewed — with a long tail of extreme values — use the median. You can often tell by checking whether mean and median differ substantially; a big gap suggests skew.
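One way to apply this heuristic is to put the mean and median side by side and express their gap in units of standard deviation. The `relative_gap` helper and the `skewed` income figures below are hypothetical, and what counts as a "substantial" gap remains a judgment call rather than a fixed cutoff.

```python
import statistics


def relative_gap(data):
    """Absolute difference between mean and median, in units of sample standard deviation."""
    mean = statistics.fmean(data)
    median = statistics.median(data)
    return abs(mean - median) / statistics.stdev(data)


symmetric = [1, 3, 7, 11, 13]
skewed = [30, 35, 40, 45, 50, 400]  # hypothetical incomes, in thousands

for name, data in (("symmetric", symmetric), ("skewed", skewed)):
    mean = statistics.fmean(data)
    median = statistics.median(data)
    print(f"{name}: mean {mean:.1f}, median {median:.1f}, gap {relative_gap(data):.2f}")
# symmetric: mean 7.0, median 7.0, gap 0.00
# skewed: mean 100.0, median 42.5, gap 0.39
```

For the skewed list the mean is more than double the median, so the median is the better description of a typical value there.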
Check your understanding
1. When is the median a better summary than the mean?
2. What does a high standard deviation tell you about a dataset?
3. Mode is most useful when working with which kind of data?
Where to go next
Next: calculating stats. You will compute these measures on a real pandas Series or DataFrame with .mean(), .median(), .std(), and .value_counts().