Statistics concepts

Mean, median, mode, and standard deviation — what each one measures and when to reach for each.

By the end of this lesson you will be able to:

Explain mean, median, and mode with a concrete number example
Describe when median is more informative than mean
Explain what standard deviation tells you about a dataset

Before you write a single line of pandas, you need the concepts behind the numbers it produces. Mean, median, mode, and standard deviation are not arcane statistics — they are answers to very practical questions about your data. Reaching for the right one depends on knowing what question each one is actually answering.

Mean — the balance point

The mean (arithmetic average) is the sum of all values divided by the count:

values: 3, 7, 7, 8, 10
sum: 35  |  count: 5  |  mean: 7.0

The mean is the "balance point" of the distribution — if you placed the values on a seesaw, the mean is where it balances. It uses every value, which is its strength and its weakness: one extreme value shifts the balance significantly.
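The arithmetic above can be sketched in a few lines of plain Python. This is only an illustration: the variable names are made up for the example, and the extra value of 1000 is a hypothetical outlier added to show how strongly a single extreme value pulls the mean.

```python
values = [3, 7, 7, 8, 10]

# Mean = sum of all values divided by the count.
balance_point = sum(values) / len(values)
print(balance_point)  # 7.0

# "Balance point": the deviations from the mean cancel out exactly.
deviations = [v - balance_point for v in values]
print(sum(deviations))  # 0.0

# One extreme value (hypothetical) drags the mean far from the other values.
with_outlier = values + [1000]
mean_with_outlier = sum(with_outlier) / len(with_outlier)
print(mean_with_outlier)  # 172.5
```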

Median — the middle value

The median is the value that falls in the middle when you sort the data. At least half the values are at or below it, and at least half are at or above it.

sorted: 3, 7, 7, 8, 10
median: 7  (the middle of 5 values)

For an even count, take the mean of the two middle values. The median is resistant to outliers: adding a value of 1 000 to the list above moves the median only from 7 to 7.5, but it drags the mean from 7 to 172.5.

This is why income statistics are usually reported as a median rather than a mean. A handful of billionaires raises the mean household income dramatically while the median barely moves, so the median better represents the "typical" household.
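To see the outlier resistance in numbers, here is a small sketch using Python's standard `statistics` module. The list is the one from this lesson; the added value of 1000 is a hypothetical outlier, and the variable names are illustrative only.

```python
from statistics import mean, median

values = [3, 7, 7, 8, 10]
with_outlier = values + [1000]  # hypothetical extreme value

median_before = median(values)       # odd count: the single middle value
median_after = median(with_outlier)  # even count: mean of the two middle values, 7 and 8
mean_before = mean(values)
mean_after = mean(with_outlier)

print(median_before, median_after)  # 7 7.5
print(mean_before, mean_after)      # 7 172.5
```

The median shifts by half a unit while the mean moves by more than 165, which is the same effect a few very large incomes have on household income figures.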

Mode — the most common value

The mode is the value that appears most often.

values: 3, 7, 7, 8, 10
mode: 7  (appears twice; all others appear once)

Mode is most useful for categorical data: the most common product category, the most frequent country in a customer table. For continuous numeric data it is often less meaningful: if every value is unique, every value ties for most common and no single mode stands out.
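A short sketch of finding the mode with Python's standard library. The `categories` list is invented for illustration. `statistics.multimode` returns every value tied for most frequent, so a list of all-unique values comes back whole, which is a sign that no single mode exists.

```python
from collections import Counter
from statistics import multimode

values = [3, 7, 7, 8, 10]
numeric_modes = multimode(values)
print(numeric_modes)  # [7]

# Hypothetical categorical column: product categories from an orders table.
categories = ["books", "toys", "books", "games", "books", "toys"]
top_category = Counter(categories).most_common(1)
print(top_category)  # [('books', 3)]

# Continuous data where every value is unique: all values tie.
unique_modes = multimode([1.2, 3.4, 5.6])
print(unique_modes)  # [1.2, 3.4, 5.6]
```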

Standard deviation — how spread out are the values?

The standard deviation measures how far values typically stray from the mean. A low standard deviation means values cluster closely around the mean; a high one means they are spread out.

tight cluster: 7, 7, 8, 7, 8   → std ≈ 0.5
spread out:    1, 3, 7, 11, 13  → std ≈ 4.6

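The figures above use the population formula (divide by n). The sample formula (divide by n − 1), which pandas .std() applies by default, gives slightly larger values for the same lists: ≈ 0.55 and ≈ 5.1.

Standard deviation matters for spotting outliers: a value more than two or three standard deviations from the mean is worth investigating. It also lets you compare variability across different datasets: a consistent manufacturing process has a lower standard deviation than an inconsistent one, even if the means are the same.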
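The sketch below reproduces the two example figures with Python's standard `statistics` module and then applies the "more than two standard deviations" rule of thumb. The `readings` list and the `THRESHOLD` constant are hypothetical, chosen only to illustrate the idea. `pstdev` is the population formula and `stdev` is the sample formula.

```python
from statistics import mean, pstdev, stdev

tight = [7, 7, 8, 7, 8]
spread = [1, 3, 7, 11, 13]

# Population formula (divide by n): matches the figures quoted above.
print(round(pstdev(tight), 2), round(pstdev(spread), 2))  # 0.49 4.56

# Sample formula (divide by n - 1): slightly larger for the same data.
print(round(stdev(tight), 2), round(stdev(spread), 2))    # 0.55 5.1

# Hypothetical sensor readings with one suspicious value.
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 30]
center = mean(readings)
typical_distance = pstdev(readings)

# Flag anything more than THRESHOLD standard deviations from the mean.
THRESHOLD = 2
flagged = [r for r in readings if abs(r - center) > THRESHOLD * typical_distance]
print(flagged)  # [30]
```

A flagged value is a prompt to investigate, not proof of an error: it may be a typo, a faulty sensor, or a genuine extreme observation.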

A useful heuristic: if your data is roughly symmetric (similar numbers of small and large values), use the mean. If it is skewed — with a long tail of extreme values — use the median. You can often tell by checking whether mean and median differ substantially; a big gap suggests skew.
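One way to turn this heuristic into a quick check is sketched below. The helper names `mean_median_gap` and `suggest_center`, the 10% tolerance, and both sample lists are assumptions made for this example; there is no standard cutoff for what counts as a "big gap".

```python
from statistics import mean, median

def mean_median_gap(data):
    """Gap between mean and median, as a fraction of the median.

    Assumes the median is not zero.
    """
    return abs(mean(data) - median(data)) / abs(median(data))

def suggest_center(data, tolerance=0.1):
    """Suggest 'mean' for roughly symmetric data, 'median' when the gap is large.

    The 10% tolerance is an arbitrary illustrative cutoff, not a standard.
    """
    return "median" if mean_median_gap(data) > tolerance else "mean"

symmetric = [4, 5, 6, 7, 8]
skewed = [20, 22, 25, 27, 30, 400]  # hypothetical incomes, in thousands

print(suggest_center(symmetric))  # mean
print(suggest_center(skewed))     # median
```

In practice a histogram is a good companion to this check, since the gap alone does not show where the long tail lies.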

Where to go next

Next: calculating stats — computing these measures on a real pandas Series/DataFrame with .mean(), .median(), .std(), and .value_counts().
