Lab: mini analysis
Load, clean, summarise, and group a dataset end-to-end — a guided exploratory analysis from scratch.
- Perform a complete exploratory analysis pipeline on a new dataset
- Load, inspect, clean, compute statistics, and group in sequence
- Interpret what the numbers tell you, not just how to compute them
This is an optional lab. No new syntax — just a realistic mini-analysis that strings together everything from both modules. Work through each section, run the code, and read the output carefully. The goal is to build the habit of thinking about what the numbers mean, not just producing them.
The dataset is a month of sales at a small online bookshop: 12 orders across three categories, with a few quality problems baked in. Your job is to answer the question: which category generates the most revenue from completed orders?
Step 1 — load and inspect
Always start here. Look at the data before touching anything.
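A minimal sketch of the inspection step, assuming pandas. The lab's actual file is not reproduced here, so the rows below are invented stand-ins with the same shape as the description: 12 orders, three categories, prices stored as strings, one missing price, and two refunded orders. The column names (order_id, category, price, quantity, status) are assumptions too.

```python
import pandas as pd

# Hypothetical stand-in rows, NOT the lab's real data.
cols = ["order_id", "category", "price", "quantity", "status"]
rows = [
    (1, "fiction", "10.00", 1, "completed"),
    (2, "fiction", "12.50", 2, "completed"),
    (3, "science", "40.00", 3, "completed"),
    (4, "non-fiction", "20.00", 2, "completed"),
    (5, "fiction", "8.00", 1, "refunded"),
    (6, "science", "35.00", 2, "completed"),
    (7, "non-fiction", None, 1, "completed"),   # the missing price
    (8, "fiction", "9.50", 2, "completed"),
    (9, "science", "45.00", 2, "refunded"),
    (10, "non-fiction", "22.00", 1, "completed"),
    (11, "fiction", "11.00", 1, "completed"),
    (12, "non-fiction", "18.00", 2, "completed"),
]
df = pd.DataFrame(rows, columns=cols)

print(df.head())                     # first few rows
print(df.dtypes)                     # price is a string column, not a number
print(df.isna().sum())               # missing values per column
print(df["status"].value_counts())   # completed vs refunded
```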
What to notice: price is object (string) even though it should be numeric.
There is one null in price. Two orders are "refunded" and should be excluded
from revenue totals.
Step 2 — clean
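Fix the first two problems: drop the row with the null price, then convert price to float. The third problem, the refunded orders, is handled by the filter in Step 3.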
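One way the cleaning step might look, again on invented stand-in rows rather than the lab's real data (column names are assumptions). The null row is dropped first so that the conversion to float only sees valid price strings.

```python
import pandas as pd

# Hypothetical stand-in rows, NOT the lab's real data.
cols = ["order_id", "category", "price", "quantity", "status"]
rows = [
    (1, "fiction", "10.00", 1, "completed"),
    (2, "fiction", "12.50", 2, "completed"),
    (3, "science", "40.00", 3, "completed"),
    (4, "non-fiction", "20.00", 2, "completed"),
    (5, "fiction", "8.00", 1, "refunded"),
    (6, "science", "35.00", 2, "completed"),
    (7, "non-fiction", None, 1, "completed"),   # the missing price
    (8, "fiction", "9.50", 2, "completed"),
    (9, "science", "45.00", 2, "refunded"),
    (10, "non-fiction", "22.00", 1, "completed"),
    (11, "fiction", "11.00", 1, "completed"),
    (12, "non-fiction", "18.00", 2, "completed"),
]
df = pd.DataFrame(rows, columns=cols)

# Drop the row whose price is missing; .copy() avoids modifying a view of df.
clean = df.dropna(subset=["price"]).copy()

# Convert the price strings to floats.
clean["price"] = clean["price"].astype(float)

print(len(df), "->", len(clean), "rows")
print(clean.dtypes)
```

Always compare the row count before and after a drop: losing one row out of 12 is expected here, losing more would mean something else is wrong.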
Step 3 — add a revenue column and filter
Revenue per order is price * quantity. Add it as a new column, then keep only
completed orders.
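A sketch of the revenue column and the filter, using the same invented stand-in rows (not the lab's real data; column names and the "completed" / "refunded" labels are assumptions). The cleaning from Step 2 is repeated so the snippet runs on its own.

```python
import pandas as pd

# Hypothetical stand-in rows, NOT the lab's real data.
cols = ["order_id", "category", "price", "quantity", "status"]
rows = [
    (1, "fiction", "10.00", 1, "completed"),
    (2, "fiction", "12.50", 2, "completed"),
    (3, "science", "40.00", 3, "completed"),
    (4, "non-fiction", "20.00", 2, "completed"),
    (5, "fiction", "8.00", 1, "refunded"),
    (6, "science", "35.00", 2, "completed"),
    (7, "non-fiction", None, 1, "completed"),
    (8, "fiction", "9.50", 2, "completed"),
    (9, "science", "45.00", 2, "refunded"),
    (10, "non-fiction", "22.00", 1, "completed"),
    (11, "fiction", "11.00", 1, "completed"),
    (12, "non-fiction", "18.00", 2, "completed"),
]
df = pd.DataFrame(rows, columns=cols)

# Step 2 again: drop the null price, convert to float.
clean = df.dropna(subset=["price"]).copy()
clean["price"] = clean["price"].astype(float)

# New column: revenue per order, computed row by row.
clean["revenue"] = clean["price"] * clean["quantity"]

# Boolean mask keeps only completed orders; refunded ones are excluded.
completed = clean[clean["status"] == "completed"].copy()

print(completed[["order_id", "category", "revenue"]])
print(len(completed), "completed orders")
```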
Step 4 — group and answer the question
Now split by category and compute total and mean revenue per group.
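A sketch of the grouping step on the same invented stand-in rows (not the lab's real data, so the figures will differ from the lab's output). The n_orders column is an addition for illustration, so that order count can be compared with revenue.

```python
import pandas as pd

# Hypothetical stand-in rows, NOT the lab's real data.
cols = ["order_id", "category", "price", "quantity", "status"]
rows = [
    (1, "fiction", "10.00", 1, "completed"),
    (2, "fiction", "12.50", 2, "completed"),
    (3, "science", "40.00", 3, "completed"),
    (4, "non-fiction", "20.00", 2, "completed"),
    (5, "fiction", "8.00", 1, "refunded"),
    (6, "science", "35.00", 2, "completed"),
    (7, "non-fiction", None, 1, "completed"),
    (8, "fiction", "9.50", 2, "completed"),
    (9, "science", "45.00", 2, "refunded"),
    (10, "non-fiction", "22.00", 1, "completed"),
    (11, "fiction", "11.00", 1, "completed"),
    (12, "non-fiction", "18.00", 2, "completed"),
]
df = pd.DataFrame(rows, columns=cols)

# Steps 2 and 3 again: clean, add revenue, keep completed orders.
clean = df.dropna(subset=["price"]).copy()
clean["price"] = clean["price"].astype(float)
clean["revenue"] = clean["price"] * clean["quantity"]
completed = clean[clean["status"] == "completed"]

# Group by category and aggregate revenue with named output columns.
summary = (
    completed.groupby("category")["revenue"]
    .agg(total_revenue="sum", mean_revenue="mean", n_orders="count")
    .sort_values("total_revenue", ascending=False)
)
print(summary)
# With these stand-in rows the totals are:
# science 190.0, non-fiction 98.0, fiction 65.0
```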
The named aggregation syntax — agg(total_revenue="sum", ...) — gives your
result columns descriptive names instead of the default "sum", "mean", etc.
Worth using whenever the output will be read by others (or by you in three weeks).
Interpret the result
Science has the highest total revenue despite having fewer orders than fiction, because science books cost more and each order contains more copies. Fiction has the most orders but the lowest revenue per order. Non-fiction sits in between.
That is a real insight: total order count is not the same as total revenue. You can only see this by computing revenue explicitly and grouping.
Done?
You just ran a complete mini-analysis pipeline: load, inspect, clean, engineer a feature, filter, group, and interpret. Every real data project is a longer version of this same sequence. The tools scale; the pattern does not change.