Reshaping data
Melt converts wide data to long, pivot converts long to wide — understanding which shape your analysis needs saves a lot of friction.
- Explain the difference between wide and long data formats with a concrete example
- Describe when melt (wide to long) and pivot (long to wide) are the right choice
- Identify which shape a given analysis requires before writing any code
The same data can be stored in radically different layouts. Choosing the wrong layout before analysis does not make results wrong — it makes them hard. The two layouts you will encounter constantly are wide and long (sometimes called tidy).
A concrete example
Imagine temperature readings for three cities across four months. In wide format, each city is its own column:
| month | London | Paris | Berlin |
|---|---|---|---|
| Jan | 4 | 6 | 2 |
| Feb | 5 | 8 | 3 |
| Mar | 9 | 13 | 8 |
| Apr | 12 | 17 | 12 |
In long format, there is one row per observation:
| month | city | temp |
|---|---|---|
| Jan | London | 4 |
| Jan | Paris | 6 |
| Jan | Berlin | 2 |
| Feb | London | 5 |
| … | … | … |
The wide table has 4 rows and 4 columns (12 data values). The long table has 12 rows and 3 columns — the same 12 values. Neither representation is more "correct"; each is better suited to different tasks.
When long form is better
Long form is the natural input for most pandas operations. groupby("city") is
trivial in long form — just group on the city column. In wide form, you would
have to loop over column names, which is fragile and verbose.
Long form is also what most plotting libraries expect. To draw a line per city, you need city as a value in a column — not spread across column headers.
Any tidy-data operation that works on a single column of values (aggregations, filters, plots) is easier when your variable of interest is a column, not a row.
When wide form is better
Wide form has its moments too. Computing a correlation matrix across cities
requires them to be columns — df[["London","Paris","Berlin"]].corr() works
directly in wide form, while long form requires a pivot first.
Wide form also reads more naturally as a display table: a human scanning the temperature grid above reads it faster in wide form. If the final output is a table for a report, wide form is usually cleaner.
The general rule: use long form for computation, wide form for display. Start in long, aggregate and analyse, then pivot to wide only at the point where you need a human-readable summary.
melt and pivot as inverses
Melt is the operation that goes wide to long: it takes a set of columns and "melts" them into two columns — one holding the old column names, one holding the values. This is the direction you use most when incoming data is wide (a common format for spreadsheet exports).
Pivot goes the other direction, long to wide: it takes values from one column
and spreads them out into new columns, one per unique value. pivot_table is the
more flexible variant that also applies an aggregation, so you don't need to
pre-deduplicate.
In the next lesson you will run both operations on the temperature dataset above and see exactly what each parameter controls.
Where to go next
Next: window operations — running .melt(), .pivot_table(), .rolling(),
and .shift() in code, and building lag features for time-series analysis.