What is data?
Data is observations stored in a format a program can read — and the format you choose shapes every question you can ask.
- Explain what data is in terms of observations and format
- Distinguish between structured, unstructured, and semi-structured data
- Describe why the shape of data constrains what questions you can ask
Before you load a file or call a single pandas function, it pays to be clear about what data actually is. Data is observations about the world, stored in a format a program can read. An observation might be a customer's age, a temperature reading, a tweet, or a medical scan, but it only becomes data once it is recorded in some format.
That last part — "in a format a program can read" — matters more than it sounds. The format determines what operations you can do, which tools you can use, and which questions you can even ask.
Three kinds of data
Data scientists talk about three broad categories:
Structured data follows a strict schema: every record has the same fields, the same types, and the same shape. A spreadsheet of sales orders is structured. A database table is structured. Structured data is easy for machines to query and aggregate — but creating it requires discipline up front.
Unstructured data has no imposed schema. A paragraph of customer feedback, a photograph, an audio clip — these are unstructured. There is information in them, but it is not pre-divided into labelled fields. Extracting structure from unstructured data is one of the hardest problems in the field.
Semi-structured data sits in between. It has some organisational markers — typically tags or keys — but does not enforce a rigid schema across records. JSON and XML are the canonical examples. A JSON object has named keys, but different objects in the same collection can have different keys, or the same key with different types.
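To make the contrast between structured and semi-structured data concrete, here is a minimal sketch using only Python's standard library. The sample records and field names (order_id, amount, note) are invented for illustration. It parses a small CSV and a small JSON collection, then counts how many distinct sets of keys appear in each.

```python
import csv
import io
import json

# Hypothetical sample data, invented for illustration.
csv_text = "order_id,amount\n1,9.99\n2,24.50\n"
json_text = '[{"order_id": 1, "amount": 9.99}, {"order_id": "A-2", "note": "refund"}]'

# Structured: every CSV row gets the same fields, taken from the header row.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))
csv_key_sets = {frozenset(row) for row in csv_rows}

# Semi-structured: each JSON object carries its own keys.
json_records = json.loads(json_text)
json_key_sets = {frozenset(record) for record in json_records}

print(len(csv_key_sets))   # distinct key sets among the CSV rows
print(len(json_key_sets))  # distinct key sets among the JSON objects

# The same key can also hold different types from one object to the next.
order_id_types = {type(record["order_id"]).__name__ for record in json_records}
print(sorted(order_id_types))
```

Both CSV rows share one set of keys, while the two JSON objects do not, and order_id is a number in one object and a string in the other. Code that reads semi-structured data usually has to check for missing keys and unexpected types before it can do anything else.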
Concretely:
| Format | Category | Example |
|---|---|---|
| CSV (comma-separated values) | Structured | Sales records, sensor logs |
| JSON | Semi-structured | API responses, config files |
| Free text | Unstructured | Product reviews, news articles |
| Images / audio | Unstructured | Photos, voice recordings |
Why format constrains your questions
If your data is a CSV of transactions with columns date, amount, and category, you can immediately ask "what is the total spend by category?" That question is easy because the relevant information is already isolated in labelled columns.
Ask the same question of a folder of PDF receipts and you first have to extract the relevant numbers — a completely different (and much harder) task. The format has not changed what the data means, but it has completely changed the difficulty of the analysis.
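The difference in effort can be sketched in a few lines of Python. The transactions below are invented for illustration, and a single line of free text stands in for the PDF receipts (real PDF extraction would need a dedicated library and is not shown). The regular expression is an assumption that only fits these exact sentences.

```python
import csv
import io
import re
from collections import defaultdict

# Hypothetical transactions, invented for illustration.
csv_text = """date,amount,category
2024-01-03,12.50,food
2024-01-04,40.00,transport
2024-01-05,7.25,food
"""

# Structured: amount and category are already isolated in labelled columns.
totals = defaultdict(float)
for row in csv.DictReader(io.StringIO(csv_text)):
    totals[row["category"]] += float(row["amount"])

print(dict(totals))

# The same three purchases as free text: the numbers must be extracted first.
receipt_text = "Paid 12.50 for food. Then 40.00 on transport. Another 7.25 for food."

# A fragile pattern that only fits these exact sentences (an assumption).
pattern = re.compile(r"(\d+\.\d{2}) (?:for|on) (\w+)")
text_totals = defaultdict(float)
for amount, category in pattern.findall(receipt_text):
    text_totals[category] += float(amount)

print(dict(text_totals))
```

Both approaches arrive at the same totals here, but the CSV version relies only on the column names, while the free-text version depends on a pattern that would break as soon as a receipt is worded differently.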
A useful rule of thumb: the more structured your data, the smaller the gap between "I have the data" and "I can answer questions with it." Much of data cleaning is about closing that gap when data arrives less structured than you need.
Check your understanding
- 1. Which of these is an example of structured data?
- 2. Changing the format of data (e.g. from CSV to free text) changes what questions you can ask of it.
- 3. What makes JSON "semi-structured" rather than fully structured?
Where to go next
Next: reading data files, using Python's csv module to open a CSV and read its rows as dictionaries, so you can start working with real data.