What is data?

Data is observations stored in a format a program can read — and the format you choose shapes every question you can ask.

Data Science | Beginner | 6 min read

By the end of this lesson you will be able to:

Explain what data is in terms of observations and format
Distinguish between structured, unstructured, and semi-structured data
Describe why the shape of data constrains what questions you can ask

Before you load a file or call a single pandas function, it pays to be clear about what data actually is. Data is observations about the world, stored in a format a program can read. An observation might be a customer's age, a temperature reading, a tweet, or a medical scan, but it only becomes data once it is recorded in a form a program can read, whether or not that form is neatly organised.

That last part — "in a format a program can read" — matters more than it sounds. The format determines what operations you can do, which tools you can use, and which questions you can even ask.

Three kinds of data

Data scientists talk about three broad categories:

Structured data follows a strict schema: every record has the same fields, the same types, and the same shape. A spreadsheet of sales orders is structured. A database table is structured. Structured data is easy for machines to query and aggregate — but creating it requires discipline up front.

Unstructured data has no imposed schema. A paragraph of customer feedback, a photograph, an audio clip: these are unstructured. There is information in them, but it is not pre-divided into labelled fields. Extracting structure from unstructured data is often hard, and it is a central concern of fields such as natural language processing and computer vision.

Semi-structured data sits in between. It has some organisational markers — typically tags or keys — but does not enforce a rigid schema across records. JSON and XML are the canonical examples. A JSON object has named keys, but different objects in the same collection can have different keys, or the same key with different types.

Concretely:

Format	Category	Example
CSV (comma-separated values)	Structured	Sales records, sensor logs
JSON	Semi-structured	API responses, config files
Free text	Unstructured	Product reviews, news articles
Images / audio	Unstructured	Photos, voice recordings
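
To make the semi-structured case concrete, here is a minimal sketch using Python's built-in json module. The three records are invented for illustration (they do not come from a real API), and the variable names are assumptions. The sketch shows that objects in the same collection can carry different keys, and that the same key can hold different types.

```python
import json

# Hypothetical API response, invented for illustration.
raw = """
[
  {"id": 1, "name": "Ada", "age": 36},
  {"id": 2, "name": "Grace", "email": "grace@example.com"},
  {"id": 3, "name": "Linus", "age": "unknown"}
]
"""

records = json.loads(raw)

# Collect every key that appears in at least one record.
all_keys = set()
for record in records:
    all_keys.update(record)

# For each record, list the keys it lacks compared with the others.
missing = {r["id"]: sorted(all_keys - set(r)) for r in records}

# The "age" key exists in two records but with different types.
age_types = sorted({type(r["age"]).__name__ for r in records if "age" in r})

print(missing)
print(age_types)
```

Every record parses without complaint, yet no single set of columns describes all three. Code that reads such data has to decide what to do about missing keys and mixed types before it can aggregate anything.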

Why format constrains your questions

If your data is a CSV of transactions with columns date, amount, and category, you can immediately ask "what is the total spend by category?" That question is easy because the relevant information is already isolated in labelled columns.
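
As a rough sketch of how short that path is, the following uses Python's csv module on a small in-memory CSV. The three transactions are invented for illustration; only the column names date, amount, and category come from the example above.

```python
import csv
import io
from collections import defaultdict

# Hypothetical transactions, invented for illustration.
csv_text = """date,amount,category
2024-01-03,12.50,groceries
2024-01-04,40.00,transport
2024-01-05,7.25,groceries
"""

totals = defaultdict(float)
for row in csv.DictReader(io.StringIO(csv_text)):
    # The csv module yields every value as a string, so convert before adding.
    totals[row["category"]] += float(row["amount"])

for category, total in sorted(totals.items()):
    print(f"{category}: {total:.2f}")
```

The whole analysis is one loop, because the labelled columns already say which value is the amount and which is the category.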

Ask the same question of a folder of PDF receipts and you first have to extract the relevant numbers — a completely different (and much harder) task. The format has not changed what the data means, but it has completely changed the difficulty of the analysis.
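
The sketch below hints at why. It assumes the text has already been pulled out of each PDF (itself a non-trivial step that is not shown) and tries to find the total with a regular expression. The receipt strings and the pattern are hypothetical, chosen only to illustrate the problem.

```python
import re

# Hypothetical receipt text, as it might look after PDF text extraction.
receipts = [
    "ACME Market\nMilk 2.50\nBread 3.00\nTOTAL: 5.50\nThank you!",
    "City Cabs\nFare ............ $18.00\nTotal due $18.00",
    "Corner Cafe\nAmount paid: 4,20 EUR",
]

# Assumed pattern: the word "total", any non-digit characters,
# then a number with a decimal point and two decimal places.
pattern = re.compile(r"total\D*(\d+\.\d{2})", re.IGNORECASE)

amounts = []
for text in receipts:
    match = pattern.search(text)
    # Record None when the pattern finds nothing, rather than guessing.
    amounts.append(float(match.group(1)) if match else None)

print(amounts)
```

The pattern handles the first two receipts but misses the third, which uses different wording and a comma as the decimal separator. Each new layout needs another rule, and none of the receipts says which spending category it belongs to.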

A useful rule of thumb: the more structured your data, the smaller the gap between "I have the data" and "I can answer questions with it." Much of data cleaning is about closing that gap when data arrives less structured than you need.

Where to go next

Next: reading data files — using Python's csv module to open a CSV and read its rows as dictionaries, so you can start working with real data.
