Code of the Day
Beginner · Data Fundamentals

What is data?

Data is observations stored in a format a program can read — and the format you choose shapes every question you can ask.

Data Science · Beginner · 6 min read
By the end of this lesson you will be able to:
  • Explain what data is in terms of observations and format
  • Distinguish between structured, unstructured, and semi-structured data
  • Describe why the shape of data constrains what questions you can ask

Before you load a file or call a single function, it pays to be clear about what data actually is. Data is observations about the world, stored in a format a program can read. An observation might be a customer's age, a temperature reading, a tweet, or a medical scan, but it only becomes data once it is written down in such a format.

That last part — "in a format a program can read" — matters more than it sounds. The format determines what operations you can do, which tools you can use, and which questions you can even ask.

Three kinds of data

Data scientists talk about three broad categories:

Structured data follows a strict schema: every record has the same fields, the same types, and the same shape. A spreadsheet of sales orders is structured. A database table is structured. Structured data is easy for machines to query and aggregate, but creating it requires discipline up front.

Unstructured data has no imposed schema. A paragraph of customer feedback, a photograph, an audio clip — these are unstructured. There is information in them, but it is not pre-divided into labelled fields. Extracting structure from unstructured data is one of the hardest problems in the field.

Semi-structured data sits in between. It has some organisational markers — typically tags or keys — but does not enforce a rigid schema across records. JSON and XML are the canonical examples. A JSON object has named keys, but different objects in the same collection can have different keys, or the same key with different types.
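A minimal sketch of what that looks like in practice, using Python's standard json module. The records and field names are made up for illustration; the point is only that every record parses, yet no two are guaranteed to share a shape.

```python
import json

# Hypothetical API response: three records in one collection.
raw = """
[
  {"id": 1, "name": "Ada", "age": 36},
  {"id": 2, "name": "Grace", "email": "grace@example.com"},
  {"id": 3, "name": "Linus", "age": "unknown"}
]
"""

records = json.loads(raw)

# Every record has named keys, but the set of keys varies between records.
key_sets = [sorted(record.keys()) for record in records]

# The same key can also hold different types across records.
age_types = {type(record["age"]).__name__ for record in records if "age" in record}

for keys in key_sets:
    print(keys)
print(sorted(age_types))
```

Here the second record has an email but no age, and age is a number in one record and a string in another. Nothing in JSON itself prevents this, which is why code that reads it usually has to check before it trusts a field.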

Concretely:

Format                        Category         Example
CSV (comma-separated values)  Structured       Sales records, sensor logs
JSON                          Semi-structured  API responses, config files
Free text                     Unstructured     Product reviews, news articles
Images / audio                Unstructured     Photos, voice recordings

Why format constrains your questions

If your data is a CSV of transactions with columns date, amount, and category, you can immediately ask "what is the total spend by category?" That question is easy because the relevant information is already isolated in labelled columns.
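As a rough illustration, here is that question answered with Python's standard csv module. The three transactions are invented, and io.StringIO stands in for a real file so the sketch runs on its own.

```python
import csv
import io
from collections import defaultdict

# Made-up transactions standing in for a real CSV file.
csv_text = """date,amount,category
2024-01-03,12.50,groceries
2024-01-04,40.00,transport
2024-01-05,7.25,groceries
"""

totals = defaultdict(float)
for row in csv.DictReader(io.StringIO(csv_text)):
    # The csv module yields every value as a string, so convert amount explicitly.
    totals[row["category"]] += float(row["amount"])

for category, total in sorted(totals.items()):
    print(f"{category}: {total:.2f}")
```

The whole analysis is a loop and one line of arithmetic, because the column names already tell the program where each piece of information lives.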

Ask the same question of a folder of PDF receipts and you first have to extract the relevant numbers — a completely different (and much harder) task. The format has not changed what the data means, but it has completely changed the difficulty of the analysis.
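To get a feel for the difference, the sketch below assumes the text has already been pulled out of three receipts (itself a separate step that is not shown) and tries to find each total with a regular expression. The receipt text and the pattern are both invented for illustration.

```python
import re

# Hypothetical text already extracted from three receipts.
receipts = [
    "Corner Market\nTotal: $12.50\nThank you!",
    "CITY TRANSIT  fare paid 40.00 USD",
    "Receipt #881 / amount due - see attached",
]

# One guess at what a total looks like: "Total:" followed by a dollar amount.
total_pattern = re.compile(r"Total:\s*\$(\d+\.\d{2})")

amounts = []
for text in receipts:
    match = total_pattern.search(text)
    # None means this receipt did not fit the guessed pattern.
    amounts.append(float(match.group(1)) if match else None)

print(amounts)
```

Only the first receipt matches. The second contains an amount in a layout the pattern did not anticipate, the third contains no amount at all, and none of them states a category. Each of those gaps needs its own handling before "total spend by category" can even be attempted.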

A useful rule of thumb: the more structured your data, the smaller the gap between "I have the data" and "I can answer questions with it." Much of data cleaning is about closing that gap when data arrives less structured than you need.
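One small, hedged example of closing that gap: taking records that arrive with inconsistent fields and types and turning them into rows that all share one shape. The records, the clean function, and the "uncategorised" placeholder are assumptions made for this sketch, not a standard recipe.

```python
# Hypothetical records as they might arrive from an API.
records = [
    {"date": "2024-01-03", "amount": "12.50", "category": "groceries"},
    {"date": "2024-01-04", "amount": 40},
    {"date": "2024-01-05", "amount": "7.25", "category": "Groceries "},
]


def clean(record):
    """Return a row with the same three fields and types every time."""
    return {
        "date": record["date"],
        # Amounts arrive as strings or numbers; store them all as floats.
        "amount": float(record["amount"]),
        # Missing categories get a placeholder (an assumption for this sketch).
        "category": record.get("category", "uncategorised").strip().lower(),
    }


rows = [clean(record) for record in records]
for row in rows:
    print(row)
```

After cleaning, every row has a date, a numeric amount, and a lower-case category, so a question like "total spend by category" can be asked directly.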

Check your understanding

  1. Which of these is an example of structured data?
  2. Changing the format of data (e.g. from CSV to free text) changes what questions you can ask of it.
  3. What makes JSON "semi-structured" rather than fully structured?

Where to go next

Next: reading data files — using Python's csv module to open a CSV and read its rows as dictionaries, so you can start working with real data.
