Code of the Day
AdvancedPipeline Design

ETL vs ELT

Two architectures for moving data from source to analysis — and why the rise of cloud data warehouses made ELT the default.

Data ScienceAdvanced6 min read
By the end of this lesson you will be able to:
  • Define ETL (extract, transform, load) and describe its original motivation
  • Define ELT (extract, load, transform) and explain when it is more appropriate
  • Explain how cloud data warehouses changed the economics that favoured ETL

Before data reaches a model or an analyst, it travels through a pipeline that moves it from source systems (databases, APIs, files) to somewhere queryable. Two architectures dominate: ETL and ELT. The letters describe the order of operations, but the choice has significant implications for infrastructure, flexibility, and cost.

ETL: extract, transform, load

In the traditional ETL pattern, data is extracted from source systems, transformed into the target schema on a separate processing machine, then loaded into the destination data store. The transformation happens before loading.

This made sense when storage was expensive. You transformed data to keep only what you needed, in the shape you needed it, before committing it to the destination. The destination was a carefully designed data warehouse with a fixed schema. Loading raw data was not an option — storage cost money, and querying unstructured raw data was too slow on the hardware of the time.

The downside is rigidity. If analysis requirements change, you need to modify the transformation logic and reprocess historical data. The transformation code is a bottleneck: it is often a separate system maintained by a different team from the analysts who need the data.

ELT: extract, load, transform

In ELT, data is extracted from source systems and loaded raw into the destination first. Transformation happens inside the destination, using the warehouse's own compute.

The shift became practical when two things changed:

Storage became cheap. Cloud object storage (S3, GCS) costs a few dollars per terabyte per month. Loading raw data is no longer a financial decision.

In-warehouse compute became powerful. BigQuery, Snowflake, and Redshift can run complex SQL transformations over terabytes in seconds. There is no longer a speed penalty for transforming inside the warehouse versus outside it.

The advantages are significant: raw data is always preserved (you can re-derive any transformation), analysts can write transformations in SQL without involving a separate engineering pipeline, and new analytical requirements are met by writing a new query rather than re-architecting the transformation step.

The dbt (data build tool) ecosystem is built entirely on the ELT pattern: raw data lands in the warehouse via a separate loader (Fivetran, Airbyte, custom scripts), and dbt manages the transformation layer as versioned SQL models. This separation of concerns — loading vs transforming — is one reason dbt became ubiquitous quickly.

Choosing between them

For most modern data science teams working with cloud infrastructure, ELT is the default. ETL still makes sense when:

  • The destination has strict storage or compute constraints.
  • Data must be masked or anonymised before landing (compliance requirements may prohibit storing raw PII anywhere).
  • The transformation is computationally intensive and the warehouse's query engine is not cost-effective for that workload.

For local or on-premise pipelines without access to powerful warehouse compute, ETL with Python (pandas, polars, dbt-core with SQLite) remains practical.

Where to go next

Next: reproducible pipelines — the engineering practices that ensure a pipeline produces the same output from the same input, regardless of when or where it runs.

Finished reading? Mark it complete to track your progress.

On this page