DuckDB Beyond the Hype

Alireza Sadeghi
9 min read · Sep 18, 2024


The original article was published on my Practical Data Engineering Newsletter.

After years of working in the data space, you witness the rise and fall of numerous tools and products. Over time, you become more skeptical of the latest shiny tools, often dismissing them as marketing-driven hype, until you reach a point where they can no longer be ignored.

DuckDB was one such tool for me. Initially, I dismissed it, but eventually, the buzz around it became too loud to ignore. That’s when curiosity got the better of me, and I decided to dive in.

As with some new technologies, I initially struggled to grasp its core functionality. The official website describes it as a “fast in-process analytical database”, but that wasn’t very enlightening.

Then I came across the phrase “SQLite for OLAP” which helped me understand it better, though I was still unsure about its place in my mental data stack.

It wasn’t until I explored discussions in data communities like Reddit that I realised DuckDB, like many open-source projects, had evolved beyond its creators’ original vision. That’s when I knew it was time to test it out for myself and decide if the hype was justified.

WHAT DUCKDB IS

From what I found, DuckDB is a hybrid engine with a diverse range of functionalities. Let’s explore these features in more detail.

An Embeddable & Portable Database

DuckDB is an “embeddable” database system, which is how its creators described it in a SIGMOD conference paper back in 2019.

Like SQLite, it allows you to store your data in a single .duckdb database file, making it easily portable within other projects.

Besides being embeddable, DuckDB operates much like traditional DBMS systems, but without requiring a long-running server process like MySQL or PostgreSQL. There’s no need to start a database server and establish a socket connection to it over a hostname and port.

All you need to do is point to where the single DuckDB database file is stored, or start an in-memory session without even needing a physical database file.
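As a rough sketch, here is what both modes look like with the Python client (the file name is just for illustration):

```python
import duckdb

# A persistent, single-file database (file name is hypothetical)
con = duckdb.connect("analytics.duckdb")
con.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER, type VARCHAR)")

# Or a purely in-memory session, with no file or server involved
mem = duckdb.connect()
print(mem.sql("SELECT 42 AS answer").fetchall())
```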

A Columnar OLAP Database

DuckDB is a columnar database, making it highly efficient for running analytical queries.

It supports two main storage options: its native .duckdb format and open-standard file formats like Parquet, which DuckDB reads and writes with impressive efficiency, including support for predicate pushdown.
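For instance, here is a quick sketch of querying and writing Parquet straight from SQL (the file and column names are hypothetical):

```python
import duckdb

con = duckdb.connect()

# Filters on a Parquet scan can be pushed down, so only the row groups
# that may contain matching rows are actually read
con.sql("""
    SELECT type, count(*) AS n
    FROM read_parquet('events.parquet')
    WHERE created_at >= DATE '2024-08-01'
    GROUP BY type
""").show()

# Writing results back out as Parquet is a single COPY statement
con.sql("""
    COPY (SELECT * FROM read_parquet('events.parquet') WHERE type = 'PushEvent')
    TO 'push_events.parquet' (FORMAT parquet)
""")
```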

Internally, DuckDB uses a row-columnar structure: data is sliced into row groups of 122,880 records each, and within each group, columns are stored separately and compressed, similar to popular binary formats like Parquet and ORC.

This architecture allows you to efficiently store and query analytical datasets on a local machine or a single server without the overhead of a full-fledged OLAP database engine.

DuckDB’s internal data structure

Interoperable SQL-Powered Dataframes

DuckDB’s Python library is essentially a dataframe on steroids. It integrates seamlessly with popular dataframe libraries like Pandas and Polars, allowing efficient in-memory operations.

What sets DuckDB apart is its ability to run SQL queries directly on Python dataframes. You can query Pandas, Polars and Apache Arrow dataframe objects as though they were SQL tables.

For some frameworks, such as Apache Arrow, DuckDB uses a zero-copy mechanism for fast conversion: no serialisation is required to translate in-memory objects between the different representations.

DuckDB’s interoperability between different dataframe APIs

You can join data from different objects, such as a Polars dataframe and a Pandas dataframe, in a single SQL query. The results can be stored back into a DuckDB database or exported to file formats like Parquet on external storage.
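A minimal sketch of what this looks like (the dataframes and columns are made up for illustration):

```python
import duckdb
import pandas as pd
import polars as pl

# Hypothetical in-memory objects from two different dataframe libraries
orders = pd.DataFrame({"user_id": [1, 2, 1], "amount": [10.0, 25.0, 7.5]})
users = pl.DataFrame({"user_id": [1, 2], "name": ["ada", "grace"]})

# DuckDB picks up the local variables by name and joins them in plain SQL
result = duckdb.sql("""
    SELECT u.name, sum(o.amount) AS total
    FROM orders o
    JOIN users u USING (user_id)
    GROUP BY u.name
    ORDER BY total DESC
""")

result.show()
result.write_parquet("totals.parquet")  # or persist into a .duckdb database
```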

A Federated Query Engine

DuckDB offers a simple, efficient way to query external data systems through its extensions. Similar to distributed query engines such as Athena, Presto or Trino, it allows seamless joins across various external data sources.

You can directly query DBMS systems like MySQL and Postgres, open data files like JSON, CSV, and Parquet files stored in cloud storage systems like Amazon S3, and modern open table formats like Apache Iceberg and Delta Lake.

Though DuckDB doesn’t have the concept of external tables found in systems like Hive or Redshift, you can still create persistent views over external tables or data files, which act much like read-only external tables.

DuckDB’s federated query capabilities

Querying a remote file in cloud storage doesn’t require downloading the entire file: DuckDB fetches only the parts it needs, which makes sampling and inspecting remote data efficient.
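To make this concrete, here is a sketch of attaching a Postgres database and querying Parquet files on S3 from the same session. The connection details, bucket, and table names are hypothetical, and S3 credentials (for example via CREATE SECRET) are omitted:

```python
import duckdb

con = duckdb.connect("lakehouse.duckdb")

# Extensions for remote files and Postgres (often auto-loaded in recent versions)
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
con.execute("INSTALL postgres")
con.execute("LOAD postgres")

# Attach a live Postgres database
con.sql("ATTACH 'dbname=shop host=localhost user=app' AS pg (TYPE postgres)")

# A view over Parquet files on S3 behaves much like a read-only external table
con.sql("""
    CREATE OR REPLACE VIEW raw_events AS
    SELECT * FROM read_parquet('s3://my-bucket/events/*.parquet')
""")

# Join the remote Postgres table with the S3-backed view in a single query
con.sql("""
    SELECT c.country, count(*) AS events
    FROM raw_events e
    JOIN pg.public.customers c ON c.id = e.customer_id
    GROUP BY c.country
""").show()
```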

A Single-Node Compute Engine

DuckDB can also act as a single-node compute engine, performing ephemeral batch transformations. It’s like having a stand-alone Spark at your disposal for smaller-scale workloads.

This is particularly useful in scenarios like data lake architectures, where DuckDB can efficiently serialise raw data (e.g., JSON or CSV) into optimised formats like Parquet, and then transform or aggregate that data.

I tested this functionality by implementing a simple data lake following the medallion architecture, using GitHub events data dumps (known as GH Archive) as the data source.

I set up an hourly data ingestion pipeline to collect and load hourly archives from the gharchive.org server into my data lake’s Raw zone on S3.

I then created an hourly transformation pipeline using DuckDB to unpack, clean and serialise the hourly JSON dumps into Parquet files published to my Silver (clean) zone.
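A rough sketch of that hourly transformation, assuming hypothetical bucket names and paths (GH Archive dumps are gzipped, newline-delimited JSON):

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
# S3 credentials omitted; they can be configured with CREATE SECRET

hour = "2024-08-27-14"  # the hourly partition being processed

# Unpack the raw JSON dump, keep the fields of interest,
# and serialise the result as Parquet into the Silver zone
con.sql(f"""
    COPY (
        SELECT
            id,
            type,
            repo.name   AS repo_name,
            actor.login AS actor_login,
            created_at
        FROM read_json_auto('s3://my-lake/raw/{hour}.json.gz')
    )
    TO 's3://my-lake/silver/events_{hour}.parquet' (FORMAT parquet)
""")
```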

Finally, I used DuckDB again in a daily job to aggregate the GitHub events by event type, repository, and event date, and export the result to the Gold (analytics) zone.

The following code sample demonstrates the aggregation logic. On my laptop, DuckDB reads and processes 24 compressed Parquet files from S3, containing a total of approximately 5 million records, and exports the results back to the cloud in about 5 minutes.
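Here is a sketch of that daily aggregation, following the hypothetical paths and columns used above:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
# S3 credentials omitted for brevity

day = "2024-08-27"  # the day being aggregated

# Aggregate the 24 hourly Silver files for one day and publish to the Gold zone
con.sql(f"""
    COPY (
        SELECT
            type,
            repo_name,
            CAST(created_at AS DATE) AS event_date,
            count(*) AS event_count
        FROM read_parquet('s3://my-lake/silver/events_{day}-*.parquet')
        GROUP BY ALL
    )
    TO 's3://my-lake/gold/daily_events_{day}.parquet' (FORMAT parquet)
""")
```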

The result was a simple and efficient data pipeline, and analytics-ready datasets in Parquet format that could be queried from my local machine using DuckDB’s SQL or Python API.

The following query, run with DuckDB’s Python API on my laptop, returns the top GitHub repositories that received the most stars on 2024-08-27:
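This is a sketch against the hypothetical Gold layout from the previous steps; star events appear as WatchEvent in the GitHub events feed:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

con.sql("""
    SELECT repo_name, sum(event_count) AS stars
    FROM read_parquet('s3://my-lake/gold/daily_events_2024-08-27.parquet')
    WHERE type = 'WatchEvent'   -- stars show up as WatchEvent in the events feed
    GROUP BY repo_name
    ORDER BY stars DESC
    LIMIT 10
""").show()
```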

DuckDB might not be able to handle terabyte-scale data like distributed engines can, but many use cases don’t require that scale. Besides, modern servers are powerful enough to handle a lot of the heavy lifting without needing distributed processing frameworks.

Given the above capabilities, what would you call a system which is:

An embeddable and portable DBMS, a columnar OLAP database, a SQL-powered interoperable dataframe, a federated query engine, and a single-node compute engine?

Let’s call it DuckDB!


WHAT DUCKDB CAN BE

Now that we’ve explored what DuckDB is, let’s dive into its future potential, examining how it can evolve within the broader data engineering and data science ecosystem.

The Missing Piece in the Data Scientist’s Toolbox

DuckDB fills a crucial gap in the data scientist’s toolbox: a powerful in-process database engine that integrates seamlessly into data science workflows without the need for complicated setups, installations, or data transfers between Python and external databases.

According to Mark Raasveldt, one of DuckDB’s creators, this was one of the primary design goals for building it.

Data scientists often find the process of setting up and using external databases cumbersome. They prefer working with plain-text or binary data files (like CSV or Parquet) for the ease of immediate access and manipulation.

However, DuckDB changes this dynamic by enabling data scientists to work with a relational SQL database directly within their Python environment, eliminating the need for external database installations and the associated data transfer overhead.

Bring Your Own Compute

As personal computers and laptops become more powerful, DuckDB enables a “bring-your-own-compute” model. It allows users to leverage their local machines to run analytics on shared data, whether that data is stored locally or in the cloud.

Users can easily attach to data files or external database systems hosted in the cloud and perform their analyses using DuckDB’s local processing power. This is especially beneficial for collaborative projects, such as those involving research teams.

An alternative hybrid model is also possible, where part of the data processing happens in the cloud while another part is handled locally. This approach, pioneered by MotherDuck, combines cloud scalability with the efficiency of local compute power.

A Standard Format for Publishing Relational and Analytical Datasets

The DuckDB format (.duckdb database file) has the potential to become the standard for sharing relational data on the internet.

Instead of exporting multiple data objects as separate CSV files, dataset publishers can export a complete dataset as a single .duckdb file, containing all the data in an optimised, ready-to-analyse columnar format.

This approach streamlines data sharing and ensures that the data is structured for efficient querying and analysis right out of the box.
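As a sketch of how this could work (file and table names are hypothetical):

```python
import duckdb

# Publisher side: bundle a multi-table dataset into a single database file
pub = duckdb.connect("gh_events_2024.duckdb")
pub.sql("CREATE TABLE repos AS SELECT * FROM read_parquet('repos.parquet')")
pub.sql("CREATE TABLE daily_events AS SELECT * FROM read_parquet('daily_events_*.parquet')")
pub.close()

# Consumer side: open the published file read-only and start analysing
data = duckdb.connect("gh_events_2024.duckdb", read_only=True)
data.sql("SELECT count(*) AS n FROM daily_events").show()
```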

Expand DataFrames’ Footprint in Data Engineering

The DuckDB Python package bridges SQL and Python, allowing users to query Python in-memory objects like Pandas dataframes as if they were database tables.

This capability makes dataframes accessible to a wider audience, including those more comfortable with SQL than Python. It also enhances the use of dataframes for building data transformation pipelines using SQL.

By leveraging DuckDB’s SQL interface, users could even apply transformation frameworks like dbt on top of their dataframes, further extending the role of dataframes in data engineering workflows.

Turn OLTP Databases into HTAP

The recent introduction of pg_duckdb and pg_analytics is another exciting development.

With the pg_duckdb Postgres extension, you can run analytical queries directly within a Postgres database without needing to offload data to separate analytical systems.

With the pg_analytics extension, it’s possible to run queries over datasets stored in object stores like S3, and in open table formats like Iceberg or Delta Lake, directly from Postgres, using the embedded DuckDB as the underlying query engine.

Cloud PostgreSQL providers such as Crunchy Data are already integrating DuckDB to extend their service offerings for OLAP and analytical use cases.

In-Browser SQL Analytics

In-browser analytics are becoming a reality, thanks to DuckDB’s WebAssembly implementation (DuckDB-WASM). This brings SQL-powered analysis directly into the browser, enabling data analysts to interact with datasets using SQL without installing any software.

It also offers a cost-effective solution for organisations that expose public datasets by allowing the analysis to be done client-side, reducing server load.

For instance, Hugging Face recently introduced a Datasets Explorer Chrome extension, allowing users to explore their datasets using SQL directly in the browser — powered by DuckDB’s engine under the hood. This innovation makes analysing public datasets easier and more accessible than ever.

As we’ve explored, DuckDB is far more than just another database engine — it unlocks new possibilities for data storage, analysis, and computation.

If you’ve discovered additional capabilities or unique use cases for DuckDB, I’d love to hear your thoughts in the comments.

Subscribe to my Practical Data Engineering newsletter to get exclusive early access to my latest stories and insights on data engineering.



Alireza Sadeghi

Senior Data Engineer with 16 years of experience building and scaling software and data platforms. My Newsletter: https://practicaldataengineering.substack.com