The History and Evolution of Open Table Formats

Alireza Sadeghi
22 min read · Aug 23, 2024


This is a scaled-down version of the original two-part series published on the Practical Data Engineering newsletter on Substack.

If you have been following trends in the data engineering landscape over the past few years, you have surely been hearing a lot about Open Table Formats and the Data Lakehouse, if not already working with them!

But what is all the hype about table formats if they have always existed and we have always been working with tables when dealing with structured data in any application?

In this blog post, we will delve into the history and evolution of open table formats within the data landscape. We will explore the challenges that led to their inception, the key innovations that have defined them, and the impact they have had on the industry.

The Origin of Table Formats

Presenting information in a two-dimensional tabular format has been the most fundamental and universal method for displaying structured data. It dates back over 3,500 years to the Old Babylonian period, when the earliest known tabular data were recorded on clay tablets.

The modern concept of database tables emerged with the invention of relational databases, inspired by E.F. Codd’s paper on the Relational Model published in 1970.

Since then, table formats have been the primary abstraction for managing and working with structured data in relational database management systems, such as the pioneering System R.

Table Format Abstraction

Data tables are logical datasets, an abstraction layer over physical data files stored on disk, providing a unified, two-dimensional tabular view of records.

The storage engine combines records from various objects for a dataset and presents them as one or more logical tables to the end user.

This logical table presentation offers the advantage of hiding the physical characteristics of data from applications and users.

So, we’ve been hearing a ton about open table formats lately, but what’s the big deal? And what’s the difference between open and non-open or closed formats anyway? To figure that out, let’s dive into how a general database management system is implemented.

Relational Table Format

Before the Big Data era and the emergence of Apache Hadoop in the mid-2000s, traditional Database Management Systems (DBMS) adhered to a monolithic architectural design.

This architecture consists of several tightly coupled, interconnected layers, each serving a specific function in the database's operation and combined into a single unified system. The storage layer, in particular, manages the physical aspects of data persistence.

At the core of this structure lies the Storage Engine. This component is the lowest abstraction level, overseeing the physical organisation and management of data on disk.

High level DBMS architecture

What’s the implication?

The design lacked interoperability. One could not just copy database files to another system, or use a generic language like Python to process them. Nor could one point a generic query engine at the database’s OS files and interact with the data.

Given these constraints, the concept of an Open Table Format (OTF) as we understand it today was non-existent. Traditional databases employed proprietary storage formats tightly integrated with their specific implementations.

Hadoop and Big Data Revolution

Let's fast forward from the 1970s to 2006, when the Big Data revolution took place and the Apache Hadoop project was born out of Yahoo, leading to the disassembly of database systems.

A major architectural breakthrough was the decoupling of storage and compute. This architectural change allowed for vast data storage in common semi-structured text-based and binary formats like CSV, JSON, Avro and Parquet, on cheap commodity hardware managed by the HDFS distributed file system.

Data could be stored much like files on a local file system and processed using distributed processing frameworks of choice like MapReduce, Pig, Hive, Impala and Presto. This was a game-changer for those accustomed to inflexible, expensive, monolithic storage systems and proprietary data warehouses.

But the real breakthrough, as stated by AMPLab co-director Michael Franklin, was achieving data independence as a result of the new decoupled architecture:

The real breakthrough was the separation of the logical view you have of the data and how you want to work with it, from the physical reality of how the data is actually stored.

That is why Big Data was such a Big Hype back then — a similar level of hype surrounds Generative AI today, creating a sense of déjà vu for some.

Nevertheless, big data was a true revolution. It provided the foundation for many of the innovations in the open data ecosystem that followed.

1st Generation OTF — The Birth of Open Table Format

The initial release of Apache Hadoop presented significant challenges for data engineers.

Expressing data analysis and processing workloads in MapReduce logic using Java was both complex and time-consuming. Moreover, Hadoop lacked a mechanism for storing and managing schemas for datasets on its file system.

To bridge this gap, Facebook (now Meta), an early and influential Hadoop adopter, initiated the Hive project. The goal was to introduce SQL and tabular structures, familiar from traditional relational databases, into the Hadoop and HDFS ecosystem.

However, a key distinction was its new architectural approach:

It was built on top of the decoupled physical layer, leveraging open data formats stored on the HDFS distributed file system.

Open Table Architecture on Hadoop

Impact of Apache Hive

Facebook open-sourced Hive in 2008, making it available to the broader community. A few years later, Cloudera, a prominent Hadoop vendor, developed Apache Impala.

With the introduction of Apache Hive and Impala into the Hadoop stack, the concept of open table formats built upon open file formats was born. Hive open tables, along with directory-based partitioning, became the primary abstractions for data ingestion, data modeling, and management within the Hadoop ecosystem.

Evolution of Columnar Binary File Formats

Another pivotal advancement was the development of efficient columnar open file formats. This began with RCFile, a first-generation columnar binary serialisation format from the Apache Hive project.

Subsequent innovations included Apache ORC as an improved version of RCFile, released in 2013, and Apache Parquet, a joint effort between Twitter and Cloudera, also released in 2013.

These new open file formats dramatically enhanced the performance of OLAP-based analytical workloads on Hadoop, laying the groundwork for building OLAP storage engines directly on data lakes.

Since then, ORC and Parquet have become the de facto standard open file formats for managing data at rest on data lakes, with Parquet being more popular and enjoying wider adoption and support in the ecosystem.
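To make the "open" part concrete, here is a minimal sketch, assuming the pyarrow library is available, that writes and reads a Parquet file directly with no database engine involved; any Parquet-aware tool could then read the same file.

import pyarrow as pa
import pyarrow.parquet as pq

# Build a small in-memory table and persist it as an open Parquet file.
table = pa.table({"id": [1, 2, 3], "city": ["Rome", "Lagos", "Lima"]})
pq.write_table(table, "cities.parquet")

# Any Parquet-aware engine (Spark, Trino, DuckDB, pandas, ...) can read it back.
print(pq.read_table("cities.parquet").to_pydict())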

Next we will dive deeper into how the Hive table format is structured, but before that let's generalise the physical design that engines such as Hive and Impala use, which relies heavily on the file system directory hierarchy. Let's call these directory-oriented table formats.

Directory-oriented Table Formats

The straightforward way to treat data as a table in a distributed file system like HDFS (a data lake) involves projecting a table onto a directory. That directory contains data files and possibly some sub-directories (i.e. partitions).

The core principle is to organise data files in a directory tree. In essence, a table is just a collection of files tracked at the directory level, accessible by various tools and compute engines.

The important factor to note is that this architecture is inherently tied to the physical file system layout, relying on file and directory operations for data management. This has been the standard practice for storing data in data lakes since the inception of Hadoop.

Directory-oriented table format

What is the implication for Query Engines?

Since table partitions are represented as sub-directories, it becomes the responsibility of the query engine to scan each partition sub-directory in order to identify the relevant data files during the query planning phase.

This implies that the physical partitioning is tightly coupled with the logical partitioning at the table level, a coupling that comes with its own constraints, which will be discussed later.

Now that we have covered what a directory-based table looks like, let's look at the Hive table format.

Hive Table Format

With the presented storage model, it’s fair to say that Apache Hive is a directory-oriented table format. It relies on the underlying file system’s API for mapping files to tables and partitions. Consequently, Hive is heavily influenced by the physical layout of data within the distributed file system.

Hive employs its own partitioning scheme, using field names and values to create partition directories. It manages schema, partition, and other metadata in a relational database known as the Metastore.

The significant shift so far with Hive + Hadoop is:

Unlike traditional monolithic databases, Hadoop and Hive's decoupled approach allows other query and processing engines to process the same data on HDFS using the Hive engine's metadata.

The following example shows a typical Hive temporal partitioning scheme based on year, month and day.

Hive-style partitioning scheme
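Since the original diagram is not reproduced here, a hypothetical sales table partitioned this way would be laid out on the file system roughly as follows (all paths are illustrative):

/warehouse/sales/
    year=2023/month=10/day=14/part-00000.parquet
    year=2023/month=10/day=15/part-00000.parquet
    year=2023/month=10/day=15/part-00001.parquet

Each partition column becomes a directory level named field=value, and query engines prune partitions by matching these directory names against the query predicates.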

This leads to another major difference between the new data architecture and traditional DBMS systems: While traditional systems tightly bind data and metadata like table definitions, the new paradigm separates these components. This decoupling offers great flexibility.

Hive Metastore

Moreover, a centralised schema registry (such as the Hive Metastore, which has become the de facto standard) allows other processing engines like Spark and Presto/Trino to interact with data in a structured tabular format.

By accessing table metadata within the registry, query engines can determine file locations on the underlying storage layer, understand partitioning schemes, and execute their own read and write operations.
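As a minimal sketch, assuming a table named sales is already registered in the Hive Metastore (the table name is hypothetical), an engine like Spark can query it without knowing anything about the underlying file layout:

from pyspark.sql import SparkSession

# Enabling Hive support lets Spark resolve table names, schemas and
# partition locations through the shared Hive Metastore.
spark = (
    SparkSession.builder
    .appName("metastore-example")
    .enableHiveSupport()
    .getOrCreate()
)

# Spark looks up 'sales' in the Metastore, locates its files and partitions
# on the underlying storage, and plans the scan itself.
spark.sql("SELECT year, month, COUNT(*) AS cnt FROM sales GROUP BY year, month").show()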

I hope you now understand why we refer to this design as open and perhaps begin to appreciate its flexibility and open architecture compared to previous generations of database systems.

Drawbacks and Limitations of the Directory-oriented and Hive Table Formats

For nearly a decade, Hive reigned supreme as the most popular table format on Hadoop platforms. Tech giants like Uber, Facebook, and Netflix heavily relied on Hive to manage their data.

However, as these companies scaled their data platforms, they encountered significant scalability and data management challenges that Hive couldn’t adequately address.

Let's look at some of the shortcomings of directory-oriented table formats and Hive-style tables that prompted the engineers at these tech companies to seek alternatives.

  • High Dependency on Underlying File System — This architecture heavily relies on the underlying storage system to provide essential guarantees like atomicity, concurrency control, and conflict resolution. File systems lacking these properties, such as Amazon S3’s absence of atomic rename, necessitate custom workarounds.
  • File Listing Performance — Directory and file listing operations can become performance bottlenecks, particularly when executing large-scale queries. Cloud object stores like S3 impose significant limitations on directory-style listing operations (e.g. S3's LIST limit of 1,000 objects per request).
  • Query Planning Overhead — On distributed file systems like HDFS, query planning can be time-consuming due to the need for exhaustive file and partition listing.

Drawbacks and challenges of using Hive-style partitioning:

  • Over Partitioning — Tightly coupling physical and logical partitioning can lead to over-partitioning, especially with high-cardinality partition columns like year/month/day. This results in excessive small files, increased metadata overhead, and slower query planning due to the need to scan numerous partitions.
  • Cloud Effect — Cloud data lakes exacerbate over-partitioning issues due to API call limitations. Jobs scanning many partitions and files often encounter throttling, leading to severe performance degradation.
  • Poor Performance — Queries on Hive-style directory-based partitions can be slow without specifying the partition key for data skipping, especially with deep partition hierarchies.

Imagine a Hive table partitioned by 20 provinces, followed by year=/month=/day=/hour= sub-partitions. At 20 provinces × 365 days × 24 hours, that is roughly 175,200 partitions per year, so such a table would accumulate over 1 million partitions in 6 years.

Over-partitioning issue on Hive

Drawbacks of using External Metastore

In addition to the above drawbacks, Hive-style tables using an external Metastore add more challenges into the mix:

  • Performance Bottleneck — Both Hive and Impala rely on an external metadata store (typically a relational database like MySQL or PostgreSQL), which can become a performance bottleneck due to frequent communication for table operations.
  • Metadata Performance Scalability — As data volumes and partition counts grow, the Metastore becomes increasingly burdened, leading to slow query planning, increased load, and potential out-of-memory errors.
  • Single Point of Failure — The Metastore represents a single point of failure. Crashes or unavailability can cause widespread query failures.
  • Inefficient Statistics Management — Hive’s reliance on partition-level column statistics, stored in the Metastore, can hinder performance over time. Wide tables with numerous columns and partitions accumulate vast amounts of statistical data, slowing down query planning and impacting DDL commands.

A First-hand Experience

I have personally faced many of the above challenges working with Hive in production over many years. In a recent project, our development team had to rename some large, wide managed Hive tables with about 10k partitions, and the rename would simply hang, not completing even after many hours.

After investigation, I found that each table had around 300k statistical records that Hive was trying to gather and update as part of the rename. Even after rebuilding the index on the stats table in the PostgreSQL database, the issue was not fully resolved.

I believe I've made a pretty strong case against the Hive table format and its underlying directory-oriented architecture. Apache Hive served the big data community well for nearly a decade, but it was time to develop something more efficient and scalable.

Transactional Guarantees on Data Lakes

Before presenting the next evolution of table formats, let’s also examine some common challenges associated with implementing database management systems on a data lake backed by distributed file systems like HDFS or object stores like S3.

These challenges are not specific to Hive or any other data management tool but are generally related to the ACID and transactional properties of traditional DBMS systems.

  • Lack of Atomicity — Writing multiple objects simultaneously within a transaction is not natively supported, hindering data integrity.
  • Concurrency Control Challenges — Concurrent modifications to files within the same directory or partition can lead to data loss or corruption due to the absence of transaction coordination.
  • Absence of Transactional Features — Data lakes built on HDFS or object stores lack built-in transaction isolation and concurrency control. Without transaction isolation, readers can encounter incomplete or corrupt data due to concurrent writes.
  • Lack of Record-Level Mutations — The immutable nature of the underlying storage systems prevents direct updates or deletes at the record level in data files.

Hive Transactional Tables

The Hive ACID feature was the first attempt to introduce structured storage guarantees, particularly ACID transactions (Atomic, Consistent, Isolated, Durable), to the realm of immutable data lakes.

Introduced in Hive 0.14 and matured in Hive version 3 (2018), this feature marked a significant leap forward by providing stronger consistency guarantees like cross-partition atomicity and isolation.

But the addition of ACID to Hive didn't solve the fundamental issues, because:

Hive ACID tables remained rooted in the directory-oriented approach, relying on a separate metadata store for managing table-level information within the underlying data lake storage layer.

Several attempts were made by vendors like Hortonworks and Cloudera to integrate Hive ACID into the broader data ecosystem.

Despite these efforts, I would say Hive ACID didn't capture the imagination of the community, and it failed to gain widespread adoption due to its underlying design limitations.

2nd Generation OTF — The Rise of Log-oriented Table Format

Now that we have built a strong case for re-imagining and improving the open table format model, let’s recap and list the major issues we identified with the previous generation of table formats:

  • Tight coupling between physical partitioning and the logical partitioning scheme of the data.
  • Heavy reliance on the file system or object store API for listing files and directories during the query planning phase.
  • Relying on an external metadata store for maintaining table-level information such as schemas, partitions, and column-level statistics.
  • Lack of support for record-level upsert, merge and delete.
  • Lack of ACID and transactional properties.

Let's temporarily set aside the complexities of upsert and ACID transactions to focus on the first three fundamental challenges. Given these constraints, we must consider how to decouple partitioning schemes from physical file layouts, minimise file system API calls for file and partition listings, and eliminate the reliance on an external metadata store.

To address these requirements, we need a data structure capable of efficiently storing metadata about data, partitions, and file listings. This structure must be fast, scalable, and self-contained, with no dependencies on external systems.

One solution to address these requirements is surprisingly simple, though not always the most obvious. Just as Jay Kreps and the engineering team at LinkedIn built Apache Kafka on the foundation of a simple append-only storage abstraction — an immutable log containing sequential records of events ordered by time — can we consider using a similar framework?

So the question is:

If immutable logs can store events representing facts that always remain true in systems like Apache Kafka, can’t we apply the same principle to manage the state of table’s metadata in our case?

By leveraging log files, we can treat all metadata modifications as sequentially ordered events. This aligns with the Event Sourcing data modeling paradigm.

Files and partitions become the units of record for which the metadata layer tracks all state changes in the log. In this design, the metadata logs are the first-class citizens of the metadata layer.

Let's Build a Simple Log-oriented Table

Let’s do a quick practical exercise to understand how we can design our new table format to capture and organise the metadata in log files.

In this exercise we will build a simple log-oriented metadata table format for capturing filesystem and storage-level state changes, such as adding and removing files and partitions, which can provide event log primitives such as strong ordering, versioning, time travel and replaying events to rebuild the state.

To capture storage-level or file system state changes, we need to consider two main file system objects, namely files and directories (i.e. partitions), each with two possible event types: add and remove.

Let's assume a particular table is partitioned in /year=/month=/day= format. In its simplest form, the metadata log can be implemented with the following fields in an immutable log file:

timestamp|object|event type|value
20231015132000|partition|add|/year=2023/month=10/day=15
20231015132010|file|add|/year=2023/month=10/day=15/00001.parquet
20231015132011|file|add|/year=2023/month=10/day=15/00002.parquet
20231015132011|file|add|/year=2023/month=10/day=15/00003.parquet

Managing Metadata Updates

Given the immutable nature of data lake storage systems, metadata logs cannot be continuously appended to. Instead, each update resulting from data manipulation operations requires the creation of a new metadata file.

To maintain sequence and facilitate table state reconstruction, these metadata logs can be sequentially named and organised within a base metadata directory.

/mytable/
/metadata_logs/
000001.log
000002.log
000003.log
000004.log

To rebuild the current table state, the log order in the metadata directory, and possibly the logs’ Timestamp field, can serve as a logical clock.

The query engines can scan the event log sequentially to replay all the metadata state change events in order to rebuild the current snapshot view of the table.
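As a minimal sketch, assuming the pipe-delimited layout shown earlier and a metadata_logs/ directory of sequentially named log files (both hypothetical), replaying the events to rebuild the set of active data files could look like this:

import os

def rebuild_table_state(metadata_dir):
    """Replay add/remove events from sequentially named log files
    and return the set of currently active data files."""
    active_files = set()
    # Sorted file names act as the logical clock that defines replay order.
    for log_name in sorted(os.listdir(metadata_dir)):
        if not log_name.endswith(".log"):
            continue
        with open(os.path.join(metadata_dir, log_name)) as log:
            for line in log:
                parts = line.strip().split("|")
                if len(parts) != 4 or parts[0] == "timestamp":
                    continue  # skip headers or malformed lines
                timestamp, obj, event, value = parts
                if obj != "file":
                    continue  # partition events could be tracked the same way
                if event == "add":
                    active_files.add(value)
                elif event == "remove":
                    active_files.discard(value)
    return active_files

print(rebuild_table_state("/mytable/metadata_logs"))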

Log Compaction

Frequent data updates on large datasets can lead to a proliferation of metadata log files, as each change necessitates a new log entry.

Over time, the overhead of listing and processing these files during state reconstruction can become a performance bottleneck, negating the benefits of decoupling metadata management.

To mitigate this, a periodic compaction process can merge individual log files into a consolidated file. However, for time travel and rollback capabilities, these outdated events must be retained for a specified period.

/mytable/
/metadata_logs/
000001.log
000002.log
000003.log
000004.log
snapshot_000004.log
000005.log

In the above example, the snapshot log snapshot_000004.log has been generated for sequential log files 000001.log to 000004.log containing all the metadata transactions up to that point.
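A compaction job can be sketched in the same spirit; the numeric and snapshot_ naming conventions below are simply the ones assumed in this post, and the last event recorded for each object decides whether it survives into the snapshot:

import os

def compact_logs(metadata_dir, up_to):
    """Fold log files 000001.log up to the given version into a single
    snapshot log containing only the objects that are still active."""
    state = {}  # (object type, value) -> (timestamp, last event type)
    for version in range(1, up_to + 1):
        path = os.path.join(metadata_dir, f"{version:06d}.log")
        with open(path) as log:
            for line in log:
                parts = line.strip().split("|")
                if len(parts) != 4 or parts[0] == "timestamp":
                    continue
                timestamp, obj, event, value = parts
                state[(obj, value)] = (timestamp, event)
    snapshot_path = os.path.join(metadata_dir, f"snapshot_{up_to:06d}.log")
    with open(snapshot_path, "w") as snapshot:
        for (obj, value), (timestamp, event) in state.items():
            if event == "add":  # objects whose last event was a removal are dropped
                snapshot.write(f"{timestamp}|{obj}|add|{value}\n")
    return snapshot_path

The older log files are kept around for the retention period so that time travel and rollback remain possible, and are only garbage-collected afterwards.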

What did we just build?

We’ve successfully designed a foundational log-oriented table format that addresses our initial requirements by using simple, immutable transactional logs to manage table metadata alongside data files.

This approach serves as the bedrock for modern open table formats like Apache Hudi, Delta Lake, and Apache Iceberg.

Essentially:

The modern open table formats provide a mutable table layer on top of immutable data files through a log-based metadata layer. This design offers database-like features such as ACID compliance, upserts, table versioning, and auditing.

By abstracting the physical file layout and tracking the table state (including partitions) at the file level within the metadata layer, these formats decouple logical and physical data organisation using the log-oriented metadata layer as shown below.

Open Table Format Architecture

What about query performance?

In this architecture, the query performance is directly affected by how fast the metadata files can be retrieved and scanned during the query planning phase.

Using the underlying storage's fast sequential I/O to read metadata files provides much better performance than using file system metadata APIs to gather the required information, such as listing all sub-directories (partitions) and files, or retrieving column-level statistics either from the footer section of each data file or from an external metadata engine.

Is this a Novel Design?

The concept of using metadata files to track data files and associated metadata isn’t entirely novel.

Key-value stores like RocksDB and LevelDB employ a similar approach, using manifest files to keep track of SSTables (data segments in LSM-Tree storage model) and their corresponding key ranges.

I wonder if the smart engineers behind the modern open table formats drew any inspiration from the metadata management design of storage systems like RocksDB!

Adding Additional Features

By adopting an event log and event sourcing model, we can readily implement additional valuable primitives:

  • Event Replay — The ability to replay file and directory change event logs up to a specific version.
  • Full State Rebuild — Compute engines can reconstruct the table’s current state and identify active files and partitions by processing the metadata event log.
  • Time Travel — Similar to event-based systems, we can revert to previous table versions using the event log and versioning mechanism.
  • Event-Based Streaming Support — The transactional log inherently functions as a message queue, enabling the creation of streaming pipelines without relying on separate message buses.

For table and column-level statistics, we can leverage our log-based metadata layer to store additional statistical metadata for optimising query performance. This way, we can eliminate external system interactions and extensive file footer scans entirely.

We could follow the same metadata organisation, but use a different naming convention to manage a column stats index. For each new data file loaded, a new delta index log can be generated to save the column stats records.

/mytable/
/metadata_logs/
stats_000001.log
stats_000002.log
stats_000003.log
stats_000004.log
stats_snapshot_000004.log
stats_000005.log
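As a rough illustration of how an engine could use such an index for data skipping, suppose each stats record stores a file path along with the minimum and maximum value of a column in that file (a hypothetical layout); the planner can then prune every file whose value range cannot match the query predicate:

def prune_files(stats_rows, column, lower, upper):
    """Return only the files whose [min, max] range for the given column
    overlaps the predicate range [lower, upper]."""
    candidates = []
    for file_path, col, col_min, col_max in stats_rows:
        if col != column:
            continue
        # Skip the file entirely if its value range cannot satisfy the predicate.
        if col_max < lower or col_min > upper:
            continue
        candidates.append(file_path)
    return candidates

# Only files that may contain amounts between 100 and 200 get scanned.
stats = [
    ("00001.parquet", "amount", 10, 90),
    ("00002.parquet", "amount", 80, 150),
    ("00003.parquet", "amount", 400, 900),
]
print(prune_files(stats, "amount", 100, 200))  # ['00002.parquet']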

For a more detailed discussion of the design, you can refer to the original post on Substack.

Adding ACID Guarantees

A core design objective of open table formats is to enable ACID guarantees through the metadata layer. The new log-structured metadata approach inherently supports functions such as versioning and Snapshot Isolation via MVCC, addressing the previously discussed transaction isolation challenges in data lakes.

To provide Snapshot Isolation, writes can occur in the following two steps:

  1. Optimistically create or replace data files on the underlying storage.
  2. Atomically update the metadata transaction log with the newly added or removed files, generating a new metadata version.
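Here is a minimal sketch of the second step on a local file system, using an exclusive-create as a stand-in for whatever atomic primitive the underlying storage offers (atomic rename, conditional PUT, and so on); the directory layout and naming follow the hypothetical format built earlier in this post:

import os

def commit(metadata_dir, new_version, log_lines):
    """Publish optimistically written data files by atomically creating
    the next metadata log version."""
    log_path = os.path.join(metadata_dir, f"{new_version:06d}.log")
    # O_CREAT | O_EXCL fails if another writer already committed this version,
    # which is how optimistic concurrency control detects a conflicting commit.
    fd = os.open(log_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        with os.fdopen(fd, "w") as log:
            log.write("\n".join(log_lines) + "\n")
    except Exception:
        os.remove(log_path)  # roll back the half-written commit
        raise
    # Readers replaying logs only up to the previous version never see the
    # new files, so every reader observes a consistent snapshot.

A real table format would, on such a conflict, re-validate the transaction against the newly committed version and retry the commit with the next version number.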

This transactional mechanism prevents readers from encountering incomplete or corrupt data, a common issue in the previous generation of table formats. By bypassing file system listing operations, we also eliminate consistency issues like list-after-write on some object stores.

All three major table formats (Hudi, Delta Lake, Iceberg) implement MVCC with snapshot isolation to provide read-write isolation and versioning.

Multi-Write Concurrency can be facilitated through Optimistic Concurrency Control (OCC), which validates transactions before committing to detect potential conflicts.

The Origin of Modern Open Table Format Implementations

As previously discussed, the current generation of open table formats emerged to address the limitations of the previous generation of data management approaches on data lakes, and the foundation of these tools lies in the log-structured metadata organisation explored earlier.

  • Apache Hudi, initiated by Uber in 2016, primarily aimed to enable scalable, incremental upserts and streaming ingestion into data lakes, while providing ACID guarantees on HDFS. Its design is heavily optimised for handling mutable data streams.
  • Apache Iceberg originated at Netflix around 2017 in response to the scalability and transactional limitations of Hive’s schema-centric, directory-oriented table format.
  • Delta Lake, introduced by Databricks in 2017 and open-sourced in 2019, emerged as the third major open table format. Its primary goal was to provide ACID transaction capabilities atop cloud object store-based data lakes.
  • Apache Paimon is another notable and fairly recent open table format, developed by the Apache Flink community in 2022 as the "Flink Table Store". Its main design goal is handling high-throughput, low-latency streaming data ingestion. However, it has yet to gain significant traction in comparison to the dominant trio.

These projects have significantly streamlined data management for users by automating optimisations, compaction, and indexing processes. This relieves data engineers from the burden of complex low-level physical data management tasks.

Industry Adoption

The past few years have witnessed widespread adoption and integration of next-generation open table formats across various data tools and platforms.

All the major open table formats have gained traction and popularity, while a fierce competition for market dominance has played out, driven mainly by the SaaS vendors offering these formats as managed services.

Major cloud providers have also embraced one or more of the big three formats. Microsoft is fully committing to Delta Lake for its latest OneLake and Microsoft Fabric analytics platforms.

Google has been adopting Iceberg as the primary table format for its BigLake platform. Cloudera, a leading Hadoop vendor, has also built its open data lakehouse solution around Apache Iceberg.

Prominent open source compute engines like Presto, Trino, Flink, and Spark now support reading and writing to these open table formats. Additionally, major MPP and cloud data warehouse vendors, including Snowflake, BigQuery, and Redshift, have incorporated support through external table features.

Beyond these tools and platforms, numerous companies have publicly documented their migration to open table formats.

3rd Generation OTF — Unified Open Table Format

The evolution of open table formats has marched on with a new trend since last year: cross-format interoperability.

This exciting development aims to create a unified and universal open table format that seamlessly works with all major existing formats under the hood.

Currently, converting between formats requires metadata translation and data file copying. However, since these formats share a foundation and often use Parquet as the default serialisation format, significant opportunities for interoperability exist.

A uniform metadata layer promises a unified approach for reading and writing data across all major open table formats. Different readers and writers would leverage this layer to interact with the desired format, eliminating the need for manual format-specific metadata conversion or data file duplication.

Unified Open Table Format Layer

The State of the Art

LinkedIn engineers pioneered one of the earliest attempts at a unified table API with OpenHouse introduced in 2022. Built on top of Apache Iceberg, OpenHouse offered a simplified interface for interacting with tables, regardless of their underlying format, through a RESTful Table Service seamlessly integrated with Spark.

While OpenHouse was a great effort, it lacked comprehensive interoperability and format conversion capabilities. Additionally, its open-sourcing in 2024 came relatively late compared to other emerging projects that had already gained significant traction, especially with giant tech companies such as Databricks, Microsoft and Google backing the projects described below.

Apache XTable (formerly known as OneTable), introduced by OneHouse in 2023, provides a lightweight abstraction layer for generating metadata for any supported format using common models for schemas, partitioning details, and column statistics. In terms of metadata layout, XTable stores metadata for each format side-by-side within the metadata layer.

Databricks introduced Delta UniForm in 2023. Delta UniForm automatically generates metadata for Delta Lake and Iceberg tables while maintaining a single copy of shared Parquet data files.

It’s important to note that UniForm, primarily sponsored by Databricks, seems focused on using Delta Lake as the primary format while enabling external applications and query engines to read other formats.

How do they compare?

LinkedIn's OpenHouse project offers more of a control plane than a unified table format layer.

Comparing Apache XTable to Delta Uniform, XTable takes a broader approach, aiming for full interoperability and allowing users to mix and match read/write features from different formats regardless of the primary format chosen.

As an example, XTable could enable incremental data ingestion into a Hudi table (leveraging its efficiency) while allowing data to be read using Iceberg format by query engines like Trino, Snowflake, or BigQuery.

That being said, we’re still in the early phases of development of uniform table format APIs. It will be exciting to see how they progress over the coming months.

Data Lakehouse

That brings us to the last part of this blog post: the concept of the data lakehouse, without which our discussion would be incomplete. Let's define what a data lakehouse stands for:

A data lakehouse represents a unified, next-generation data architecture that combines the cost-effectiveness, scalability, flexibility and openness of data lakes, with the performance, transactional guarantees and governance features typically associated with data warehouses

That definition sounds very similar to what open table formats stand for! That’s because the lakehouse foundation is based on leveraging open table formats for implementing ACID, auditing, versioning, and indexing directly on low-cost cloud storage, to bridge the gap between these two traditionally distinct data management paradigms.

In essence, a data lakehouse enables organisations to treat data lake storage as if it were a traditional data warehouse, and vice versa.

This vision was initially pursued by SQL-on-Hadoop tools aiming to bring data warehousing to Hadoop platforms, but it has only been fully realised recently with advancements in the data landscape.

Non-Open vs Open Data Lakehouse

It’s important to differentiate between a general “data lakehouse” and an “open data lakehouse”.

While top cloud vendors like AWS and Google often label their data warehouse-centric platforms as data lakehouses, their definition is broader.

Their emphasis is on their data warehouses’ ability to store semi-structured data, support external workloads like Spark, enable ML model training, and query open data files — all characteristics traditionally associated with data lakes. These platforms also typically feature decoupled storage and compute architectures.

On the other hand, open data lakehouses primarily leverage open table formats to manage data on low-cost data lake storage.

This architecture promotes higher interoperability and flexibility, allowing organisations to select the optimal compute and processing engine for each job or workload.

By eliminating the need to duplicate and move data across systems, open data lakehouses ensure that all data remains in its original, open format, serving as a single source of truth.

Open Data Lakehouse Architecture

Vendors such as Databricks, Microsoft OneLake, OneHouse, Dremio, and Cloudera have positioned themselves as providers of managed open data lakehouse platforms on cloud.

Conclusion

This post series has covered a lot of ground, taking you on a journey through the evolution of data table formats.

I am personally always interested in understanding how a technology came to be, the major architectural changes and evolutions it underwent, and the design goals and motivations behind it.

I hope you have enjoyed the ride and now have a better understanding of where we are in the technology timeline and how we got here.

This was a scaled-down version of the original two-part series published on the Practical Data Engineering newsletter on Substack.


Alireza Sadeghi

Senior Data Engineer with 16 years of experience building and scaling software and data platforms. My Newsletter: https://practicaldataengineering.substack.com