Internal Storage Design of Modern Key-value Database Engines [Part 1]

Alireza Sadeghi
Aug 14, 2023

In this multi-part article I will describe the physical storage design behind many popular single-node, embedded and distributed key-value stores, such as Amazon DynamoDB, Apache Cassandra, Riak, ScyllaDB, LevelDB and RocksDB. I will start with a simple storage model and progressively introduce enhancements, so that readers with different levels of background knowledge can follow along.

This part is dedicated to providing an overview of the key-value data model and presenting a simple, preliminary physical storage abstraction.

Key-value Data Model

The key-value data model is one of the simplest data models, used in many database engines powering large-scale applications. Large-scale database engines using key-value as their primary storage model have been around for more than a decade and have been getting more attention with the proliferation of NoSQL database engines.

In this model, data is treated as a set of key-value pairs. Keys are unique identifiers, similar to primary keys in the relational data model: they uniquely identify a record and also serve as the index. Values are the attributes or objects containing the actual data.
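As a minimal illustration of this model (a hypothetical interface, not any particular engine’s API), interacting with such a store boils down to putting, getting and deleting values by their key:

    # Minimal, hypothetical key-value interface (Python); not any specific engine's API.
    store = {}

    def put(key, value):
        store[key] = value        # the key is the primary identifier and the only index

    def get(key):
        return store.get(key)     # lookups by anything other than the key need a full scan

    def delete(key):
        store.pop(key, None)

    put("user:42", {"name": "Ada", "email": "ada@example.com"})
    print(get("user:42"))         # {'name': 'Ada', 'email': 'ada@example.com'}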

Key-value stores are usually suitable for applications where records are primarily saved and looked up by a unique key, such as a user ID, rather than by many different attributes as in the relational data model. Therefore secondary keys and indexes are usually not supported in key-value stores. Such storage engines are also inherently unsuitable for OLAP and analytical workloads, which require optimization techniques such as partitioning and secondary indexes.

Brief History of Modern Key-value Stores

Amazon’s Dynamo key-value store (a precursor of today’s DynamoDB [1]) and Google’s BigTable [2] are probably the main pioneers of modern distributed key-value storage engines. In 2007 Amazon published the design behind Dynamo [3], demonstrating a scalable and efficient approach to building highly available key-value database engines optimized for high-throughput, low-latency writes.

Google’s BigTable design, first presented in 2006 and later published as a journal paper in 2008 [4], described a storage layer based on an LSM-tree-style structure (more on this later in this post); together with Dynamo’s distributed design principles, it provided the foundation for the next generation of modern, scalable key-value storage engines.

The first successful open source project inspired by the internal designs demonstrated by Dynamo and BigTable came from Facebook (Meta), which published the design behind its distributed key-value store, Apache Cassandra, in 2010 [5].

With the success and adoption of Apache Cassandra by industry and the open source community, many new key-value stores emerged, inspired by the same internal design principles. The original Dynamo’s physical and distributed-systems principles became so popular that the community, and even the authors, use the term Dynamo-style storage engines to refer to such key-value stores.

Some notable examples of this new generation of key-value stores include:

  • Riak [6], a distributed key-value store developed by Basho Technologies in 2009 and open sourced later on.
  • Google’s LevelDB [7], an embedded key-value store open sourced in 2011, itself inspired by BigTable’s internal storage design.
  • RocksDB [8], built by Facebook in 2012 and open sourced in 2013, in turn inspired by Google’s LevelDB.
  • Other notable open source examples include Voldemort [9], a distributed key-value store developed by LinkedIn (the open source project is currently inactive), and ScyllaDB [10], whose development started in 2014 and which was open sourced in 2015, designed to be compatible with Apache Cassandra and based on its design principles.

Another term used to refer to such storage engines is LSM-tree based engines, after the internal storage design of Google BigTable, which will be the central topic of this article.

Storage Design

In the sections that follow I will go over the internal physical storage design implemented in many modern key-value stores. I will first present a simpler storage model and progressively introduce enhancements until we arrive at the final design.

Log-Structured Data Structure

The main storage design used in many modern key-value stores is based on a hybrid, two-level memory hierarchy composed of two (or more) data structures: one residing in memory and the other residing on disk, as shown in the following diagram.

The in-memory data structure is much smaller than the on-disk one, since it only holds newly written entries until a size threshold is reached, at which point a rolling merge process flushes the entries to the on-disk data structure in a single run. Any read request first checks the in-memory data structure and then falls back to the on-disk data if the key is not found there.

Figure 1 — hybrid two-level memory hierarchy
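A minimal sketch of this two-level hierarchy might look as follows; the class name, the JSON-lines log format and the tiny flush threshold are illustrative assumptions, not any particular engine’s implementation:

    import json

    FLUSH_THRESHOLD = 4  # illustrative; real engines flush after megabytes of buffered data

    class TwoLevelStore:
        """Toy two-level store: a small in-memory buffer backed by an on-disk log."""

        def __init__(self, log_path):
            self.log_path = log_path
            self.memtable = {}            # in-memory structure holding the newest entries

        def put(self, key, value):
            self.memtable[key] = value
            if len(self.memtable) >= FLUSH_THRESHOLD:
                self.flush()              # rolling merge: move buffered entries to disk

        def flush(self):
            # write all buffered entries to the on-disk log in one sequential pass
            with open(self.log_path, "a") as log:
                for key, value in self.memtable.items():
                    log.write(json.dumps({"key": key, "value": value}) + "\n")
            self.memtable.clear()

        def get(self, key):
            if key in self.memtable:      # reads check the in-memory structure first...
                return self.memtable[key]
            latest = None                 # ...then fall back to scanning the on-disk log
            try:
                with open(self.log_path) as log:
                    for line in log:
                        record = json.loads(line)
                        if record["key"] == key:
                            latest = record["value"]   # later entries win
            except FileNotFoundError:
                pass
            return latest

Note how writes never modify existing on-disk data in this sketch: they are only buffered in memory and later appended to the log.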

Both the in-memory and on-disk data structures are append-only, sequential, log-oriented data structures. Such a log-structured storage model was probably first introduced by research at UC Berkeley in 1992, which used the technique to design an efficient file system [11].

The main design goal was to improve write performance: a sequence of small file changes is buffered in an in-memory cache and then written sequentially to the on-disk log structure in a single operation, eliminating almost all random disk seeks. All data is permanently stored in the log files; the log is therefore the only data structure on disk.

Defer and Batching

The main idea behind such a hybrid log-based data structure is the principle of defer and batch: the cost of I/O is reduced by deferring costly random writes, first buffering them in memory and later writing them out sequentially (flushing) to persistent storage in batches, taking advantage of the much more efficient sequential I/O.

Figure 2 — Data Flush

This eliminates the seek and rotational cost of the random I/O required for B-tree updates in conventional database systems, where there is no batching effect and each update requires two I/O operations: one to find and read a leaf node, and another to write the updated node back out. Write throughput is further improved by performing a group commit instead of many small mutations on the disk-resident data structure [4]. As mentioned, the original idea of deferring writes in order to batch-insert multi-page blocks to disk was inspired by the log-structured file system devised by Rosenblum and Ousterhout [11].
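To make the defer-and-batch idea concrete, the sketch below (purely illustrative, with made-up file names) contrasts writing and syncing each record individually with buffering records in memory and flushing them as a single sequential append, in the spirit of a group commit:

    import os

    records = [(f"key{i}", f"value{i}") for i in range(1000)]

    # Unbatched: one small write and one sync (a separate disk I/O) per record.
    with open("unbatched.log", "w") as f:
        for key, value in records:
            f.write(f"{key},{value}\n")
            f.flush()
            os.fsync(f.fileno())      # forces a disk I/O for every single record

    # Deferred and batched: buffer in memory, then flush once, sequentially.
    buffer = [f"{key},{value}\n" for key, value in records]   # the in-memory "defer" step
    with open("batched.log", "w") as f:
        f.write("".join(buffer))      # one large sequential write ("group commit")
        f.flush()
        os.fsync(f.fileno())          # a single sync covers the whole batch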

In the following sections the data structure at each level (main memory and disk) will be explained further, and later the full design will be discussed, putting all the pieces of the puzzle together.

On-Disk Storage Structure

In its simplest form, the on-disk part of this storage model consists of multiple immutable, append-only log files called segments, to which records are appended sequentially. No in-place updates are allowed; all changes, including new writes, updates and deletes, are treated as inserts.
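As a sketch of this idea, a single immutable segment file can record every change, including updates and deletes, as a newly appended entry; the JSON-lines format and the tombstone convention below are illustrative assumptions rather than a specific engine’s on-disk format:

    import json

    def append(segment_path, key, value):
        """Append a change to the segment; existing entries are never modified."""
        with open(segment_path, "a") as segment:
            segment.write(json.dumps({"key": key, "value": value}) + "\n")

    def delete(segment_path, key):
        # deletes are also inserts: a special "tombstone" entry marks the key as removed
        append(segment_path, key, None)

    append("segment-0001.log", "user:42", {"name": "Ada"})
    append("segment-0001.log", "user:42", {"name": "Ada Lovelace"})  # update = new entry
    delete("segment-0001.log", "user:42")                            # delete = tombstone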

Using an immutable, append-only data structure offers the following advantages [12,13]:

  • Immutable data structures provide a simpler programming model, since only a few operations apply to them, as opposed to mutable structures, which require finding a record and performing an in-place update.
  • Random writes are fast: immutable storage structures are optimized for write performance (as opposed to mutable structures, which are optimized for reads), since all writes are appended sequentially with no in-place lookup-and-replace overhead.
  • Challenges associated with mutable data, such as crash recovery, concurrency control, and hierarchical locks and latches, are largely avoided, since read and write operations do not intersect and segment locking is not needed.
  • Thanks to the sequential data layout in segment files, the fragmentation present in mutable structures is avoided.

However, as with any design choice, there are tradeoffs and implications to using an immutable data structure, such as:

  • Since in-place updates are not allowed, multiple versions of the same key or record may exist, which requires reconciling and merging those versions during reads or during the compaction process.
  • Random reads and lookup operations are less efficient, due to the reconciliation required to return the most recent version of the requested data, often by checking multiple files. Additional data structures and techniques, such as Bloom filters, need to be implemented to improve random read latency.
  • A housekeeping operation is required to periodically perform compaction, merging and reconciling the immutable segments, removing redundant data and freeing up unused space (see the sketch after this list).
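A minimal compaction sketch, using the same illustrative segment format as above: segments are read from oldest to newest, the latest version of each key wins, and keys whose latest entry is a tombstone are dropped:

    import json

    def compact(segment_paths, output_path):
        """Merge segments into one, keeping only the latest version of each key."""
        latest = {}
        for path in segment_paths:            # process segments from oldest to newest
            with open(path) as segment:
                for line in segment:
                    record = json.loads(line)
                    latest[record["key"]] = record["value"]   # newer entries overwrite older ones

        with open(output_path, "w") as output:
            for key, value in latest.items():
                if value is not None:         # drop keys whose latest entry is a tombstone
                    output.write(json.dumps({"key": key, "value": value}) + "\n")

    # example usage:
    # compact(["segment-0001.log", "segment-0002.log"], "segment-compacted.log")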

Considering these tradeoffs, such a data structure is still an attractive design for data-intensive applications that are dominated by heavy random-write workloads and fit the key-value data model.

Log Segments

As mentioned in the previous section, so far our simple on-disk storage design consists of a sequential, append-only log divided into smaller, manageable chunks called segments. Breaking the log into smaller segments has the following advantages:

  • Avoiding a single, ever-growing log file that could eventually run out of space
  • Better storage management with suitably sized, fixed-size files
  • Easier partition distribution in distributed storage systems

So far in our key-value storage design we have immutable on-disk log segments: records are appended to the current segment, and when it reaches a record-count or size threshold a new segment is created:

Figure 3 — Log segmentation
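A sketch of segment rollover under an assumed size threshold (the 4 MB limit and the file naming below are arbitrary choices for illustration):

    import os

    MAX_SEGMENT_BYTES = 4 * 1024 * 1024   # arbitrary threshold for illustration

    class SegmentedLog:
        """Append-only log split into fixed-size segment files."""

        def __init__(self, directory):
            self.directory = directory
            self.segment_id = 0
            os.makedirs(directory, exist_ok=True)

        def _current_path(self):
            return os.path.join(self.directory, f"segment-{self.segment_id:04d}.log")

        def append(self, line):
            path = self._current_path()
            # roll over to a new segment once the current one reaches the size threshold
            if os.path.exists(path) and os.path.getsize(path) >= MAX_SEGMENT_BYTES:
                self.segment_id += 1
                path = self._current_path()
            with open(path, "a") as segment:
                segment.write(line + "\n")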

However, while our preliminary design can meet the needs of write-intensive workloads in terms of throughput and latency, simply by flushing records in batches to the underlying log segments sequentially and eliminating the need for random I/O seek operations, it fails to deliver acceptable key lookup performance: in the worst case, all segments have to be scanned to look up a key in our on-disk data structure. Key lookups and read operations will therefore be very slow, especially over large datasets, without additional optimization and indexing techniques. Additionally, key-range lookup requests are not readily supported.
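The lookup problem is easy to see in a naive read path over the illustrative segment layout used above: segments are scanned from newest to oldest, and in the worst case every segment file is read in full before concluding that a key is absent:

    import glob
    import json

    _NOT_FOUND = object()

    def lookup(directory, key):
        """Scan segments newest to oldest; the worst case touches every segment file."""
        for path in sorted(glob.glob(f"{directory}/segment-*.log"), reverse=True):
            found = _NOT_FOUND
            with open(path) as segment:
                for line in segment:                # full scan of each segment file
                    record = json.loads(line)
                    if record["key"] == key:
                        found = record["value"]     # later entries in a segment are newer
            if found is not _NOT_FOUND:
                return found                        # may be None if the key was tombstoned
        return None                                 # key absent: every segment was scanned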

Therefore we need an improved storage design to overcome these limitations. The improvement proposed by Google BigTable for the on-disk storage design will be presented in the next part.

References

[1] https://aws.amazon.com/dynamodb/

[2] https://cloud.google.com/bigtable

[3] DeCandia, Giuseppe, et al. “Dynamo: Amazon’s highly available key-value store.” ACM SIGOPS operating systems review 41.6 (2007): 205–220.

[4] Chang, Fay, et al. “Bigtable: A distributed storage system for structured data.” ACM Transactions on Computer Systems (TOCS) 26.2 (2008): 1–26.

[5] Lakshman, Avinash, and Prashant Malik. “Cassandra: a decentralized structured storage system.” ACM SIGOPS operating systems review 44.2 (2010): 35–40.

[6] https://riak.com/products/riak-kv/index.html

[7] https://github.com/google/leveldb

[8] https://rocksdb.org/

[9] https://www.project-voldemort.com/voldemort/

[10] https://www.scylladb.com/

[11] Rosenblum, Mendel, and John K. Ousterhout. “The design and implementation of a log-structured file system.” ACM Transactions on Computer Systems (TOCS) 10.1 (1992): 26–52.

[12] Kleppmann, Martin. Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. O’Reilly Media, 2017.

[13] Petrov, Alex. Database Internals: A deep dive into how distributed data systems work. O’Reilly Media, 2019.
