Techniques For Periodically Extracting Data From Relational Databases

Alireza Sadeghi
10 min read · Sep 19, 2023

Extracting data from relational databases such as those used in Business Support Systems (BSS) is a common ETL and data extraction pattern in analytical data platforms. It’s the job of data engineers to choose the best approach to building RDBMS data ingestion pipelines, whether integrating data into a central data lake, data warehouse, or data lakehouse.

There are many factors to consider when deciding how to extract data from the source system in the most efficient and optimal way. I prefer to distinguish how data is extracted from the source from how data is loaded into the target platform, as I see them as two separate problems to solve. This article covers common data extraction patterns focusing on how the data is extracted from a relational database, usually in batch or micro-batch mode. The logic and patterns of how the data is loaded and reconciled in the target data repository once it is extracted from the source are not fully addressed in the presented techniques.

The decision of which extraction pattern to use for each dataset/table is largely determined by the source table profile and its properties.

Main Criteria

  • Type of the dataset (fact, dimension, immutable log, etc.)
  • Size of the table (data volume, depth in rows and width in columns)
  • Availability of metadata or control columns such as incremental unique ID, created timestamp and modified timestamp
  • The rate of insert and update (how fast the data changes)

Other criteria that should also be considered:

  • Data Latency and data freshness requirement of the business
  • Availability and cost of storage and compute resources
  • Source database engine resource availability

Techniques for Detecting Changed Data

When dealing with immutable insert-only data (facts in data warehouse terms) such as transactional records which are not modified after insert, the logic for extracting data is simpler: we don’t have to worry about updates and deletes, and we only need to detect new inserts. However, when dealing with mutable data (dimensions in data warehouse terms) in relational databases, in addition to detecting new inserts we also need to detect changed records, such as those updated or deleted.

There are different techniques for detecting changed data in relational databases:

(1) Query-based CDC — Using Audit Columns

  • Identify changed data using SQL queries on the source engine by filtering based on an audit column.
  • These columns are usually auto-populated either by the front-end systems or by database triggers. Examples are incremental IDs, create_date and modified_timestamp fields. The integrity of these columns is important and must be verified.
  • If the columns are updated by the front-end system and not by database triggers, there is a risk of records being manually changed by back-end developers or DBAs without the audit columns being updated.
  • The existence of NULL values can also raise concerns about the integrity of these columns.
  • Once the integrity of the audit column is verified, one of the most common approaches is to extract all records whose modified timestamp is greater than the maximum timestamp of the previous load, in an incremental fashion, which is discussed further in the next section (see the sketch after this list). However, we must always approach such audit columns with caution and never assume they are consistent and reliable until proven.
  • For immutable insert-only tables, the timestamp column or an incremental row ID is good enough to be used as the “maximum value column” to extract new records on each incremental load.
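As a minimal sketch of this query-based approach, the snippet below pulls only the rows whose audit column advanced past the previous run’s high-water mark. It assumes a hypothetical orders table with a modified_at column and a psycopg2-style DB-API connection; the table, column and function names are illustrative rather than a prescribed implementation.

```python
# Hedged sketch of query-based CDC using an audit column.
# Assumes `conn` is a psycopg2-style DB-API connection to the source database
# and that the previous high-water mark is persisted by the pipeline between runs.

def extract_changed_rows(conn, last_max_modified):
    """Return rows changed since the last run plus the new high-water mark."""
    cur = conn.cursor()
    cur.execute(
        "SELECT id, status, amount, modified_at "   # hypothetical columns
        "FROM orders "                               # hypothetical source table
        "WHERE modified_at > %s "                    # audit column filter
        "ORDER BY modified_at",
        (last_max_modified,),
    )
    rows = cur.fetchall()
    # Because of ORDER BY, the last row carries the new maximum modified_at.
    new_max = rows[-1][3] if rows else last_max_modified
    return rows, new_max
```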

Pros

  • Straightforward and flexible using already familiar SQL semantics
  • No additional configuration or dependencies are needed on the source system

Cons

  • This technique doesn’t capture physical DELETEs; only INSERTs and UPDATEs are detected
  • Extra load on the database engine, which can cause performance issues for the front-end systems accessing it
  • It is possible that not all changes are captured (e.g. a record manually updated at the database level without the timestamp being updated)

(2) Trigger-based CDC — Using Change-log Table/Files

  • If audit columns are not available, it is possible to develop new logic on the database back end to capture inserts, updates and deletes by means of database triggers for the set of required tables, and record those changes in a change-log table or file, as sketched below.
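The sketch below illustrates one way such a trigger could look, assuming a PostgreSQL source (version 11+) and hypothetical orders and orders_changelog tables. It is only an illustration of the idea, not the only way to implement it; the DDL is kept as a string so it can be applied once by a migration script or a DBA.

```python
# Hedged sketch of trigger-based change capture (PostgreSQL-flavoured DDL,
# hypothetical table and function names). Every insert/update/delete on `orders`
# is recorded in a change-log table that the extraction job can query later.

CHANGELOG_DDL = """
CREATE TABLE IF NOT EXISTS orders_changelog (
    change_id   BIGSERIAL PRIMARY KEY,
    operation   TEXT   NOT NULL,              -- 'INSERT' | 'UPDATE' | 'DELETE'
    order_id    BIGINT NOT NULL,
    changed_at  TIMESTAMP NOT NULL DEFAULT now()
);

CREATE OR REPLACE FUNCTION log_orders_change() RETURNS trigger AS $$
BEGIN
    IF (TG_OP = 'DELETE') THEN
        INSERT INTO orders_changelog (operation, order_id) VALUES (TG_OP, OLD.id);
        RETURN OLD;
    ELSE
        INSERT INTO orders_changelog (operation, order_id) VALUES (TG_OP, NEW.id);
        RETURN NEW;
    END IF;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_cdc
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW EXECUTE FUNCTION log_orders_change();
"""

def install_changelog(conn):
    """Apply the change-log DDL once, e.g. from a migration step (psycopg2 connection)."""
    with conn.cursor() as cur:
        cur.execute(CHANGELOG_DDL)
    conn.commit()
```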

Pros

  • All changes can be captured (INSERT, DELETE, UPDATE)

Cons

  • Triggers have to be defined for every required table, which introduces additional processing and storage cost on the source database
  • The change-log table has to be queried to extract changes which puts load on the source engine.

(3) Log-based Change-Data-Capture (CDC)

  • Using database redo/transaction logs or third-party change-data-capture libraries and tools, it is possible to identify the new and changed records for the tables of interest. This method is usually used in event-driven and streaming workflow patterns (see the consumer sketch below).
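As a hedged sketch of what consuming such change events might look like, the snippet below reads Debezium-style messages from a Kafka topic using the kafka-python client. The topic name, event envelope and field names are assumptions based on common Debezium defaults and will vary with the connector configuration.

```python
# Hedged sketch: consuming log-based CDC events published to Kafka by a tool
# such as Debezium. Topic naming and event shape are assumptions.
import json
from kafka import KafkaConsumer  # kafka-python client

consumer = KafkaConsumer(
    "dbserver1.public.orders",                      # assumed <prefix>.<schema>.<table> topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v) if v else None,
)

for message in consumer:
    event = message.value
    if event is None:                               # tombstone message, skip
        continue
    payload = event.get("payload", event)           # envelope depends on converter settings
    op = payload.get("op")                          # 'c' insert, 'u' update, 'd' delete, 'r' snapshot read
    before, after = payload.get("before"), payload.get("after")
    # Apply the change to the target dataset here (insert/update/delete).
```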

Pros

  • Captures all the changes (INSERT, DELETE, UPDATE)
  • No additional processing load on the source database engine
  • Decoupling data ingestion pipelines from the source database systems

Cons

  • More complex to implement and reason about
  • Not all database systems support log-based CDC (or they require an additional license)
  • Usually an additional third-party CDC tool (e.g. Debezium) and a message broker (e.g. Kafka) need to be included in the data framework, which means additional complexity, maintenance and monitoring are added to the platform.

Data Extraction Patterns

Once we have decided how to detect changes in the upstream database system, we can look at the possible extraction patterns for building the required ETL pipelines.

(1) Periodic Full Reload

(Also referred to as truncate/reload in Data warehousing terms)

This technique is as simple as taking a periodic full snapshot of the source table and replacing the target dataset. It is one of the simplest techniques for extracting mutable data and is not even concerned with how to detect changes in the source table. However, it is only suitable under specific circumstances, such as when dealing with small, slow changing dimensional data that is not updated too frequently.
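A minimal sketch of such a truncate/reload job is shown below, assuming psycopg2-style connections to the source and target and a small hypothetical product_dim table; it is illustrative only.

```python
# Hedged sketch of a periodic full reload (truncate/reload) for a small table.
# `source_conn` and `target_conn` are assumed psycopg2-style DB-API connections.

def full_reload(source_conn, target_conn):
    src = source_conn.cursor()
    src.execute("SELECT id, name, category, updated_at FROM product_dim")  # hypothetical table
    rows = src.fetchall()

    tgt = target_conn.cursor()
    tgt.execute("TRUNCATE TABLE product_dim")       # replace the target dataset wholesale
    tgt.executemany(
        "INSERT INTO product_dim (id, name, category, updated_at) "
        "VALUES (%s, %s, %s, %s)",
        rows,
    )
    target_conn.commit()
```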

Pros:

  • Simple and easy to implement
  • Effective for small tables and slow changing dimensions
  • Deletes are captured automatically
  • Schema evolution is automatically supported

Cons:

  • Scalability: load performance degrades as the source table grows larger
  • Long processing times and degraded ETL job performance for large tables
  • History of data changes is not preserved

It must be noted that this method doesn’t preserve historical changes of the source table, as it reloads the data every time the pipeline is executed.

(2) Periodic Partial Reload

To avoid performing a full truncate/reload by taking a full snapshot containing old and historical data on each pipeline execution, for larger tables we can reduce the volume of the snapshot to only contain recent data, such as the last 6 or 12 months. This extraction pattern relies on the availability of a timestamp attribute that can be used to filter recent records. There will still be a one-time job to extract all the historical data at the beginning; only thereafter is the partial snapshot extraction scheduled.
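A hedged sketch of this pattern is below: only the last six months are re-extracted and that same window is replaced on the target. Table and column names are illustrative, and the interval syntax assumes a PostgreSQL-style engine.

```python
# Hedged sketch of a periodic partial reload: re-extract a recent window and
# swap out that window on the target. Names and SQL dialect are assumptions.

WINDOW_PREDICATE = "created_at >= current_date - INTERVAL '6 months'"  # constant, not user input

def partial_reload(source_conn, target_conn):
    src = source_conn.cursor()
    src.execute(f"SELECT id, amount, created_at FROM transactions WHERE {WINDOW_PREDICATE}")
    rows = src.fetchall()

    tgt = target_conn.cursor()
    tgt.execute(f"DELETE FROM transactions WHERE {WINDOW_PREDICATE}")   # drop the stale window
    tgt.executemany(
        "INSERT INTO transactions (id, amount, created_at) VALUES (%s, %s, %s)",
        rows,
    )
    target_conn.commit()
```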

Pros:

  • Avoids repeatedly extracting large unmodified historical data
  • Effective for small to medium tables with a large volume of historical records

Cons:

  • Adds additional ingestion complexity for de-duplication and reconciliation compared to the full snapshot method
  • Relies on a “modified timestamp” column to be always correct when a record is added or modified on the source
  • Deletes are only captured for the snapshot period covered
  • Schema validation might be required

You might ask why one would choose this approach instead of simply implementing an incremental pipeline that only extracts the latest changed data (presented as another technique in this article). There are scenarios where extracting data over a fixed period is more suitable and reliable, for example when a large table is partitioned only yearly or monthly: using the monthly partition as the predicate can provide a more efficient and faster ETL operation than filtering on a non-partitioned timestamp column when reading data from the source engine.

Additionally, when there is a degree of doubt about the reliability of the source system, and a risk that system admins or DBAs might manually change recent data in the back end when corruption or data quality issues occur, one might decide to extract data over a longer period as protection against such incidents.

(3) Periodic Snapshots

This technique is similar to Technique (1), with the difference that instead of truncating and reloading the target dataset, each time the full snapshot is extracted from the source system it is inserted into a new versioned partition in the target data repository.

This method is simple and, compared to the “Periodic Full Reload” method, provides the additional advantage of preserving history. However, it comes at the cost of significant data duplication and redundancy. It is therefore not suitable for large or fast changing datasets and dimensions.
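As a hedged sketch, each run below appends the full extract under a new snapshot_date value instead of truncating the target; the tables and columns are hypothetical.

```python
# Hedged sketch of a periodic snapshot load: append the full extract to a
# snapshot table keyed by snapshot_date rather than replacing the target.
from datetime import date

def load_snapshot(source_conn, target_conn):
    snapshot_date = date.today().isoformat()

    src = source_conn.cursor()
    src.execute("SELECT id, name, status FROM customer_dim")   # hypothetical source table
    rows = src.fetchall()

    tgt = target_conn.cursor()
    tgt.executemany(
        "INSERT INTO customer_dim_snapshots (snapshot_date, id, name, status) "
        "VALUES (%s, %s, %s, %s)",
        [(snapshot_date, *row) for row in rows],
    )
    target_conn.commit()
```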

Pros:

  • Simple and easy to implement
  • Effective for small tables and slow changing dimensions
  • Deletes are captured automatically
  • Historical data changes are preserved
  • Good performance on read and write

Cons:

  • Data duplication and storage inefficiency is the main downside
  • Not suitable for fast changing dimensions
  • Dimension needs to be created from one process only
  • Requirements to refresh dimensions on a more frequent cadence than daily will also amplify duplication

This technique requires either computing the current state at query time, or relying on a separate pipeline to build the current state in a separate dataset, therefore maintaining two datasets: one containing all the history with duplicated data, and one containing only the current state.

Because a large amount of duplicate data is maintained, data retention policies should usually be in place to periodically discard older snapshots and reclaim space.
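For completeness, deriving the current state at query time can be as simple as selecting the latest snapshot, as in the illustrative query below (table and column names follow the sketch above).

```python
# Illustrative query: reconstruct the current state from the snapshot table by
# keeping only the most recent snapshot_date.
CURRENT_STATE_SQL = """
SELECT *
FROM customer_dim_snapshots
WHERE snapshot_date = (SELECT MAX(snapshot_date) FROM customer_dim_snapshots)
"""
```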

(4) Incremental Inserts

Provided that the source table is immutable, meaning it is an insert-only table with no updates or modifications, and there is an incremental row ID/primary key or insert timestamp, new records can be ingested into the central data platform incrementally using the maximum value column.

This is one of the most common approaches for incremental ingestion of transactional data. It’s a straightforward technique, and since we don’t have to worry about updates, we only need to keep track of the last extracted record from the source.
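A hedged sketch of this pattern follows, assuming a hypothetical immutable events table with an incremental id and a small load_state table on the target that persists the last loaded id between runs.

```python
# Hedged sketch of incremental inserts driven by a "maximum value column".
# Assumes a load_state row for 'events' already exists on the target.

def incremental_insert(source_conn, target_conn):
    tgt = target_conn.cursor()
    tgt.execute("SELECT last_loaded_id FROM load_state WHERE table_name = 'events'")
    last_id = tgt.fetchone()[0]

    src = source_conn.cursor()
    src.execute(
        "SELECT id, event_type, created_at FROM events WHERE id > %s ORDER BY id",
        (last_id,),
    )
    rows = src.fetchall()
    if not rows:
        return

    tgt.executemany(
        "INSERT INTO events (id, event_type, created_at) VALUES (%s, %s, %s)",
        rows,
    )
    tgt.execute(
        "UPDATE load_state SET last_loaded_id = %s WHERE table_name = 'events'",
        (rows[-1][0],),
    )
    target_conn.commit()
```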

Pros:

  • Reduces ingestion workload as only new records are extracted from the source
  • Suitable for large immutable log-based datasets

Cons:

  • It relies on and assumes the source system’s maximum value column, such as an incremental row ID/primary key or create-date timestamp, to always be correct and populated
  • Deletes are not captured, if that matters to the target systems and downstream data consumers

(5) Incremental Upserts

Incremental inserts work when we are dealing with immutable insert-only data such as transactional data. For mutable data, however, provided there is a modified timestamp column on the source table, we can still use the same “audit column” technique to identify and ingest only the new and changed records into the data lake on an incremental basis.

The only difference is that instead of an insert, an upsert has to be performed on the target dataset to eliminate duplicates. There are different upsert and de-duplication techniques, such as using a SQL upsert/merge statement, which won’t be covered in depth in this article.
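As a hedged sketch, changed rows pulled with the audit-column filter from earlier can be merged into the target with a PostgreSQL-style INSERT ... ON CONFLICT; other engines would use a MERGE statement instead, and all names here are illustrative.

```python
# Hedged sketch of the upsert step on the target (PostgreSQL-style syntax).
# `changed_rows` is the output of an audit-column extract such as the one in
# the query-based CDC sketch earlier in the article.

UPSERT_SQL = """
INSERT INTO customer_dim (id, name, status, modified_at)
VALUES (%s, %s, %s, %s)
ON CONFLICT (id) DO UPDATE
SET name        = EXCLUDED.name,
    status      = EXCLUDED.status,
    modified_at = EXCLUDED.modified_at
"""

def incremental_upsert(target_conn, changed_rows):
    cur = target_conn.cursor()
    cur.executemany(UPSERT_SQL, changed_rows)
    target_conn.commit()
```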

Pros:

  • Reduces ingestion workload as only new and changed records are extracted from the source
  • Suitable for large and fast changing tables

Cons:

  • Extra logic has to be added to the pipeline for de-duplication and replacement of the changed records in the target dataset. Modern table formats such as Apache Hudi or Apache Iceberg can take care of this logic under the hood.
  • It relies on the assumption that the modified timestamp column is always updated when a record is changed
  • Deletes are not captured automatically if required

(6) Hybrid — Incremental Insert/Upsert + Periodic Reload

In scenarios where an incremental insert or upsert pattern is required but the audit or maximum value column cannot be fully trusted to always be updated automatically, a hybrid pattern can be employed: an incremental job is scheduled every x hours to ingest new/updated records from the source, and another periodic job with higher latency, such as daily or weekly, is scheduled to ingest a full or partial snapshot of the source data to rectify any missing or outdated records left by the incremental job.

Of course, the business has to accept the risk that, until the full snapshot is ingested at its scheduled period, the incrementally ingested data might not be 100% accurate. It can still be a viable option in environments where strict governance over source data is not exercised and manual modification of records is possible, which presents the risk of missing the most recent version of modified data on the target system.

The following figure shows an example of performing an hourly incremental load while truncating and reloading the target dataset with a full snapshot on a daily basis.
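A hedged sketch of how the two jobs could be wired together is below. The function names reuse the earlier sketches, the high-water-mark helpers are hypothetical, and the scheduler (cron, Airflow, etc.) is deliberately left open.

```python
# Hedged sketch of the hybrid pattern: an hourly incremental job plus a daily
# reconciliation job. extract_changed_rows, incremental_upsert and full_reload
# refer to the earlier sketches; the high-water-mark helpers are hypothetical.

def hourly_incremental_job(source_conn, target_conn):
    """Scheduled every hour, e.g. cron `0 * * * *`."""
    last_mark = load_high_water_mark()                      # hypothetical state helper
    rows, new_mark = extract_changed_rows(source_conn, last_mark)
    incremental_upsert(target_conn, rows)
    save_high_water_mark(new_mark)                          # hypothetical state helper

def daily_reconciliation_job(source_conn, target_conn):
    """Scheduled off-peak once a day, e.g. cron `0 2 * * *`."""
    full_reload(source_conn, target_conn)                   # truncate/reload full snapshot
```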

Pros:

  • Reduces the workload by avoiding full snapshot ingestion of large source tables within short schedule periods
  • Mitigates the mentioned data integrity risks related to incremental insert or upsert

Cons:

  • More complex to implement due to having to manage two distinct pipelines and the dependencies between them

Conclusion

The presented techniques are primarily used when batch and micro-batch data ingestion patterns are needed. That being said, there are other mechanisms, such as log-based Change Data Capture (CDC), for ingesting data from relational databases into the central data platform in more real-time, event-based architectures. However, even when using event-based architectures there might still be scenarios where the source system doesn’t support CDC or it’s not a viable option, and the engineers need to look at alternative techniques for extracting and loading data from such source systems into the central data repository.
