Change Data Capture – What, Why, How

Change Data Capture Overview

Change Data Capture (CDC) refers to identifying and tracking changes made to the data in a database. CDC is used to extract changes from online transaction processing (OLTP) systems and transport them to downstream systems such as data warehouses, data lakes, and stream processors.

In essence, CDC is implemented by writing changes to immutable transaction logs and then providing a mechanism to read from those logs. Recording transactions to a log minimizes the performance impact on the transactional database, and enabling and configuring CDC requires no changes to tables or application code.
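The log-based mechanism can be illustrated with a minimal sketch (plain Python with hypothetical names, not any vendor’s actual API): the database appends immutable change records to a log, and downstream readers consume from their own offsets without touching the source tables.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: records are immutable once written
class ChangeRecord:
    lsn: int        # log sequence number (position in the log)
    operation: str  # "insert", "update", or "delete"
    table: str
    row: dict

class TransactionLog:
    """Append-only log; the source database only ever appends here."""

    def __init__(self):
        self._records = []

    def append(self, operation, table, row):
        rec = ChangeRecord(lsn=len(self._records), operation=operation,
                           table=table, row=dict(row))
        self._records.append(rec)
        return rec.lsn

    def read_from(self, offset):
        """Readers track their own offsets; reading never blocks writers."""
        return self._records[offset:]

# Writer side: normal OLTP activity appends change records.
log = TransactionLog()
log.append("insert", "orders", {"id": 1, "total": 30})
log.append("update", "orders", {"id": 1, "total": 35})

# Reader side: a downstream consumer replays the log from its last offset.
events = log.read_from(0)
```

A real CDC reader would persist its offset durably so it can resume after a failure without missing or re-reading events.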

Let’s consider a high-level example. In an e-commerce application, all new, updated, or deleted order transactions would be recorded via CDC, enabling streaming applications to consume the changes and publish them to downstream analytic systems, recommendation engines, and fraud detection processors. In this example, CDC is implemented without degrading the experience of the e-commerce end user.

In the following sections, we’ll walk through a few vendor examples.

Change Data Capture Vendor Examples

Change Data Capture is freely available out of the box in databases such as Microsoft SQL Server, Oracle, PostgreSQL, and MySQL.

In Microsoft SQL Server, CDC records insert, update, and delete activity on a SQL Server table in a detailed format [1]. Column data, along with metadata, is captured for the modified rows and stored in append-only change tables. Table-valued functions give users systematic access to these change tables. Records within the change tables are immutable and therefore provide the same value as an immutable log.
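As a rough illustration of the change-table shape, here is a Python simulation (not T-SQL); only the __$operation codes come from SQL Server’s documentation [1], which defines 1 = delete, 2 = insert, 3 = update (before image), 4 = update (after image).

```python
# Sketch of a SQL Server-style change table (illustrative only, not T-SQL).
# SQL Server tags each change row with a __$operation code:
#   1 = delete, 2 = insert, 3 = update (before image), 4 = update (after image)
DELETE, INSERT, UPDATE_BEFORE, UPDATE_AFTER = 1, 2, 3, 4

change_table = []  # append-only, like a cdc.<capture_instance>_CT table

def capture_insert(row):
    change_table.append({"__$operation": INSERT, **row})

def capture_update(before, after):
    # An update produces two rows: the before image and the after image.
    change_table.append({"__$operation": UPDATE_BEFORE, **before})
    change_table.append({"__$operation": UPDATE_AFTER, **after})

def capture_delete(row):
    change_table.append({"__$operation": DELETE, **row})

# A row's full lifecycle produces four immutable change rows.
capture_insert({"id": 7, "status": "new"})
capture_update({"id": 7, "status": "new"}, {"id": 7, "status": "shipped"})
capture_delete({"id": 7, "status": "shipped"})
```

Because both images of an update are kept, a consumer can reconstruct what each row looked like before and after every modification.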

Batch-based extraction, transformation, and loading (ETL) and/or streaming ELT applications may utilize the captured CDC data. Let’s look at another example.

In Oracle, it is possible to capture and publish changed data in synchronous and asynchronous modes [2]. In synchronous mode, change data is captured as part of the transaction that modifies the source table, using triggers on the source database, so changes are captured in real time. SYNC_SOURCE is a single, predefined synchronous change source that cannot be altered. Synchronous mode is cost-efficient, though it adds overhead to the source database at capture time. In asynchronous mode, change data is captured from the database redo log files after changes have been made to the source database.

In MySQL, CDC is available via the binary log (binlog), which was originally used for auditing and for copying data to other MySQL servers [3]. However, the binlog may also be consumed outside of MySQL, for example to process CDC events and save them to downstream analytics systems.

In PostgreSQL, change data capture is possible through either the transaction log or triggers. With the transaction log approach, all write transactions (i.e., INSERT, UPDATE, DELETE, and DDL statements) are written to the write-ahead log (WAL) before the transaction result is returned to the client [4]. The WAL can then be consumed as a stream of CDC events.
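The trigger-based flavor can be sketched as follows (plain Python with hypothetical names): a callback fires on every write and records the old and new row images to an append-only audit structure, analogous to a row-level trigger writing to a change table.

```python
# Trigger-style CDC sketch (illustrative Python, not actual PostgreSQL
# triggers): every write to the table fires a callback that records the
# old and new row images to an append-only audit list.
audit = []

def row_trigger(op, old, new):
    """Fired on every write, like an AFTER INSERT/UPDATE/DELETE trigger."""
    audit.append({"op": op, "old": old, "new": new})

class Table:
    def __init__(self):
        self.rows = {}

    def upsert(self, key, row):
        old = self.rows.get(key)
        self.rows[key] = row
        row_trigger("UPDATE" if old else "INSERT", old, row)

    def delete(self, key):
        old = self.rows.pop(key, None)
        row_trigger("DELETE", old, None)

t = Table()
t.upsert(1, {"qty": 2})
t.upsert(1, {"qty": 3})
t.delete(1)
```

The trade-off versus the WAL approach is visible even in the sketch: the trigger runs inside the write path, adding work to every transaction, whereas a WAL reader consumes changes after the fact.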

When to Consider Change Data Capture

Change data capture plays a significant role in streaming data processing and pipelines. As data volumes grow rapidly, CDC techniques become crucial for feeding analytics and near-real-time workloads such as machine learning and artificial intelligence (AI).

Problems arise when long-running, taxing analytic queries run against OLTP systems and degrade overall application performance. CDC-based technologies allow users to automatically capture database mutations such as inserts, updates, and deletions, process them in streaming applications, and/or store them in destinations better suited for analytic queries.

CDC implementations contain the metadata necessary to understand the changes made. CDC technologies reduce cost, improve data quality and accuracy, and provide a mechanism for creating streaming architectures. CDC is a solution for continuous, accelerating growth in data volumes, reducing load times, resources, and cost.
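To make the separation concrete, here is a minimal sketch (plain Python; the event shape is hypothetical, not a specific vendor’s format) of replaying a CDC event stream to keep an analytics-side replica in sync, so analytic queries never touch the OLTP tables.

```python
# Replay CDC events against an analytics-side replica. The OLTP database
# serves transactions; analytic queries run against `replica` instead.
def apply(replica, event):
    """Apply one change event (hypothetical shape) to the replica."""
    op, key = event["op"], event["key"]
    if op == "delete":
        replica.pop(key, None)
    else:  # insert or update both carry the full new row image here
        replica[key] = event["row"]
    return replica

events = [
    {"op": "insert", "key": 1, "row": {"total": 30}},
    {"op": "update", "key": 1, "row": {"total": 35}},
    {"op": "insert", "key": 2, "row": {"total": 12}},
    {"op": "delete", "key": 1},
]

replica = {}
for e in events:
    apply(replica, e)
```

Because each event is applied idempotently by key, the replica converges to the source table’s current state regardless of how far behind the consumer starts.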

A summary list of change data capture advantages:

  • CDC implementation does not require application changes.
  • CDC only requires configuring the database and the captured objects.
  • CDC enables the user to identify change history.
  • CDC enables the user to add context information to every DML operation if required.
  • CDC has an auto-cleanup feature, deleting change data automatically based on the retention period.

Change data capture also has disadvantages:

  • CDC does not track the time of each change.
  • CDC does not track the security context of each change.
  • CDC does not save how the data changed; it only tracks that a change was made.
  • CDC may add slight overhead to the system, depending on the volume of changes.

Options to Consume Change Data Capture

Once you have configured CDC in Oracle, SQL Server, or another database, you may be wondering: how can you consume and process the CDC events? How can you implement a streaming architecture with CDC? What are your options for building CDC stream processors?

At the time of this writing, here are a few options to consider, listed in no particular order:

  1. Debezium – Open source. Built on the Kafka Connect framework. All of Debezium’s connectors are Kafka Connect source connectors, so they carry the pros and cons associated with Kafka Connect.
  2. StreamSets – Open source. Out-of-the-box support for all the CDC implementations described here, as well as others.
  3. Attunity – Proprietary.
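As a concrete example of the first option, a Debezium MySQL source connector is registered with Kafka Connect using a JSON configuration along these lines. The hostnames, credentials, and table names below are placeholders; property names follow the Debezium MySQL connector documentation of this era, and newer releases have renamed some (e.g., database.server.name became topic.prefix).

```json
{
  "name": "orders-mysql-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql.example.com",
    "database.port": "3306",
    "database.user": "cdc_user",
    "database.password": "********",
    "database.server.id": "184054",
    "database.server.name": "ecommerce",
    "table.whitelist": "shop.orders",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.shop"
  }
}
```

Posting this JSON to the Kafka Connect REST API starts a connector that reads the MySQL binlog and publishes change events to Kafka topics, one topic per captured table.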

In addition, you may consider vendor-specific options as well, such as:

  1. Oracle GoldenGate
  2. DynamoDB Streams

Conclusion

Change data capture is available out of the box in many database systems such as Microsoft SQL Server, Oracle, MySQL, PostgreSQL, MongoDB (oplog), and DynamoDB (DynamoDB Streams). CDC is prevalent in streaming architectures when implementing a separation of concerns between transaction processing and other concerns such as analytics, search indexing, machine learning, AI, and near-real-time monitoring and alerting.

Hope this helps!

Let me know if you have any questions or concerns.

References

[1] SQL Server CDC: https://docs.microsoft.com/en-us/sql/relational-databases/track-changes/about-change-data-capture-sql-server?view=sql-server-2017

[2] Oracle CDC: https://docs.oracle.com/cd/E11882_01/server.112/e25554/cdc.htm

[3] MySQL binlog: https://dev.mysql.com/doc/internals/en/binary-log-overview.html

[4] PostgreSQL WAL: https://www.postgresql.org/docs/11/runtime-config-wal.html

Dec 4, 2018 – Todd M