AWS Big Data Blog

How Skroutz handles real-time schema evolution in Amazon Redshift with Debezium

This guest post was co-authored with Kostas Diamantis from Skroutz.

At Skroutz, we are passionate about our product, and it is always our top priority. We are constantly working to improve and evolve it, supported by a large and talented team of software engineers. Our product’s continuous innovation and evolution lead to frequent updates, often necessitating changes and additions to the schemas of our operational databases.

When we decided to build our own data platform to meet our data needs, such as supporting reporting, business intelligence (BI), and decision-making, the main challenge—and also a strict requirement—was to make sure it wouldn’t block or delay our product development.

We chose Amazon Redshift to promote data democratization, empowering teams across the organization with seamless access to data, enabling faster insights and more informed decision-making. This choice supports a culture of transparency and collaboration, as data becomes readily available for analysis and innovation across all departments.

However, keeping the data warehouse in sync with schema changes in our operational databases, without constantly coordinating with development teams, delaying releases, or risking data loss, became a new challenge for us.

In this post, we share how we handled real-time schema evolution in Amazon Redshift with Debezium.

Solution overview

Most of our data resides in our operational databases, such as MariaDB and MongoDB. Our approach involves using the change data capture (CDC) technique, which automatically handles the schema evolution of the data stores being captured. For this, we used Debezium along with a Kafka cluster. This solution enables schema changes to be propagated without disrupting the Kafka consumers.

However, handling schema evolution in Amazon Redshift became a bottleneck, prompting us to develop a strategy to address this challenge. It’s important to note that, in our case, changes in our operational databases primarily involve adding new columns rather than breaking changes like altering data types. Therefore, we have implemented a semi-manual process to resolve this issue, along with a mandatory alerting mechanism to notify us of any schema changes. This two-step process consists of handling schema evolution in real time and handling data updates in an asynchronous manual step. The following architectural diagram illustrates a hybrid deployment model, integrating both on-premises and cloud-based components.

End-to-end data migration workflow from on-premises databases to AWS cloud using CDC, messaging, and data warehouse services

The data flow begins with data from MariaDB and MongoDB, captured by Debezium for CDC in near real-time mode. The captured changes are streamed to a Kafka cluster, where Kafka consumers (built on the Ruby Karafka framework) read them and write them to the staging area, either in Amazon Redshift or Amazon Simple Storage Service (Amazon S3). From the staging area, DataLoaders promote the data to production tables in Amazon Redshift. At this stage, we apply the slowly changing dimension (SCD) concept to these tables, using Type 7 for most of them.

In data warehousing, an SCD is a dimension that stores data which, although generally stable, can change over time. Various methodologies address the complexities of SCD management. SCD Type 7 places both the surrogate key and the natural key into the fact table, which allows the user to select the appropriate dimension records based on the following (see the query sketch after this list):

  • The primary effective date on the fact record
  • The most recent or current information
  • Other dates associated with the fact record
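For illustration, the following is a minimal query sketch against the production.shops layout shown later in this post, assuming dw_current is a boolean flag and dw_end_date is NULL for the open (current) row; the timestamp literal is arbitrary and would normally come from the fact record:

-- Current view: only the latest version of each shop
SELECT id, name, state
FROM production.shops
WHERE dw_current = true;

-- Point-in-time view: the version of shop 1 that was effective at a given moment
SELECT id, name, state
FROM production.shops
WHERE id = 1
  AND dw_start_date <= '2024-01-01 14:00:00'
  AND (dw_end_date IS NULL OR dw_end_date > '2024-01-01 14:00:00');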

Afterwards, analytical jobs are run to create reporting tables, enabling BI and reporting processes. The following diagram provides an example of the data modeling process from a staging table to a production table.

Database schema evolution: staging.shops to production.shops with added temporal and versioning columns

The architecture depicted in the diagram shows only our CDC pipeline, which fetches data from our operational databases; it doesn’t include other pipelines, such as those that fetch data through APIs, run scheduled batch processes, and so on. Also note that, by convention, dw_* columns are used to capture SCD metadata and other metadata in general. In the following sections, we discuss the key components of the solution in more detail.

Real-time workflow

For the schema evolution part, we focus on the column dw_md_missing_data, which captures, in near real time, schema changes that occur in the source databases. When a new change is produced to the Kafka cluster, the Kafka consumer is responsible for writing this change to the staging table in Amazon Redshift. For example, a message produced by Debezium to the Kafka cluster will have the following structure when a new shop entity is created:

{
  "before": null,
  "after": {
    "id": 1,
    "name": "shop1",
    "state": "hidden"
  },
  "source": {
    ...
    "ts_ms": "1704114000000",
    ...
  },
  "op": "c",
  ...
}

The Kafka consumer is responsible for preparing and executing the SQL INSERT statement:

INSERT INTO staging.shops (
  id,
  "name",
  state,
  dw_md_changed_at,
  dw_md_operation,
  dw_md_missing_data
)
VALUES
  (
    1,
    'shop1',
    'hidden',
    '2024-01-01 13:00:00',
    'create',
    NULL
  )
;

After that, let’s say a new column called new_column is added to the source table, with the value new_value for this record. The new message produced to the Kafka cluster will have the following format:

{
  "before": { ... },
  "after": {
    "id": 1,
    "name": "shop1",
    "state": "hidden",
    "new_column": "new_value"
  },
  "source": {
    ...
    "ts_ms": "1704121200000"
    ...
  },
  "op": "u"
  ...
}

Now the SQL INSERT statement executed by the Kafka consumer will be as follows:

INSERT INTO staging.shops (
  id,
  "name",
  state,
  dw_md_changed_at,
  dw_md_operation,
  dw_md_missing_data
)
VALUES
  (
    1,
    'shop1',
    'hidden',
    '2024-01-01 15:00:00',
    'update',
    JSON_PARSE('{"new_column": "new_value"}') /* <-- check this */
  )
;

The consumer performs an INSERT as it would for the known schema, and anything new is added to the dw_md_missing_data column as key-value JSON. After the data is promoted from the staging table to the production table, it will have the following structure.

Production.shops table displaying temporal data versioning with creation, update history, and current state indicators

At this point, the data flow continues running without any data loss or the need for communication with teams responsible for maintaining the schema in the operational databases. However, this data might not be easily accessible for the data consumers, analysts, or other personas. It’s worth noting that dw_md_missing_data is defined as a column of the SUPER data type, which was introduced in Amazon Redshift to store semistructured data or documents as values.
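Until the manual remediation described later is applied, the new attribute can still be queried ad hoc by navigating into the SUPER column with dot notation, as in the following sketch (reusing the new_column example from this post):

SELECT
  id,
  dw_md_missing_data.new_column::varchar(255) AS new_column
FROM production.shops
WHERE dw_md_missing_data.new_column IS NOT NULL;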

Monitoring mechanism

To track new columns added to a table, we have a scheduled process that runs weekly. This process checks for tables in Amazon Redshift with values in the dw_md_missing_data column and generates a list of tables requiring manual action to make this data available through a structured schema. A notification is then sent to the team.
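As a rough sketch of such a check, a per-table probe like the following is enough to drive the notification; the table name is just the running example, and the job would repeat the check for every CDC-fed table and report any non-empty result:

SELECT
  'production.shops' AS table_name,
  COUNT(*) AS rows_with_missing_data
FROM production.shops
WHERE dw_md_missing_data IS NOT NULL
HAVING COUNT(*) > 0;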

Manual remediation steps

In the aforementioned example, the manual steps to make this column available would be:

  1. Add the new columns to both staging and production tables:
ALTER TABLE staging.shops ADD COLUMN new_column varchar(255);
ALTER TABLE production.shops ADD COLUMN new_column varchar(255);
  2. Update the Kafka consumer’s known schema. In this step, we just need to add the new column name to a simple array list. For example:
class ShopsConsumer < ApplicationConsumer
  SOURCE_COLUMNS = [
    'id',
    'name',
    'state',
    'new_column' # this one is the new column
  ]
 
  def consume
    # Ruby code for:
    #   1. data cleaning
    #   2. data transformation
    #   3. preparation of the SQL INSERT statement
 
    RedshiftClient.conn.exec <<~SQL
      /*
        generated SQL INSERT statement
      */
    SQL
  end
end
  3. Update the DataLoader’s SQL logic for the new column. A DataLoader is responsible for promoting the data from the staging area to the production table.
class DataLoader::ShopsTable < DataLoader::Base
  class << self
    def load
      RedshiftClient.conn.exec <<~SQL
        CREATE TABLE staging.shops_new (LIKE staging.shops);
      SQL
 
      RedshiftClient.conn.exec <<~SQL
        /*
          We move the data to a new table because in staging.shops
          the Kafka consumer will continue adding new rows
        */
        ALTER TABLE staging.shops_new APPEND FROM staging.shops;
      SQL
 
      RedshiftClient.conn.exec <<~SQL
        BEGIN;
          /*
            SQL to handle
              * data deduplication, etc.
              * more transformations
              * all the necessary operations in order to apply the data modeling we need for this table
          */
 
          INSERT INTO production.shops (
            id,
            name,
            state,
            new_column, /* --> this one is the new column <-- */
            dw_start_date,
            dw_end_date,
            dw_current,
            dw_md_changed_at,
            dw_md_operation,
            dw_md_missing_data
          )
          SELECT
            id,
            name,
            state,
            new_column, /* --> this one is the new column <-- */
            /*
              expressions for the remaining dw_* columns go here, implementing
              the SCD logic (Type 1, 2, 3, 4 ... 7) chosen for this table
            */
          FROM
            staging.shops_new
          ;
 
          DROP TABLE staging.shops_new;
        END TRANSACTION;
      SQL
    end
  end
end
  4. Transfer the data that has been loaded in the meantime from the dw_md_missing_data SUPER column to the newly added column and then clean up. In this step, we just need to run a data migration like the following:
BEGIN;
 
  /*
    Transfer the data from the `dw_md_missing_data` to the corresponding column
  */
  UPDATE production.shops
  SET new_column = dw_md_missing_data.new_column::varchar(255)
  WHERE dw_md_missing_data.new_column IS NOT NULL;
 
  /*
    Clean up dw_md_missing_data column
  */
  UPDATE production.shops
  SET dw_md_missing_data = NULL
  WHERE dw_md_missing_data IS NOT NULL;
 
END TRANSACTION;

While performing the preceding operations, we make sure that no one else makes changes to the production.shops table, so that no new data is added to the dw_md_missing_data column in the meantime.
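One way to enforce this, assuming an explicit table lock is acceptable for the duration of the migration (pausing the DataLoader is an alternative), is sketched below, wrapping the UPDATE statements from the previous step:

BEGIN;

  /*
    Block concurrent writers (for example, the DataLoader) until this
    transaction commits, so no new values land in dw_md_missing_data
    while it is being drained and cleared
  */
  LOCK production.shops;

  /* ... the two UPDATE statements from the previous step ... */

END TRANSACTION;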

Conclusion

The solution discussed in this post enabled Skroutz to manage schema evolution in operational databases while seamlessly updating the data warehouse. This alleviated the need for constant development team coordination and removed risks of data loss during releases, ultimately fostering innovation rather than stifling it.

As the migration of Skroutz to the AWS Cloud approaches, discussions are underway on how the current architecture can be adapted to align more closely with AWS-centered principles. To that end, one of the changes being considered is Amazon Redshift streaming ingestion from Amazon Managed Streaming for Apache Kafka (Amazon MSK) or open source Kafka, which will make it possible for Skroutz to process large volumes of streaming data from multiple sources with low latency and high throughput to derive insights in seconds.
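As a rough illustration of what streaming ingestion could look like (the ARNs, schema, and topic names below are hypothetical, and the exact syntax should be validated against the current Amazon Redshift documentation), an external schema is mapped to the MSK cluster and a materialized view consumes the topic:

CREATE EXTERNAL SCHEMA msk_cdc
FROM MSK
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-streaming-role'
AUTHENTICATION iam
CLUSTER_ARN 'arn:aws:kafka:eu-central-1:123456789012:cluster/example-cluster/example-uuid';

CREATE MATERIALIZED VIEW shops_stream AUTO REFRESH YES AS
SELECT
  kafka_partition,
  kafka_offset,
  kafka_timestamp,
  JSON_PARSE(kafka_value) AS payload
FROM msk_cdc."shops"
WHERE CAN_JSON_PARSE(kafka_value);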

If you face similar challenges, talk to an AWS representative and work backward from your use case to find the most suitable solution.


About the authors

Konstantina Mavrodimitraki is a Senior Solutions Architect at Amazon Web Services, where she assists customers in designing scalable, robust, and secure systems in global markets. With deep expertise in data strategy, data warehousing, and big data systems, she helps organizations transform their data landscapes. A passionate technologist and people person, Konstantina loves exploring emerging technologies and supports the local tech communities. Additionally, she enjoys reading books and playing with her dog.

Kostas Diamantis is the Head of Data Warehouse at Skroutz. With a background in software engineering, he transitioned into data engineering, using his technical expertise to build scalable data solutions. Passionate about data-driven decision-making, he focuses on optimizing data pipelines, enhancing analytics capabilities, and driving business insights.