Iceberg Table Replication using Amazon Data Firehose

Amazon Data Firehose

Dependable real-time stream loading into data lakes, warehouses, and analytics services.
  • Easily capture, transform, and load streaming data. Select your destination, set up a delivery stream, and start streaming data in real time with only a few clicks.
  • Automatically provision and scale compute, memory, and network resources without ongoing administration.
  • Transform raw streaming data into formats like Apache Parquet and dynamically partition streaming data without building your own processing pipelines.

How it works

Amazon Data Firehose provides the easiest way to acquire, transform, and deliver data streams to data lakes, data warehouses, and analytics services within seconds. To use Amazon Data Firehose, you set up a stream with a source, a destination, and any required transformations. Amazon Data Firehose processes the stream continuously, scales automatically based on the amount of data available, and delivers it within seconds.
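
For example, a minimal sketch of creating a stream with the AWS SDK for Python (boto3) might look like the following, using a Direct PUT source and an Amazon S3 destination. The stream name, role, and bucket ARNs are hypothetical placeholders.

  import boto3

  firehose = boto3.client("firehose")

  # Create a stream that accepts Direct PUT records and delivers them to S3.
  # All names and ARNs below are hypothetical placeholders.
  firehose.create_delivery_stream(
      DeliveryStreamName="clickstream-to-s3",
      DeliveryStreamType="DirectPut",
      ExtendedS3DestinationConfiguration={
          "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
          "BucketARN": "arn:aws:s3:::my-analytics-bucket",
          # Buffer up to 64 MB or 60 seconds before writing an object to S3.
          "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 60},
      },
  )

Once the stream is active, producers can write records to it while Data Firehose handles buffering, scaling, and delivery.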

Source

Select the source of your data stream, such as a topic in Amazon Managed Streaming for Apache Kafka (Amazon MSK), a stream in Amazon Kinesis Data Streams, or data written with the Firehose Direct PUT API. Because Amazon Data Firehose is integrated with over 20 AWS services, you can also create a stream from sources such as Amazon CloudWatch Logs, AWS WAF web ACL logs, AWS Network Firewall logs, Amazon SNS, or AWS IoT.
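
As an illustration, writing a record to a stream with the Direct PUT API might look like this sketch; the stream name and payload are hypothetical.

  import json
  import boto3

  firehose = boto3.client("firehose")

  # Send a single JSON record; put_record_batch can send up to 500 records per call.
  firehose.put_record(
      DeliveryStreamName="clickstream-to-s3",   # hypothetical stream name
      Record={"Data": (json.dumps({"user_id": "u-42", "action": "click"}) + "\n").encode()},
  )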

Data transformation (optional)

You can choose whether to decompress the data, run custom data transformations with your own AWS Lambda function, convert your data stream into formats like ORC or Parquet, or dynamically partition input records based on attributes to deliver them to different destinations.
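
If you use an AWS Lambda function for custom transformation, the function receives base64-encoded records and must return each one with a status. The sketch below shows the standard record format expected by Firehose; the transformation itself (uppercasing a field) is only a hypothetical example.

  import base64
  import json

  # Lambda handler for Firehose data transformation. Each record must be returned
  # with its original recordId, a result of "Ok", "Dropped", or "ProcessingFailed",
  # and base64-encoded data.
  def lambda_handler(event, context):
      output = []
      for record in event["records"]:
          payload = json.loads(base64.b64decode(record["data"]))
          payload["action"] = payload.get("action", "").upper()  # hypothetical transformation
          output.append({
              "recordId": record["recordId"],
              "result": "Ok",
              "data": base64.b64encode((json.dumps(payload) + "\n").encode()).decode(),
          })
      return {"records": output}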

Destination

Select a destination for your stream, such as Amazon S3, Amazon Redshift, Amazon OpenSearch Service, Splunk, Snowflake, or a custom HTTP endpoint.

Use cases

Stream into data lakes and warehouses

Stream data into Amazon S3 and convert it into the formats required for analysis without building processing pipelines.
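
Building on the earlier sketch, converting incoming JSON records to Apache Parquet is enabled on the S3 destination by pointing Data Firehose at a schema in the AWS Glue Data Catalog. The Glue database, table, and role names below are hypothetical.

  # Fragment passed as the DataFormatConversionConfiguration key of
  # ExtendedS3DestinationConfiguration when creating or updating the stream.
  data_format_conversion = {
      "Enabled": True,
      "SchemaConfiguration": {
          "DatabaseName": "analytics_db",   # hypothetical Glue database
          "TableName": "clickstream",       # hypothetical Glue table
          "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
      },
      "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
      "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
  }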

Boost security

Monitor network security in real time and create alerts when potential threats arise using supported Security Information and Event Management (SIEM) tools.

Build ML streaming applications

Enrich your data streams with machine learning (ML) models to analyze data and predict inference endpoints as streams move to their destination.

Replicate database changes to Apache Iceberg tables with Amazon Data Firehose (in preview)

Today marks the preview release of a new capability in Amazon Data Firehose that captures changes made to databases such as PostgreSQL and MySQL and replicates those updates to Apache Iceberg tables on Amazon Simple Storage Service (Amazon S3).

Apache Iceberg is a high-performance open-source table format for big data analytics. It brings the simplicity and reliability of SQL tables to S3 data lakes and makes it possible for open-source analytics engines such as Apache Spark, Apache Flink, Trino, Apache Hive, and Apache Impala to work concurrently with the same data.

This new capability provides a simple, end-to-end way to stream database updates without impacting the transaction performance of database applications. You can set up a Data Firehose stream in minutes to deliver change data capture (CDC) updates from your database. Now, you can easily replicate data from different databases into Apache Iceberg tables on Amazon S3 and access up-to-date data for large-scale analytics and machine learning (ML) applications.

Typical Amazon Web Services (AWS) enterprise customers use hundreds of databases for transactional applications. To perform large-scale analytics and machine learning on the most recent data, they want to capture changes made in their databases, such as records inserted, modified, or deleted in a table, and deliver those updates to their data warehouse or Amazon S3 data lake in open-source table formats such as Apache Iceberg.

To do this, many customers develop extract, transform, and load (ETL) jobs that periodically read from their databases. However, ETL readers impact database transaction performance, and batch jobs can add hours of delay before data is ready for analytics. To reduce the impact on database transaction performance, customers want the ability to stream changes made in the database. This stream is known as a change data capture (CDC) stream.

The initial setup and testing of such systems requires installing and configuring multiple open-source components, which can take days or weeks. Once the systems are running, engineers must validate and apply open-source updates and monitor and maintain the clusters, which adds to the operational overhead.

The new data streaming capability in Amazon Data Firehose continually replicates CDC streams from databases to Apache Iceberg tables on Amazon S3. You set up a Data Firehose stream by specifying its source and destination. Data Firehose captures an initial data snapshot and all subsequent changes made to the selected database tables as a data stream, and replicates them continually. Because Data Firehose acquires CDC streams from the database replication log, it minimizes the impact on database transaction performance.

Regardless of how the volume of database updates varies, Amazon Data Firehose automatically partitions the data and retains records until they are delivered to their destination. There is no capacity to provision, no clusters to manage, and no fine-tuning required. In addition to the data itself, Data Firehose can automatically create Apache Iceberg tables using the same schema as the database tables when the Data Firehose stream is first created, and it can dynamically evolve the target schema, for example by adding new columns in response to changes in the source schema.
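
As a rough illustration, a CDC stream from a PostgreSQL database to Apache Iceberg tables might be created with the AWS SDK for Python (boto3) as in the sketch below. Because this capability is in preview, the source and destination parameter names shown (DatabaseSourceConfiguration, IcebergDestinationConfiguration, and their fields) are assumptions and may differ from the actual API; all endpoints, names, and ARNs are hypothetical.

  import boto3

  firehose = boto3.client("firehose")

  # Sketch of a CDC stream from PostgreSQL to Apache Iceberg tables on S3.
  # Parameter and field names are assumptions for the preview API and may differ;
  # endpoints, names, and ARNs are hypothetical placeholders.
  firehose.create_delivery_stream(
      DeliveryStreamName="orders-cdc-to-iceberg",
      DeliveryStreamType="DatabaseAsSource",      # assumed source type for the preview
      DatabaseSourceConfiguration={               # assumed parameter name
          "Type": "PostgreSQL",
          "Endpoint": "orders-db.cluster-abc.us-east-1.rds.amazonaws.com",
          "Port": 5432,
          "Databases": {"Include": ["orders"]},
          "Tables": {"Include": ["public.orders*"]},
      },
      IcebergDestinationConfiguration={           # assumed parameter name
          "RoleARN": "arn:aws:iam::123456789012:role/firehose-iceberg-role",
          "CatalogConfiguration": {
              "CatalogARN": "arn:aws:glue:us-east-1:123456789012:catalog",
          },
          "S3Configuration": {
              "RoleARN": "arn:aws:iam::123456789012:role/firehose-iceberg-role",
              "BucketARN": "arn:aws:s3:::my-iceberg-lake",
          },
      },
  )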

Since Data Firehose is a fully managed service, you don't have to depend on open-source components, apply software updates, or incur operational overhead.

By continuously replicating database updates to Apache Iceberg tables in Amazon S3, Amazon Data Firehose provides a simple, scalable, end-to-end managed solution for delivering CDC streams into your data lake or data warehouse, where you can run in-depth analysis and machine learning applications.

Things to consider

Here are some additional considerations.

This new capability supports self-managed PostgreSQL and MySQL databases on Amazon EC2 as well as the following databases on Amazon RDS:
  • Amazon RDS for PostgreSQL and Amazon Aurora PostgreSQL-Compatible Edition
  • Amazon RDS for MySQL and Amazon Aurora MySQL-Compatible Edition

The team will continue to add support for additional databases during the preview period and after general availability. They told me there are already plans to support MongoDB, Oracle, and SQL Server databases.

Data Firehose uses AWS PrivateLink to connect to databases in your Amazon Virtual Private Cloud (Amazon VPC).

When setting up an Amazon Data Firehose stream, you can either specify particular tables and columns or use wildcards. When you use wildcards, if tables and columns that match the wildcard are added to the database after the Data Firehose stream is created, Data Firehose automatically creates them in the destination.
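
Extending the assumed source configuration from the sketch above, table and column selection with wildcards might look roughly like the following; the include/exclude structure and the wildcard syntax are assumptions, and all names are hypothetical.

  # Hypothetical table and column selection for the CDC source configuration.
  source_selection = {
      "Databases": {"Include": ["orders"]},
      "Tables": {
          "Include": ["public.orders*"],         # also matches tables created later
          "Exclude": ["public.orders_archive"],
      },
      "Columns": {"Exclude": ["public.orders.card_number"]},
  }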

Availability

The new data streaming capability is available today in all AWS Regions except the Asia Pacific (Malaysia), AWS GovCloud (US), and China Regions.

Pricing for Amazon Data Firehose

At the start of the preview, there are no charges for your usage. In the future, pricing will be based on your actual usage, such as the number of bytes read and delivered. There are no commitments or upfront fees. Read the pricing page to learn more.

