Datastream’s BigQuery Append only

 

Datastream Documentation

DataStream’s BigQuery append only CDC makes historical data monitoring in BigQuery easier. Organizations frequently struggle to maintain both an up-to-date “source of truth” and the capacity to follow the whole history of changes made to their data. Change data capture (CDC) is a popular method for managing this data in operational databases like MySQL or PostgreSQL. It replicates the changes to a cloud data warehouse like BigQuery.

A new feature called BigQuery append only mode has been added to Data stream, Google Cloud‘s serverless CDC service. This feature makes it easier to replicate changes from your operational databases to BigQuery. This feature provides a productive and economical method of tracking changes to operational data over time and keeping track of historical records.

BigQuery append only

When a record is modified or removed in your source database using conventional CDC-based replication, the matching record in the destination is rewritten, making it challenging to trace the history of changes. This problem is solved by appending only, which maintains each modification as a new row in your BigQuery target table. You can use the metadata to sort and filter the data as needed. Each row contains the type of modification (insert, update, or delete), a unique identifier, a timestamp, and other pertinent information.

BigQuery append only Applications

Append only database

BigQuery append only mode comes in very handy in situations when you have to keep a history of changes. Common usage scenarios include the following:

  • Auditing and compliance: Keep track of any data modifications for internal audits or regulatory compliance.
  • Trend analysis: Examine past data to find trends, anomalies, and patterns throughout time.
  • Customer 360: Monitor changes in customer data to keep an all-encompassing picture of client interactions and preferences.
  • Examining embedding drift: You can examine how embeddings have changed over time and evaluate the effects on the functionality of your model by using a historical record of embeddings.
  • Time travel: Conduct historical analysis and comparisons by querying your data warehouse as it existed at a certain moment in time.

As an illustration

Let’s say you require MySQL to serve as your main source of truth since you keep customer information in a MySQL table. For the purpose of analyzing behavior and preferences, your analytics team must monitor modifications to client records. All inserts, updates, and deletes made to this table when append-only mode is enabled will be documented as new rows in the related BigQuery table. This makes it easier to get the data that the data analytics team needs for analysis.

The append-only mode’s advantages

Economy of cost: minimizes processing expenses by simply adding new rows rather than combining existing data with complicated merge processes.

Increased data accuracy: Reduces the possibility of data loss by guaranteeing a thorough and accurate history of modifications.

Real-time insights: Allows for the examination of changes in real-time as they happen, which speeds up decision-making.

Using append-only mode: A guide

Enabling BigQuery append only mode during the stream creation process through the user interface or API is a simple process. To enable you to track changes, DataStream automatically creates BigQuery tables with the necessary metadata fields.

BigQuery destination configuration

Configure destination datasets

Configuring datasets for BigQuery allows you to choose from the following:

  • The dataset for each schema is selected or produced in BigQuery based on the source schema name. Therefore, Datastream automatically produces a BigQuery dataset for each schema in the source.
  • Datastream creates datasets in the stream project if you select this option.
  • Datastream produces the mydb dataset and employees table in BigQuery from a MySQL source.
  • One dataset for all schemas: BigQuery datasets can be chosen for streams. This dataset receives all data from DataStream. For the selected dataset, DataStream produces all tables as _.
  • If your MySQL source has a mydb database and an employees table, Datastream generates mydb_employeestable in the dataset you select.

Write behaviour

Each table sent to BigQuery has a STRUCT field named datastream_metadata.

If a table has a source primary key, the column contains these fields:

  • UUID: String data type.
  • This field is integer.
    The column has IS_DELETED if a table has no primary key.
  • This parameter is true and specifies whether Datastream streams data from a DELETE action at the source.
  • Append-only tables lack primary keys.
  • BigQuery allows 20 MB events per stream.

Change writing mode

You can choose how DataStream writes change data to BigQuery when configuring your stream:

  • Merge: The default write mode. BigQuery displays your source database data when selected. Datastream sends all data updates to BigQuery, which consolidates them with existing data to create replicas of the source tables. Merge mode erases change history.
  • When you insert and update a row, BigQuery only keeps the modified data. Once you delete the row from the source table, BigQuery no longer stores it.
  • Append-only: In append-only write mode, BigQuery receives a stream of modifications (INSERT, UPDATE-INSERT, UPDATE-DELETE, and DELETE events). Use this mode to preserve data history.

BigQuery append only write mode

Consider the following scenarios to understand BigQuery append only write mode:

  • BigQuery writes two rows when a primary key changes.
  • UPDATE-DELETE row with main key-original.
  • UPDATE-INSERT row with new main key.
  • One UPDATE-INSERT row is written to BigQuery when you update a row.
  • A single DELETE row is written to BigQuery when you delete a row.

BigQuery Destination Table

Using BigQuery tables with max_staleness

DataStream employs BigQuery’s built-in upsert functionality for near-real-time ingestion, updating, inserting, and removing data. Upset operations update BigQuery destinations as rows are added, updated, or deleted. DataStream uses BigQuery Storage Write API to upset these operations into the destination table.

Set data staleness limit

BigQuery modifies sources in the background or at query run time based on the data staleness limit. DataStream sets the table’s max_staleness option to the stream’s data staleness limit when creating a BigQuery table.

Control BigQuery costs

BigQuery and DataStream expenses are separate. See BigQuery CDC pricing for BigQuery cost control.

In brief

The BigQuery append only mode of DataStream is a useful complement to Google Cloud‘s data integration and replication features. Businesses are enabled to obtain more insights from their data, enhance data accuracy, and optimize their data pipelines by means of a streamlined change data gathering process and a comprehensive history of modifications. See the documentation for further information.

Post a Comment

0 Comments