Reverse ETL: Exporting Data from BigQuery to Bigtable



The line between databases and analytics has blurred as AI and real-time data integration bring analytics platforms like BigQuery into operational systems. Customers tell us they choose BigQuery because it easily integrates many data sources, enriches data with AI and ML, and lets them manipulate warehouse data directly with Pandas. They also tell us they need pre-processed BigQuery data to be available for fast retrieval in operational systems that serve large datasets with millisecond query performance.

To bridge analytics and operational systems and deliver real-time query latency, EXPORT DATA to Bigtable (reverse ETL) is now generally available. Combined with Bigtable's highly performant data format, anyone with SQL skills can instantly translate their BigQuery analysis into a serving table, access it with single-digit millisecond latency at high QPS, and replicate it globally so the data sits closer to users.
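
At its core, the export is a single SQL statement. The following is a minimal sketch with placeholder identifiers (the full option list is in the EXPORT DATA to Bigtable documentation): the query result includes a rowkey column for the Bigtable row key, and STRUCT columns map to Bigtable column families.
EXPORT DATA OPTIONS (
  uri = 'https://bigtable.googleapis.com/projects/[PROJECT_ID]/instances/[INSTANCE_ID]/appProfiles/[APP_PROFILE_ID]/tables/[TABLE_ID]',
  format = 'CLOUD_BIGTABLE'
) AS
SELECT
  CAST(id AS STRING) AS rowkey,        -- becomes the Bigtable row key
  STRUCT(col_a, col_b) AS my_family    -- struct fields become columns in the my_family column family
FROM [DATASET].my_table;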

This blog post walks through three use cases and architectures for automated, on-demand exports of data from BigQuery to Bigtable:
  1. Serving applications in real time
  2. ML-enhanced streaming data
  3. Building real-time metrics over huge datasets by backloading data sketches

Serving applications in real time

Bigtable complements BigQuery for real-time applications. BigQuery's storage format is optimized for OLAP queries such as counting and aggregation. BigQuery BI Engine intelligently caches your most frequently used data to speed up ad hoc analysis for real-time applications. BigQuery search indexes can also be used for text lookups, locating JSON and other non-keyed rows that require text filtering.

BigQuery, however, is a versatile analytics platform and is not designed to serve real-time applications the way Bigtable is. With OLAP-based storage, accessing many columns within a single row or a range of rows can be challenging. Bigtable's storage layout is built for exactly those access patterns, which makes it ideal for operational applications.

Use Bigtable as a serving layer if any of the following are required by your application:

  • Consistent, predictable row lookups with single-digit millisecond response times
  • High query rates (throughput scales linearly with nodes)
  • Low-latency writes from the application
  • Worldwide deployments (data automatically replicated close to your users)

Reverse ETL lowers query latency by making it easy to move warehouse table data into this real-time serving architecture.

Step 1: Create the Bigtable instance and serving table

Follow the instructions to create a Bigtable instance, the container for Bigtable data. When creating the instance, you must choose between SSD and HDD storage: HDD can save money while you're learning Bigtable, but SSD is faster and better suited for production. Creating an instance also creates your first cluster, and that cluster must be in the same region as the BigQuery dataset you are exporting. You can add clusters in other regions that automatically receive replicated data from the cluster BigQuery writes to.

Once your instance and cluster are ready, create the Bigtable table that serves as the BigQuery sink in the reverse ETL process. In the console, select Tables from the left navigation panel, then select Create table at the top of the Tables screen.

On the Create a table screen, simply enter the table ID BQ_SINK and click Create. You can skip defining column families here; they will be created when you run the BigQuery Reverse ETL export in step 3.

Alternatively, after connecting to your instance with the cbt CLI, you can run cbt createtable BQ_SINK.

Step 2: Create an app profile for BigQuery Reverse ETL

Bigtable app profiles manage how requests are handled. Consider isolating BigQuery exports into their own app profile. Enable single-cluster routing in this profile so data lands in the same region as your BigQuery dataset, and set it to low priority so exports do not interfere with your Bigtable application's primary traffic.

This gcloud command creates a Bigtable app profile with those settings:
gcloud bigtable app-profiles create BQ_APP_PROFILE \
  --project=[PROJECT_ID] \
  --instance=[INSTANCE_ID] \
  --description="Profile for BigQuery Reverse ETL" \
  --route-to=[CLUSTER_IN_SAME_REGION_AS_BQ_DATASET] \
  --transactional-writes \
  --priority=PRIORITY_LOW
Once this command runs, the profile should appear in the Application profiles section of the Bigtable console.

Step 3: Export data for the application with SQL

Let's explore some data in BigQuery and prepare the results for use in an art application. We'll use the the_met.objects table from the BigQuery public datasets, which contains structured metadata about each artwork in The Met's collection. From it, we'll produce two main components for the art application:

  • Artist profile: a small, structured record of artist details for quick access within the application.
  • Gen AI artwork description: a narrative description of the artwork, generated by Gemini from the table's metadata with Google Search providing additional context.

Setting up Gemini in BigQuery

If this is your first time using Gemini with BigQuery, set up the integration by following the steps to create a connection to Vertex AI. Then use the following BigQuery statement to create a model object in your dataset that points to the remote Vertex AI connection:
CREATE MODEL [DATASET].model_cloud_ai_gemini_pro
REMOTE WITH CONNECTION `us.bqml_llm_connection`
OPTIONS (endpoint = 'gemini-pro');
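
With the model in place, the export itself is an EXPORT DATA statement that writes the query results into BQ_SINK through the BQ_APP_PROFILE app profile. The original post's query is not reproduced here, so the following is only a sketch of what it could look like: the Met column names, the prompt, and the option names are illustrative and should be checked against the EXPORT DATA to Bigtable and ML.GENERATE_TEXT documentation.
EXPORT DATA OPTIONS (
  -- Placeholder project and instance IDs; writes are routed through the low-priority app profile
  uri = 'https://bigtable.googleapis.com/projects/[PROJECT_ID]/instances/[INSTANCE_ID]/appProfiles/BQ_APP_PROFILE/tables/BQ_SINK',
  format = 'CLOUD_BIGTABLE',
  auto_create_column_families = TRUE  -- creates the artist_info and generated_description families if missing
) AS
SELECT
  CAST(object_id AS STRING) AS rowkey,                -- Bigtable row key
  STRUCT(artist_display_name AS name,                 -- becomes the artist_info column family
         artist_nationality AS nationality,
         artist_begin_date AS begin_date,
         artist_end_date AS end_date) AS artist_info,
  STRUCT(ml_generate_text_llm_result) AS generated_description  -- Gemini output column family
FROM ML.GENERATE_TEXT(
  MODEL [DATASET].model_cloud_ai_gemini_pro,
  (
    SELECT object_id, artist_display_name, artist_nationality,
           artist_begin_date, artist_end_date,
           CONCAT('Write a short narrative description of this artwork: ', title,
                  ' (', IFNULL(medium, ''), ', ', IFNULL(object_date, ''), ')') AS prompt
    FROM `bigquery-public-data.the_met.objects`
    LIMIT 1000                                        -- keep the example export small
  ),
  STRUCT(TRUE AS flatten_json_output)
);

Each artwork then becomes a Bigtable row keyed by its object ID, with one column family per STRUCT in the SELECT list.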

Step 4: Use Bigtable's low-latency serving table in a GoogleSQL query

The art application can now use this pre-processed artwork data. In the Bigtable console, open Bigtable Studio and then the Editor from the left navigation menu, and test the low-latency serving query for your application with this SQL:
SELECT _key, artist_info,
  generated_description['ml_generate_text_llm_result'] AS generated_description
FROM BQ_SINK

This Bigtable SQL statement returns the artist profile as a single object along with the generated text description the application needs. You can integrate this serving table into your application using the Bigtable client libraries, which are available for C++, C#, Go, Java, HBase, Node.js, PHP, Python, and Ruby.
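
In the application's serving path, a point lookup by row key keeps responses in the single-digit millisecond range. A hypothetical example follows (the key literal is illustrative; adjust it to your own row keys, which Bigtable SQL exposes as bytes):
SELECT _key, artist_info,
  generated_description['ml_generate_text_llm_result'] AS generated_description
FROM BQ_SINK
WHERE _key = b'12345'  -- look up a single artwork by its object ID row key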

Enhancing streaming ML data with Dataflow and Bigtable

Another common use case for BigQuery-Bigtable reverse ETL is feeding ML inference models with historical data, such as a customer's purchase history. BigQuery's historical data can be used to build models for fraud detection, recommendation systems, and other applications. Knowing what is in a customer's shopping basket, or whether they have browsed similar items, adds context to the clickstream data a recommendation system consumes. Identifying a fraudulent in-store credit card transaction requires additional details, such as the location of a previous purchase, the number of recent transactions, or whether a travel notification is in place. Bigtable lets you join this historical data to Kafka or Pub/Sub event data in real time and at high throughput.

To accomplish this, use Dataflow and its built-in Enrichment transform for Bigtable. These architectures can be built with just a few lines of code!

Backloading data sketches

A data sketch is a compact summary of a data aggregation that contains all the information needed to extract a result, continue the aggregation, or merge it with another sketch for re-aggregation. For data sketching, Bigtable's conflict-free replicated data types (CRDTs) make it possible to count across a distributed system, something analytics, machine learning, and real-time event stream processing all depend on.

Aggregations are hard to manage in traditional distributed systems because accuracy is usually traded off against speed, and vice versa. Distributed counting with Bigtable's aggregate data types is both accurate and efficient. These purpose-built column families let each server update its local counter independently, without performance-sapping locks, while mathematical properties guarantee the updates converge to the correct final value regardless of order. Operational reporting, personalization, and fraud detection all rely on these aggregate data types.

These data types integrate easily with BigQuery data sketches (when the same sketch type is available in Bigtable) and EXPORT DATA. This is crucial when you want to backload your new application with historical data, or update a real-time counter from a source other than streaming ingestion.

To take advantage of this functionality, add an aggregate column family to your Bigtable table with a single command and then export the data, as in the sketch below.
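
The original code sample is not reproduced here, so the following is a hedged sketch, assuming an HLL aggregate column family named counts has already been added to BQ_SINK and using illustrative Met column names:
EXPORT DATA OPTIONS (
  uri = 'https://bigtable.googleapis.com/projects/[PROJECT_ID]/instances/[INSTANCE_ID]/appProfiles/BQ_APP_PROFILE/tables/BQ_SINK',
  format = 'CLOUD_BIGTABLE'
) AS
SELECT
  department AS rowkey,                                                   -- one row per Met department
  STRUCT(HLL_COUNT.INIT(artist_display_name) AS artist_count) AS counts   -- HLL++ sketch of distinct artists
FROM `bigquery-public-data.the_met.objects`
WHERE department IS NOT NULL
GROUP BY department;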

This uses the data sketch's SQL functions on BigQuery's historical data to estimate the number of artists (HLL_COUNT.INIT builds the sketch, and HLL_COUNT.EXTRACT reads out the estimate). You can then run the HLL count on Bigtable and apply real-time updates on top of this batch backload.

What comes next?

Reverse ETL between BigQuery and Bigtable lowers query latency for real-time applications, but there is more to come! We are also working on data freshness for real-time architectures with continuous queries. Continuous queries, currently in preview, let you replicate BigQuery data into Bigtable and other destinations as it arrives. StreamingDataFrames are also ready for testing, so you can work with Python transformations in BigQuery DataFrames (BigFrames).



