BigLake Tables: Unified Data Storage and Analytics

Introduction to BigLake external tables

This article introduces BigLake and assumes familiarity with database tables and Identity and Access Management (IAM). To query data stored in any of the supported data stores, you first create BigLake tables and then query them using GoogleSQL syntax:

Create Cloud Storage BigLake tables and then query them.
Create Amazon S3 BigLake tables and then query them.
Create Azure Blob Storage BigLake tables and then query them.

BigLake tables let you query structured data in external data stores with access delegation. Access delegation decouples access to the BigLake table from access to the underlying data store. An external connection associated with a service account is used to connect to the data store. Because the service account retrieves the data from the data store, users only need access to the BigLake table. This lets you enforce fine-grained security at the table level, including row-level and column-level security. For BigLake tables based on Cloud Storage, you can also use dynamic data masking. To learn more about multi-cloud analytic solutions that use BigLake tables with Amazon S3 or Blob Storage data, see BigQuery Omni.
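As a minimal sketch of how the connection ties a table to its data store, the helper below assembles a CREATE EXTERNAL TABLE statement with a WITH CONNECTION clause. The project, dataset, connection, and bucket names are hypothetical placeholders, and the helper only builds the SQL text; submitting it to BigQuery is left to whatever client you use.

```python
def biglake_table_ddl(table, connection, source_format, uris):
    """Build a CREATE EXTERNAL TABLE statement for a BigLake table.

    `table` is `project.dataset.table` and `connection` is
    `project.region.connection_id`; both are supplied by the caller.
    """
    uri_list = ", ".join(f"'{u}'" for u in uris)
    return (
        f"CREATE EXTERNAL TABLE `{table}`\n"
        f"WITH CONNECTION `{connection}`\n"
        f"OPTIONS (\n"
        f"  format = '{source_format}',\n"
        f"  uris = [{uri_list}]\n"
        f");"
    )


# Hypothetical names, for illustration only.
ddl = biglake_table_ddl(
    table="myproject.mydataset.sales",
    connection="myproject.us.my-connection",
    source_format="PARQUET",
    uris=["gs://my-bucket/sales/*.parquet"],
)
print(ddl)
```

Because the table references the connection rather than user credentials, analysts can be granted access to the table alone while the connection's service account reads the files.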

Support for temporary tables

BigLake tables based on Cloud Storage can be temporary or permanent.

BigLake tables based on Amazon S3 or Blob Storage must be permanent.

Multiple source files

You can create a BigLake table based on multiple external data sources, provided those data sources have the same schema.

Cross-cloud joins

Cross-cloud joins let you run queries that span both Google Cloud and BigQuery Omni regions. You can use GoogleSQL JOIN operations to analyze data from AWS, Azure, public datasets, and other Google Cloud services. Cross-cloud joins remove the need to copy data between sources before running queries.

You can reference BigLake tables in SELECT statements like any other BigQuery table, including in DML and DDL statements that use subqueries to retrieve data. You can use BigQuery tables and BigLake tables from different clouds in the same query. All BigQuery tables in the query must be in the same region.

Cross-cloud join required permissions

To get the permissions that you need to run a cross-cloud join, ask your administrator to grant you the BigQuery Data Editor (roles/bigquery.dataEditor) IAM role on the project where the join is executed. For more information about granting roles, see Manage access to projects, folders, and organizations.

Cross-cloud join costs

BigQuery splits a cross-cloud join query into local and remote parts. The local part is treated as a standard query in the BigQuery region. The remote part is converted into a CREATE TABLE AS SELECT (CTAS) operation on the referenced BigLake table in the BigQuery Omni region, which creates a temporary table in your BigQuery region. BigQuery then uses this temporary table to execute your cross-cloud join and deletes it automatically after eight hours.

You incur data transfer costs for the data in the referenced BigLake tables. BigQuery reduces these costs by transferring only the columns and rows of the BigLake table that the query references. Google recommends specifying a column filter that is as narrow as possible to reduce transfer costs further. The CTAS job appears in your job history and shows information such as the number of bytes transferred. Successful transfers incur costs even if the main query job fails.

For example, a query that joins an employees table (with a filter on level) with an active workers table, both in a BigQuery Omni region, has two transfers: one from each table. BigQuery performs the join in the BigQuery region after the transfers complete. If one transfer fails and the other succeeds, data transfer charges still apply for the successful transfer.
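A query of the kind described here might look like the following sketch. The dataset and table names are hypothetical, and aws_dataset is assumed to live in a BigQuery Omni region. The small helper lists the remote tables the query references, since each of them results in its own transfer.

```python
import re

# Hypothetical names; aws_dataset is assumed to be a BigQuery Omni dataset.
CROSS_CLOUD_QUERY = """
SELECT clients.*
FROM bigquery_dataset.clients AS clients
WHERE clients.sales_rep IN (
  SELECT employees.id
  FROM aws_dataset.employees AS employees
  INNER JOIN aws_dataset.active_workers AS active_workers
    ON employees.id = active_workers.id
  WHERE employees.level > 3   -- a narrow filter keeps the transfer small
)
"""


def remote_tables(sql, remote_dataset):
    """Return the distinct tables in `sql` qualified with `remote_dataset`."""
    pattern = re.compile(rf"\b{re.escape(remote_dataset)}\.(\w+)")
    seen = []
    for name in pattern.findall(sql):
        if name not in seen:
            seen.append(name)
    return seen


# Each remote table listed here is transferred by its own CTAS job.
print(remote_tables(CROSS_CLOUD_QUERY, "aws_dataset"))
```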

Cross-cloud join limitations

Cross-cloud joins aren't supported in the BigQuery free tier or in the BigQuery sandbox.
Aggregations might not be pushed down to the BigQuery Omni regions if the query contains JOIN statements.
Each temporary table is used for only one cross-cloud query and is not reused, even if the same query is repeated.
The size limit for each transfer is 60 GB. Specifically, if you filter a BigLake table and load the result, the result must be smaller than 60 GB. If needed, you can request a higher quota. There is no limit on scanned bytes.
Cross-cloud join queries are subject to an internal rate limit. If the rate of queries exceeds the quota, you might receive an "All our servers are busy processing data sent between regions" error. Retrying the query usually works. To support a higher rate of queries, contact support to increase the internal quota.
Cross-cloud joins are only supported in colocated BigQuery regions with their corresponding BigQuery Omni regions, and in the US and EU multi-regions. Cross-cloud joins that run in the US or EU multi-regions can only access data in BigQuery Omni regions.
If a cross-cloud join query references 10 or more datasets from BigQuery Omni regions, it might fail with the error "Dataset was not found in location". To avoid this issue, explicitly specify a location when you run a cross-cloud join that references more than 10 datasets. Note that if you explicitly specify a BigQuery region and your query contains only BigLake tables, the query runs as a cross-cloud query and incurs data transfer costs.
You can't query the _FILE_NAME pseudo-column with cross-cloud joins.
When you reference the columns of a BigLake table in a WHERE clause, you can't use INTERVAL or RANGE literals.
Cross-cloud join jobs don't report the number of bytes processed and transferred from other clouds. This information is available in the child CTAS jobs that are created during cross-cloud query execution.
Authorized views and authorized routines that reference BigQuery Omni tables or views are only supported in BigQuery Omni regions.
If your cross-cloud query references STRUCT or JSON columns, no pushdowns are applied to any remote subqueries. To improve performance, consider creating a view in the BigQuery Omni region that filters the STRUCT and JSON columns and returns only the necessary fields as individual columns.
Cross-cloud joins don't support time travel queries.
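Since retrying usually resolves the rate-limit error mentioned in the list above, a client can wrap query submission in a small retry loop. The following is a sketch under stated assumptions: run_query is a hypothetical zero-argument callable that submits the query and raises an exception whose message contains the error text, and the backoff values are arbitrary.

```python
import time

BUSY_MESSAGE = "All our servers are busy processing data sent between regions"


def run_with_retry(run_query, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call run_query(), retrying with exponential backoff when rate limited.

    run_query is a hypothetical callable; any exception whose text does not
    contain BUSY_MESSAGE is re-raised immediately.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query()
        except Exception as exc:
            rate_limited = BUSY_MESSAGE in str(exc)
            if not rate_limited or attempt == max_attempts:
                raise
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

The sleep parameter is injectable so the loop can be exercised without real waiting. If retries are exhausted regularly, that is a sign to request a higher internal quota rather than to increase max_attempts.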

Connectors

You can access data in BigLake tables based on Cloud Storage from other data processing tools by using BigQuery connectors. For example, you can access BigLake tables from Apache Spark, Apache Hive, TensorFlow, Trino, or Presto. The BigQuery Storage API enforces row-level and column-level governance policies on all data access to BigLake tables, including access through connectors.

For example, when an Apache Spark user reads a BigLake table, the request goes through the BigQuery Storage API, which returns only the data that the user is authorized to access.

BigLake tables on object stores

For data lake administrators, BigLake lets you set access controls on tables rather than on files, which gives you finer-grained control over user access to data in the data lake.

Because BigLake tables simplify access control in this way, Google recommends using them to build and maintain connections to external object stores.

You can use external tables when governance is not a requirement, or for ad hoc data discovery and manipulation.

Limitations

All limitations for external tables apply to BigLake tables.
BigLake tables on object stores are subject to the same limitations as BigQuery tables.
BigLake does not support downscoped credentials from Dataproc Personal Cluster Authentication. As a workaround, to use clusters with Personal Cluster Authentication, inject your credentials using an empty Credential Access Boundary with the --access-boundary=<(echo -n "{}") flag.
Example: The following command enables a credential propagation session in the project myproject for the cluster named mycluster:

gcloud dataproc clusters enable-personal-auth-session \
    --region=us \
    --project=myproject \
    --access-boundary=<(echo -n "{}") \
    mycluster

BigLake tables are read-only. You can't modify BigLake tables using DML statements or other methods.

These formats are supported by BigLake tables:

Avro
CSV
Delta Lake
Iceberg
JSON
ORC
Parquet

You can't use cached metadata with BigLake external tables for Apache Iceberg, because BigQuery already uses the metadata that Iceberg captures in manifest files.
The BigQuery Storage API is not available in other cloud environments, such as AWS and Azure.

The following limits apply to cached metadata:

Only BigLake tables that use the Avro, ORC, Parquet, JSON, or CSV format support cached metadata.
If you create, update, or delete files in Amazon S3, queries do not return the updated data until the next refresh of the metadata cache. This can lead to unexpected results. For example, if you delete a file and write a new file, your query results might exclude both the old and the new files, depending on when the cached metadata was last updated.
Using customer-managed encryption keys (CMEK) with cached metadata is not supported for BigLake tables that reference Amazon S3 or Blob Storage data.

Security model

The following organizational roles are typically involved in managing and using BigLake tables:

Data lake administrators. These administrators typically manage IAM policies on Cloud Storage buckets and objects.
Data warehouse administrators. These administrators typically create, delete, and update tables.
Data analysts. Analysts typically read data and run queries.

Data lake administrators create connections and share them with data warehouse administrators. In turn, data warehouse administrators create tables, set appropriate access controls, and share the tables with data analysts.

Metadata caching for performance

You can use cached metadata to improve query performance on BigLake tables. Metadata caching is especially helpful when you work with large numbers of files or when the data is Hive partitioned. The following types of BigLake tables support metadata caching:

Amazon S3 BigLake tables
Cloud Storage BigLake tables

The cached metadata includes file names, partitioning information, and physical metadata from files, such as row counts. You can choose whether to enable metadata caching on a table. Queries over a large number of files and queries with Hive partition filters benefit the most from metadata caching.

If you don't enable metadata caching, queries on the table must read the external data source to get object metadata. Listing millions of files from the external data source can take several minutes, which increases query latency. With metadata caching enabled, queries can avoid listing files from the external data source and can partition and prune files more quickly.

Two properties govern this feature:

Maximum staleness specifies when queries use cached metadata.
Metadata cache mode specifies how the metadata is collected.
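In table DDL, these two properties are typically expressed as entries in the table's OPTIONS list. The sketch below renders them as text, assuming the option names max_staleness and metadata_cache_mode and the cache mode values 'AUTOMATIC' and 'MANUAL'; none of these names appear in this article, so check them against the DDL reference before relying on them.

```python
def metadata_cache_options(staleness_hours, cache_mode):
    """Render the two metadata caching options for a table's OPTIONS list.

    Option names and mode values are assumptions, not taken from this article.
    """
    if cache_mode not in ("AUTOMATIC", "MANUAL"):
        raise ValueError("cache_mode must be 'AUTOMATIC' or 'MANUAL'")
    if staleness_hours <= 0:
        raise ValueError("staleness_hours must be positive")
    return (
        f"max_staleness = INTERVAL {staleness_hours} HOUR,\n"
        f"metadata_cache_mode = '{cache_mode}'"
    )


print(metadata_cache_options(4, "AUTOMATIC"))
```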

When metadata caching is enabled, you specify the maximum metadata staleness that is acceptable for operations against the table. For example, if you specify an interval of 1 hour, operations against the table use cached metadata if it was refreshed within the past hour. If the cached metadata is older than that, the operation retrieves metadata from Cloud Storage or Amazon S3 instead. You can specify a staleness interval between 30 minutes and 7 days.
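The staleness rule can be modeled in a few lines. This is an illustrative sketch of the decision described above, not BigQuery's implementation; the function and constant names are made up for the example.

```python
from datetime import datetime, timedelta, timezone

MIN_STALENESS = timedelta(minutes=30)
MAX_STALENESS = timedelta(days=7)


def uses_cached_metadata(last_refresh, max_staleness, now):
    """Return True if an operation at `now` would use the cached metadata."""
    if not MIN_STALENESS <= max_staleness <= MAX_STALENESS:
        raise ValueError("staleness interval must be between 30 minutes and 7 days")
    # The cache is usable only if it was refreshed within the staleness interval.
    return now - last_refresh <= max_staleness


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
one_hour = timedelta(hours=1)
print(uses_cached_metadata(now - timedelta(minutes=45), one_hour, now))  # cache is fresh enough
print(uses_cached_metadata(now - timedelta(hours=2), one_hour, now))     # falls back to the data store
```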

Cache refresh may be done manually or automatically:

For automatic refreshes, the cache is refreshed at a system-defined interval, usually between 30 and 60 minutes. Refreshing the cache automatically is a good approach if the files in the data store are added, deleted, or modified at random intervals. If you need to control the timing of the refresh, for example to trigger it at the end of an extract-transform-load job, use manual refresh.
For manual refreshes, you run the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure to refresh the metadata cache on a schedule that meets your requirements. For BigLake tables, you can refresh the metadata selectively by providing subdirectories of the table data directory, which avoids unnecessary metadata processing. Refreshing the cache manually is a good approach if the files in the data store are added, deleted, or modified at known intervals, for example as the output of a pipeline.
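A manual refresh is issued as a CALL statement. The helper below is a sketch that builds that statement as text; the table name and paths are hypothetical, and passing the subdirectories as an array in a second argument is an assumption about the procedure's signature that should be confirmed against its reference page.

```python
def refresh_call(table, subdirectories=None):
    """Build a CALL statement for BQ.REFRESH_EXTERNAL_METADATA_CACHE.

    `subdirectories`, if given, limits the refresh to those paths under the
    table's data directory (argument form assumed, see the note above).
    """
    stmt = f"CALL BQ.REFRESH_EXTERNAL_METADATA_CACHE('{table}'"
    if subdirectories:
        paths = ", ".join(f"'{p}'" for p in subdirectories)
        stmt += f", [{paths}]"
    return stmt + ");"


# Hypothetical table and path, for illustration only.
print(refresh_call("myproject.mydataset.sales"))
print(refresh_call("myproject.mydataset.sales", ["gs://my-bucket/sales/2024/*"]))
```

A pipeline could run the resulting statement as its final step, so that the cache is refreshed immediately after new files are written.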

If you issue two manual refreshes concurrently, only one succeeds.
The metadata cache expires after 7 days if it isn't refreshed.
Both manual and automatic cache refreshes run with INTERACTIVE query priority.

If you choose automatic refreshes, Google recommends that you create a reservation and then create an assignment with a BACKGROUND job type for the project that runs the metadata cache refresh jobs. This prevents the refresh jobs from competing with user queries for resources, and from failing when there aren't enough resources available.

Before you set the staleness interval and the metadata caching mode, consider how they interact. Consider the following examples:

If you refresh the metadata cache manually and set the staleness interval to 2 days, you must run the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure every 2 days or less if you want operations against the table to use cached metadata.
If you refresh the metadata cache automatically and set the staleness interval to 30 minutes, some operations against the table might read from the data store when the refresh lands at the longer end of the usual 30 to 60 minute window.
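The interaction in these two examples can be written as a simple check. This is an illustrative model only: the 60-minute worst case comes from the usual 30 to 60 minute automatic refresh window, and the function name and mode strings are hypothetical.

```python
from datetime import timedelta

# Upper end of the usual automatic refresh window.
AUTO_REFRESH_WORST_CASE = timedelta(minutes=60)


def may_read_from_datastore(cache_mode, max_staleness, manual_refresh_every=None):
    """Return True if some operations could miss the cache under this setup."""
    if cache_mode == "AUTOMATIC":
        return AUTO_REFRESH_WORST_CASE > max_staleness
    if cache_mode == "MANUAL":
        if manual_refresh_every is None:
            raise ValueError("manual_refresh_every is required for MANUAL mode")
        return manual_refresh_every > max_staleness
    raise ValueError("cache_mode must be 'AUTOMATIC' or 'MANUAL'")


# Manual refresh every day with a 2-day staleness interval stays on the cache.
print(may_read_from_datastore("MANUAL", timedelta(days=2), timedelta(days=1)))
# Automatic refresh with a 30-minute staleness interval can miss the cache.
print(may_read_from_datastore("AUTOMATIC", timedelta(minutes=30)))
```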

Cache-enabled tables with materialized views

You can use materialized views over BigLake metadata cache-enabled tables to improve performance and efficiency when querying structured data stored in Cloud Storage or Amazon S3. These materialized views behave like materialized views over BigQuery-managed storage tables, including the benefits of automatic refresh and smart tuning.

Integrations

BigLake tables are accessible from a number of other BigQuery features and gcloud CLI services, including the following.

Analytics Hub

BigLake tables are compatible with Analytics Hub. Datasets that contain BigLake tables can be published as Analytics Hub listings. Analytics Hub subscribers can subscribe to these listings, which provision a read-only linked dataset in their project. Subscribers can query all tables in the linked dataset, including BigLake tables.

BigQuery ML

You can use BigQuery ML to train and run models on BigLake tables in Cloud Storage.

Sensitive Data Protection

Sensitive Data Protection scans your BigLake tables to identify and classify sensitive data. If sensitive data is detected, Sensitive Data Protection de-identification transformations can mask, delete, or otherwise obscure that data.