BigQuery Tables For Apache Iceberg Optimize Open Lakehouse Storage

 

BigQuery Tables

BigQuery tables for Apache Iceberg are designed to optimize storage for the open lakehouse. For several years, BigQuery native tables have supported enterprise-level data management features, including streaming ingestion, ACID transactions, and automated storage optimizations. At the same time, many BigQuery customers store data in data lakes using open-source file formats like Apache Parquet and table formats like Apache Iceberg.


In 2022, Google Cloud launched BigLake tables, which let users keep a single copy of their data while benefiting from BigQuery's speed and security. However, BigLake tables are currently read-only, so BigQuery customers must perform data modifications with external query engines and schedule data maintenance manually. Another challenge is the "small files problem" during ingestion: because cloud object storage does not support appends, table writes must be micro-batched, which forces trade-offs between efficiency and data integrity.

To address these challenges, Google Cloud has introduced BigQuery tables for Apache Iceberg, a fully managed, Apache Iceberg-compatible storage engine from BigQuery that delivers features like clustering, high-throughput streaming ingestion, and autonomous storage optimizations. These tables offer the same feature set and user experience as BigQuery native tables, but store data in the Apache Iceberg format in customer-owned cloud storage buckets. With BigQuery tables for Apache Iceberg, Google is bringing ten years of BigQuery advancements to the lakehouse.

BigQuery tables for Apache Iceberg can be written from BigQuery using GoogleSQL data manipulation language (DML), and BigQuery's Write API enables high-throughput streaming ingestion from open-source engines like Apache Spark. Here is an example of creating a clustered table:

CREATE TABLE mydataset.taxi_trips
CLUSTER BY vendor_id, pickup_datetime
WITH CONNECTION us.myconnection
OPTIONS (
  storage_uri = 'gs://mybucket/taxi_trips',
  table_format = 'ICEBERG',
  file_format = 'PARQUET'
)
AS SELECT * FROM `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2020`;
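
As a rough sketch of the DML path mentioned above, the statements below modify the mydataset.taxi_trips table created in the preceding example. Only vendor_id and pickup_datetime appear in the original statement; the other column names (passenger_count, trip_distance), the assumed column types, and all literal values are assumptions for illustration and should be adjusted to the table's actual schema.

```sql
-- Insert a single row; columns that are not listed are left NULL.
-- Assumes vendor_id is a STRING column and passenger_count is an INT64 column.
INSERT INTO mydataset.taxi_trips (vendor_id, pickup_datetime, passenger_count)
VALUES ('2', '2020-01-15 08:30:00', 1);

-- Correct a placeholder value in place.
UPDATE mydataset.taxi_trips
SET passenger_count = 1
WHERE passenger_count = 0;

-- Remove rows with an impossible negative trip distance.
DELETE FROM mydataset.taxi_trips
WHERE trip_distance < 0;
```

Each statement runs as an ordinary BigQuery query job, so no external engine is needed to apply these changes.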

Fully managed enterprise storage for the lakehouse

How BigQuery tables for Apache Iceberg address the shortcomings of open-source table formats

BigQuery tables for Apache Iceberg overcome several shortcomings of open-source table formats. BigQuery takes care of table-maintenance tasks automatically, with no customer effort required. To keep tables optimized, BigQuery automatically re-clusters data, garbage-collects unneeded files, and merges small files into appropriately sized ones.

For instance, optimal file sizes are determined adaptively based on the size of the table. BigQuery tables for Apache Iceberg draw on more than a decade of experience in managing automatic storage optimization for BigQuery native tables effectively and economically. There is no need to run OPTIMIZE or VACUUM manually.
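
One practical consequence of automatic re-clustering is that queries filtering on the clustering columns should keep benefiting from the table's layout as new data arrives. The query below is an illustrative sketch against the mydataset.taxi_trips table from the earlier CREATE TABLE example; the filter values and the assumption that vendor_id is a STRING column are hypothetical.

```sql
-- Filter on the clustering columns in their declared order
-- (vendor_id first, then pickup_datetime).
SELECT
  COUNT(*) AS trip_count
FROM mydataset.taxi_trips
WHERE vendor_id = '2'                      -- assumes vendor_id is a STRING column
  AND pickup_datetime >= '2020-01-01'
  AND pickup_datetime < '2020-02-01';
```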

For high-throughput streaming ingestion, BigQuery tables for Apache Iceberg use Vortex, an exabyte-scale structured storage system that powers the BigQuery Storage Write API. Recently ingested tuples are stored in a row-oriented format and periodically converted to Parquet. The open-source Spark and Flink BigQuery connectors enable high-throughput ingestion and parallel reads. By feeding data into BigQuery tables for Apache Iceberg with Pub/Sub and Datastream, you can avoid maintaining custom infrastructure.

Benefits of BigQuery tables for Apache Iceberg


Table metadata is stored in BigQuery's scalable metadata management system. BigQuery maintains fine-grained metadata and manages it using distributed query processing and data management techniques. Because they are not constrained by the need to commit metadata to object storage, BigQuery tables for Apache Iceberg can sustain a higher rate of mutations than open-source table formats allow. And because writers cannot directly modify the transaction log, the table metadata is tamper-proof and has a reliable audit history.

BigQuery tables for Apache Iceberg continue to support the fine-grained security policies enforced by the storage APIs, while Dataplex extends support for governance policy management, data quality, and end-to-end lineage.

BigQuery tables for Apache Iceberg export their metadata into Iceberg snapshots in Cloud Storage. The pointer to the latest exported metadata will soon be registered in BigQuery metastore, a serverless runtime metadata service that was announced earlier this year. Any engine that understands Iceberg can then query the data directly from Cloud Storage using the exported Iceberg metadata.
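
For readers who want to trigger a metadata export on demand, the BigQuery documentation describes an EXPORT TABLE METADATA statement. The sketch below applies it to the mydataset.taxi_trips table from the earlier example; treat the exact syntax as an assumption and check it against the current documentation before relying on it.

```sql
-- Assumption: statement syntax as described in BigQuery's documentation for
-- Iceberg tables; verify against the current docs before use.
EXPORT TABLE METADATA FROM mydataset.taxi_trips;
```

The exported Iceberg metadata is expected to land under the table's storage_uri (gs://mybucket/taxi_trips in the earlier example), where an Iceberg-aware engine can pick it up.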

Learn more

Customers such as HCA Healthcare, one of the largest healthcare organizations in the world, see value in using BigQuery tables for Apache Iceberg as an Apache Iceberg-compatible BigQuery storage layer, which opens up new lakehouse use cases. BigQuery tables for Apache Iceberg are now available in preview in all Google Cloud regions.
