BigQuery Omni Reduces Multi-cloud Log Analysis, Ingestion Costs

 

Azure BigQuery Omni

BigQuery Omni: What is it?

BigQuery Omni is a multi-cloud data analytics solution that lets you run BigQuery analytics on data stored in Amazon Simple Storage Service (Amazon S3) or Azure Blob Storage using BigLake tables. It provides a unified interface for analyzing data across several public clouds without moving it, so you can learn from your data regardless of where it is kept.

Many companies keep their data on multiple public clouds. This data often remains siloed, which makes it hard to extract insights from all of it. To analyze it, you need a multi-cloud data tool that is fast, inexpensive, and does not add to the cost of decentralized data governance. BigQuery Omni reduces these frictions with a single interface.

To run BigQuery analytics on your external data, you first connect to Amazon S3 or Blob Storage. To query the external data, you then create a BigLake table that references data in Amazon S3 or Blob Storage.
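As a minimal sketch (the project, dataset, connection ID, and bucket path below are illustrative placeholders, not values from this post), a BigLake table over JSONL logs in Amazon S3 might be created like this:

CREATE EXTERNAL TABLE `my-project.aws_logs_dataset.raw_app_logs`
  WITH CONNECTION `aws-us-east-1.my-aws-connection`  -- connection holding the AWS credentials
  OPTIONS (
    format = 'NEWLINE_DELIMITED_JSON',               -- JSONL log files; schema is auto-detected
    uris = ['s3://my-log-bucket/app-logs/*']
  );

The dataset must live in the matching BigQuery Omni region (for example, aws-us-east-1); an equivalent statement with an Azure connection works for Blob Storage.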

Other options include cross-cloud joins and cross-cloud transfer, which allow data to be queried and copied across clouds. In other words, BigQuery Omni offers two cross-cloud analytics options: analyze data where it is stored, or duplicate data where required.
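For instance, a cross-cloud transfer can be expressed as a CREATE TABLE AS SELECT that copies a filtered slice of an Omni BigLake table into a standard BigQuery dataset. The table names, columns, and filter below are hypothetical:

CREATE TABLE `my-project.us_dataset.recent_error_logs` AS
SELECT *
FROM `my-project.aws_logs_dataset.raw_app_logs`  -- BigLake table in an AWS region
WHERE severity = 'ERROR'
  AND timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY);

Only the filtered result crosses the cloud boundary, which keeps egress to a minimum.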

BigQuery Omni on Google

In today's data-centric businesses, it is not uncommon to run hundreds of distinct applications across many platforms. The massive volume of logs these applications produce presents a significant challenge for log analytics. The growing use of multi-cloud architectures complicates matters further: because the logs are distributed, extracting useful insights from them becomes harder.

BigQuery Omni was developed to help resolve this issue and reduce overall costs compared with a conventional approach. In this blog article, we discuss the details.

There are many stages involved in log analysis, including:

Collecting log data: Log data is collected from the enterprise's infrastructure and/or applications. A common approach is to save this data in JSONL format in an object storage service, such as Google Cloud Storage. In a multi-cloud architecture, moving raw log data across clouds can be prohibitively expensive.

Log data normalization: Different applications and infrastructure components generate different JSONL files, each containing fields unique to the infrastructure or program that created it. These disparate schemas are merged into a single common one to facilitate analysis, allowing data analysts to run comprehensive and efficient assessments across the environment.

Indexing and storage: Normalized data should be efficiently stored to save storage and query costs and enhance query performance. A compressed columnar file format, like Parquet, is often used to store logs.

Querying and visualization: Give businesses the ability to use analytics queries to look for known threats, anomalies, or anti-patterns in the log data (a sketch of such a query follows this list).

Data lifecycle: As log data ages, its value decreases even while storage costs remain constant. Costs must be minimized by establishing a data lifecycle process. Logs are typically deleted after a year and archived after a month (it is uncommon to query log data older than a month). This approach effectively manages storage costs while guaranteeing that important data is always accessible.
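To make the querying stage concrete, here is a hedged sketch of the kind of analytics query it implies. The table name and fields are hypothetical, and the ingested_date filter restricts the scan to recent, non-archived partitions:

SELECT severity, COUNT(*) AS events
FROM `my-project.logs_dataset.normalized_logs`
WHERE ingested_date >= FORMAT_TIMESTAMP('%Y-%m-%dT%H:00:00Z',
        TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY))  -- only the last month of partitions
GROUP BY severity
ORDER BY events DESC;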

A shared architecture

This design has both benefits and drawbacks.

On the positive side:


Data lifecycle: Data lifecycle management can be accomplished simply by using the built-in capabilities of object storage systems. For example, with Cloud Storage you can define a lifecycle policy such as: (a) delete any object older than a week, which removes the JSONL files produced during the collection stage; (b) archive any object older than a month, which moves your Parquet files to an archive storage class; and (c) delete any object older than a year, which removes your Parquet files.

Reduced egress costs: You may avoid sending a lot of raw data back and forth between cloud providers by keeping the data locally.

On the negative side:

Log data normalization: For each application whose logs you collect, you must develop and maintain an Apache Spark workload. This is best avoided at a time when (a) engineers are hard to come by and (b) the use of microservices is growing rapidly.

Querying: If your data is distributed among several cloud providers, your ability to analyze and visualize it as a whole is limited.

Querying: Excluding archived files written earlier in the data lifecycle is not straightforward with WHERE clauses alone, and avoiding partitions that contain archived files relies on error-prone manual work. One alternative is to use an Iceberg table and manage the table's manifest, adding and removing partitions as needed. Manipulating the Iceberg manifest by hand is challenging, however, and relying on a third-party solution simply drives up costs.

BigQuery Omni, which is shown in the architecture below, would be a superior solution to all of these problems.

The fundamental benefit of this approach is that it eliminates the need for software developers to build and manage many Spark workloads. Another benefit of this approach is that, apart from storage and visualization, a single product (BigQuery) handles the whole process. You also benefit from cost reductions. Each of these aspects will be discussed in further depth below.

A simplified normalization process

One useful capability of BigQuery is schema auto-detection: it can automatically infer the schema of JSONL files and expose them through an external table. This is especially helpful when dealing with many different log schemas. Any application's JSONL content can be accessed with a simple CREATE EXTERNAL TABLE statement, similar to the one sketched earlier.

From there, you can configure BigQuery to export the JSONL external table as compressed Parquet files, partitioned into hourly segments using a Hive-style layout. The query below shows an EXPORT DATA statement that can be scheduled to run once per hour. Its SELECT statement captures only the log data ingested in the last hour and converts it into Parquet files with normalized fields.

DECLARE hour_ago_rounded_string STRING;
DECLARE hour_ago_rounded_timestamp DEFAULT TIMESTAMP_TRUNC(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR), HOUR);
SET hour_ago_rounded_string = FORMAT_TIMESTAMP("%Y-%m-%dT%H:00:00Z", hour_ago_rounded_timestamp, "UTC");

EXPORT DATA OPTIONS (
  uri = CONCAT('[MY_BUCKET_FOR_PARQUET_FILES]/ingested_date=', hour_ago_rounded_string, '/logs-*.parquet'),
  format = 'PARQUET',
  compression = 'GZIP',
  overwrite = true
)
AS (
  SELECT [MY_NORMALIZED_FIELDS] EXCEPT (ingested_date)  -- normalized field list; ingested_date is carried in the Hive-style path
  FROM [MY_JSONL_EXTERNAL_TABLE] AS jsonl_table
  WHERE TIMESTAMP_TRUNC(jsonl_table.timestamp, HOUR) = hour_ago_rounded_timestamp
);

A consistent querying process for all cloud service providers

While querying is already improved by using the same data warehouse platform across several cloud providers, BigQuery Omni's cross-cloud join capability is revolutionary for log analytics. Before BigQuery Omni, combining log data from multiple cloud providers was challenging: because of the volume of data, sending the raw data to a single primary cloud provider incurs significant egress costs, while pre-processing and filtering it first limits the analytics you can run on it. With cross-cloud joins, you can run a single query across multiple clouds and analyze the results.
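As an illustration (all dataset, table, and column names here are invented for the example), a cross-cloud join is written like any other SQL join, even though one table lives in AWS and the other in Azure:

SELECT
  aws_logs.request_id,
  COUNT(*) AS error_count
FROM `my-project.aws_logs_dataset.normalized_logs` AS aws_logs      -- BigLake table over S3
JOIN `my-project.azure_logs_dataset.normalized_logs` AS azure_logs  -- BigLake table over Blob Storage
  ON aws_logs.request_id = azure_logs.request_id
WHERE aws_logs.severity = 'ERROR'
GROUP BY aws_logs.request_id;

Only the data needed to evaluate the join crosses clouds, not the full raw log sets.
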
Lowers TCO

The last and most important benefit of this design is its capacity to reduce total cost of ownership (TCO). This may be measured in three ways:

Reduced engineering resources: This process eliminates Apache Spark, which helps in two ways. First, you do not need software engineers to write and maintain Spark code; the log analytics team can speed up deployment by using standard SQL queries. Second, BigQuery and BigQuery Omni are PaaS offerings, extending the shared-responsibility model to data stored in AWS and Azure.

Reduced computational resources: Apache Spark may not always provide the most cost-effective environment. An Apache Spark solution consists of the virtual machines (VMs), the Apache Spark platform, and the application itself. BigQuery instead uses slots (virtual CPUs rather than virtual machines), and the export query is converted into C-compiled code during the export procedure, which can result in faster performance for this specific operation.
Reduced egress costs: By processing data in-situ and egressing only results via cross-cloud joins, BigQuery Omni removes the need to move raw data across cloud providers in order to get a consolidated picture of the data.

How should BigQuery be used in this situation?

For query execution, BigQuery offers two compute price models:

Pricing on demand (per TiB): Under this model you are charged based on the number of bytes each query processes, with the first 1 TiB of query data processed each month free. Because log analytics jobs scan large volumes of data, this approach is not recommended here.

Capacity pricing (per slot-hour): In this pricing model, you are charged for the number of slots (virtual CPUs) of processing power used to run queries over time. This model uses BigQuery editions and lets you leverage the BigQuery autoscaler. Slot commitments, which are committed capacity that is always available for your workloads, cost less than on-demand pricing.

As an empirical test, Google allocated 100 slots (baseline 0, maximum 100) to a project that converts log JSONL data into compressed Parquet format. With this setup, BigQuery handled 1 PB of data per day without exhausting all 100 slots.
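If you want to reproduce a similar setup, a reservation along these lines can be created with BigQuery's reservation DDL. Treat the option names (edition, slot_capacity, autoscale_max_slots) and the admin project and region below as assumptions to verify against the current documentation:

CREATE RESERVATION `my-admin-project.region-us.log-analytics`
OPTIONS (
  edition = 'ENTERPRISE',
  slot_capacity = 0,           -- baseline slots
  autoscale_max_slots = 100    -- autoscaling ceiling, matching the test above
);
-- A project is then assigned to the reservation (for example, with a CREATE ASSIGNMENT statement or in the console).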

In this blog post, we suggested an architecture that replaces Apache Spark applications with SQL queries running on BigQuery Omni to enable TCO reduction for log analytics workloads in a multi-cloud setting. The ability of this approach to reduce overall DevOps complexity while cutting engineering, compute, and egress expenditures may be advantageous for your specific data context.
Pricing for BigQuery Omni

For information on prices and limited-time offers, please see BigQuery Omni pricing.
