BigQuery history-based optimizations improve query performance


Use BigQuery history-based optimizations to improve query speed by up to 100x.

To get the answers they need, users flood the data warehouse with a wide variety of queries. Some of those queries require significant processing power, and many are quite similar to one another. To speed up query performance, Google Cloud developed BigQuery history-based optimizations, a query optimization technique that leverages previous executions of similar queries to identify and apply additional query execution improvements. By making queries faster and/or lowering the resources they consume, these optimizations can improve performance metrics such as elapsed time, slot time, and bytes processed.

Because it uses real, historical data from previous executions of comparable queries, BigQuery history-based optimization can safely identify high-reward improvements that are well suited to a given workload.

During the public beta of BigQuery history-based optimization, Google saw instances where query performance on customer workloads improved by up to 100x. These improvements are now generally available.

This blog article describes the technology behind BigQuery history-based optimization, its benefits, and how to get started.

About history-based optimizations

BigQuery history-based optimization uses information from earlier executions of comparable queries to apply further optimizations and enhance query performance, including slot time consumed and query latency. For example, the first execution of a query may take 60 seconds; if a history-based optimization is discovered, the second run of the same query may take only 30 seconds. This process continues until no further optimizations are found.

Below is an example of how history-based optimizations using BigQuery work:

Execution count | Query slot time consumed | Notes
1 | 60 | Original execution.
2 | 30 | First history-based optimization applied.
3 | 20 | Second history-based optimization applied.
4 | 21 | No additional history-based optimizations to apply.
5 | 19 | No additional history-based optimizations to apply.
6 | 20 | No additional history-based optimizations to apply.


History-based optimizations are applied only when there is high confidence that query performance will improve. Furthermore, if an applied optimization does not deliver a substantial improvement, it is revoked and not used in subsequent runs of the query.

It just works

BigQuery history-based optimization was developed by Google Cloud to deliver improvements without needing user input. The infrastructure is self-correcting and self-tuning, so no additional setup or operations are needed on your part.

After each query execution, BigQuery analyzes statistical data to identify workload-specific improvements that can be applied the next time a comparable query runs. This background operation has no impact on user workloads. Once a subsequent execution demonstrates that an optimization is beneficial, further improvements may be discovered and put in place, gradually and iteratively improving query performance.

In the rare case that an optimization does not significantly improve (or even degrades) query performance, BigQuery history-based optimization is resilient: it self-corrects and avoids applying that same optimization to future runs of the query. No involvement is required on your part!

Matching queries

BigQuery uses a sophisticated query-matching approach to increase the number of queries that can share discovered improvements, while reducing the risk of applying them to drastically different queries.

Trivial query modifications, such as altering comments or whitespace, reuse all previously discovered optimizations, since these changes have no effect on how queries are executed.

Nontrivial modifications, such as changing the values of constants or query parameters, may reuse some previously discovered optimizations and reject others, depending on whether BigQuery is confident that a specific optimization would still benefit the modified query.

If the query is substantially changed, for example by querying a new source table or new columns, BigQuery ignores any previously discovered optimizations, because the modified query most likely no longer matches the statistical data that was used to establish them.

Note that BigQuery history-based optimization is currently scoped to a single Google Cloud project.
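As a sketch of these matching rules, consider the following query variants (the project, dataset, and table names are hypothetical placeholders, not part of the original post):

```sql
-- Original query: optimizations are discovered from its executions.
SELECT store_id, SUM(amount) AS total
FROM `my-project.sales.orders`          -- hypothetical table
WHERE order_date >= '2024-01-01'
GROUP BY store_id;

-- Trivial change (comments/whitespace only): all previously
-- discovered optimizations are reused.
SELECT store_id, SUM(amount) AS total   -- same execution plan
FROM `my-project.sales.orders`
WHERE order_date >= '2024-01-01'
GROUP BY store_id;

-- Nontrivial change (different constant): some optimizations may be
-- reused and others rejected, depending on BigQuery's confidence.
SELECT store_id, SUM(amount) AS total
FROM `my-project.sales.orders`
WHERE order_date >= '2024-06-01'
GROUP BY store_id;

-- Substantial change (new source table): previously discovered
-- optimizations are ignored.
SELECT store_id, SUM(amount) AS total
FROM `my-project.sales.refunds`
GROUP BY store_id;
```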

Types of optimizations

Currently, BigQuery history-based optimization supports four categories of optimizations that build on existing optimization techniques.

Join pushdown

Join pushdown aims to reduce the amount of data processed by executing highly selective joins first.

After a query finishes executing, BigQuery can identify highly selective joins, where the output of the join (in number of rows) is much smaller than its input. If the selective join runs after less selective operations, such as aggregations or other joins, BigQuery can choose to execute it first. This effectively "pushes down" the join in the execution order, as seen in the figure below. Reducing the amount of data the rest of the query has to process can save resources and improve query performance.

Example of a join pushdown history-based optimization: the selective join with table T3 is pushed down in the execution order to minimize the number of intermediate rows.

While BigQuery already performs join pushdown for certain queries without statistics from past executions, history-based join pushdown leverages additional data-distribution information to improve efficiency further and apply the optimization to more queries.
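A hypothetical query shape where join pushdown can help might look like this (all table and column names are illustrative, not from the original post):

```sql
-- A large fact table joined against two dimension tables. The filter on
-- dim_stores makes that join highly selective.
SELECT p.category, SUM(f.revenue) AS total_revenue
FROM `my-project.warehouse.fact_sales` AS f   -- large fact table
JOIN `my-project.warehouse.dim_products` AS p -- less selective join
  ON f.product_id = p.product_id
JOIN `my-project.warehouse.dim_stores` AS s   -- selective join (like T3)
  ON f.store_id = s.store_id
WHERE s.country = 'CH'                        -- discards most fact rows
GROUP BY p.category;

-- If past executions show the dim_stores join keeps only a small
-- fraction of rows, BigQuery can reorder the plan to run it before the
-- dim_products join, so far fewer intermediate rows flow through the
-- rest of the query.
```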

Semijoin reduction

Semijoin reduction aims to reduce the amount of data BigQuery scans by inserting selective semijoin operations elsewhere in the query.

Similar to join pushdown, BigQuery can identify a highly selective join when a query has multiple parallel execution paths that are eventually joined together. Based on the selective join, BigQuery can insert new "semijoin" operations that reduce the amount of data scanned and processed by the other parallel execution paths, as seen in the figure below. The logical model for this is a semijoin, though in practice partition pruning and other internal optimizations are typically used instead.

Example of a semijoin reduction history-based optimization: a new semijoin derived from a selective join is inserted into the query, reducing the number of rows processed from the second fact table, F2.
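Conceptually, semijoin reduction behaves as if BigQuery had rewritten the query to filter the second fact table early. A sketch with hypothetical table names (in practice BigQuery applies this internally, via partition pruning and filters, rather than as a literal SQL rewrite):

```sql
-- Original shape: two fact tables (F1, F2) are scanned in parallel and
-- then joined; the join with the small dimension table D is selective.
SELECT f1.order_id, f2.shipment_date
FROM `my-project.warehouse.fact_f1` AS f1
JOIN `my-project.warehouse.fact_f2` AS f2 USING (order_id)
JOIN `my-project.warehouse.dim_d` AS d
  ON f1.customer_id = d.customer_id
WHERE d.segment = 'enterprise';               -- highly selective

-- Logically equivalent rewrite: the scan of F2 is reduced by a
-- semijoin against the selective F1/D result, so F2 rows that can
-- never match are never processed.
SELECT f2.order_id, f2.shipment_date
FROM `my-project.warehouse.fact_f2` AS f2
WHERE f2.order_id IN (
  SELECT f1.order_id
  FROM `my-project.warehouse.fact_f1` AS f1
  JOIN `my-project.warehouse.dim_d` AS d
    ON f1.customer_id = d.customer_id
  WHERE d.segment = 'enterprise'
);
```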

Join commutation

Join commutation aims to reduce the amount of resources consumed by flipping the left and right sides of a join operation.

When executing a join, the two sides are treated differently: for example, BigQuery may build a hash table from one side of the join, then probe it while scanning the other side. Because the two sides of a join can be commutative (the order does not affect the result), BigQuery may determine that, in certain circumstances, flipping the two sides would reduce the resources needed to perform that join operation.
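The choice of sides is not expressible in SQL, but the effect can be sketched with a hypothetical join (table names are illustrative):

```sql
-- BigQuery picks one side as the hash-table "build" side and streams
-- ("probes") the other side against it during execution.
SELECT o.order_id, c.customer_name
FROM `my-project.sales.orders` AS o     -- large input
JOIN `my-project.sales.customers` AS c  -- much smaller input
  ON o.customer_id = c.customer_id;

-- If history shows the initially chosen build side is actually the
-- larger input, join commutation flips the sides on the next run:
-- building the hash table from the smaller customers input and probing
-- with orders consumes fewer resources, while the result is unchanged
-- because inner joins are commutative.
```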

Parallelism adjustment

Parallelism adjustment aims to lower query latency by parallelizing work more efficiently.

BigQuery uses a distributed parallel architecture to execute queries in stages. For each of these stages, BigQuery chooses an initial degree of parallelism, but it can dynamically adjust that level during execution in response to observed patterns in the data.

With history-based optimizations, BigQuery can now adjust the initial level of parallelism based on the known workload distribution from the query's previous executions. This lets BigQuery parallelize work more efficiently, reduce overhead, and achieve lower latency, particularly for queries with large, compute-intensive stages.

Try it out

History-based optimizations are now generally available. Over the coming months they will be enabled by default for all customers, but you can enable them earlier for your project or organization. You can also inspect INFORMATION_SCHEMA to see whether any BigQuery history-based optimizations were applied and how they affected your workloads.
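To opt a project in early, the BigQuery documentation describes an ALTER PROJECT statement along these lines (the project ID is a placeholder, and the region key should match where your data lives; check the current docs for the exact option name in your region):

```sql
-- Enable history-based (adaptive) optimizations for one project
-- in the US multi-region.
ALTER PROJECT `my-project`
SET OPTIONS (
  `region-us.default_query_optimizer_options` = 'adaptive=on'
);
```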

You may try history-based optimizations with this example query:

WITH
  january_first_views AS (
    SELECT * FROM `bigquery-public-data.wikipedia.pageviews_2024`
    WHERE TIMESTAMP_TRUNC(datehour, DAY) = TIMESTAMP('2024-01-01')
  ),
  february_first_views AS (
    SELECT * FROM `bigquery-public-data.wikipedia.pageviews_2024`
    WHERE TIMESTAMP_TRUNC(datehour, DAY) = TIMESTAMP('2024-02-01')
  ),
  march_first_views AS (
    SELECT * FROM `bigquery-public-data.wikipedia.pageviews_2024`
    WHERE TIMESTAMP_TRUNC(datehour, DAY) = TIMESTAMP('2024-03-01')
  ),
  april_first_views AS (
    SELECT * FROM `bigquery-public-data.wikipedia.pageviews_2024`
    WHERE TIMESTAMP_TRUNC(datehour, DAY) = TIMESTAMP('2024-04-01')
  ),
  average_views AS (
    SELECT
      j.title,
      SUM(j.views + f.views + m.views + a.views) / 4 AS average_views
    FROM january_first_views AS j
    INNER JOIN february_first_views AS f USING (wiki, title)
    INNER JOIN march_first_views AS m USING (wiki, title)
    INNER JOIN april_first_views AS a USING (wiki, title)
    GROUP BY title
  )

SELECT title, average_views.average_views
FROM average_views
INNER JOIN `bigquery-public-data.wikipedia.wikidata` AS wikidata
  ON wikidata.en_wiki = average_views.title
WHERE wikidata.en_label = 'Google';

Run the demo query again and compare the timing difference between the first and subsequent runs.

Optionally, you can check whether history-based optimizations were applied to each run by following the documented steps, using the job IDs of each execution.
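A query along these lines retrieves the optimization details for a given job from INFORMATION_SCHEMA (the region qualifier and job ID are placeholders; verify the field names against the current INFORMATION_SCHEMA.JOBS documentation):

```sql
-- Inspect which history-based optimizations, if any, were applied
-- to a specific job in this project.
SELECT job_id, query_info.optimization_details
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_id = 'my_job_id';  -- placeholder job ID
```

If optimization_details is NULL, no history-based optimizations were recorded for that run.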

Looking ahead

Rather than just a static set of four new optimizations, BigQuery history-based optimization provides a framework for ongoing investment in BigQuery's optimization capabilities. This technology makes feasible a new class of optimizations that would be impossible to carry out without real statistical data, not merely estimates, from earlier executions of similar queries. Its ability to learn and self-correct serves as a buffer against unanticipated issues.

By extending existing and planned history-based optimizations and enabling BigQuery to adapt automatically to changing workloads, this work aims to make BigQuery smarter. You won't have to do anything; queries will simply run faster and more efficiently. Try it out by following the steps above, and let us know what you think through the BigQuery console.



