Gluten, Intel CPUs Improve Apache Spark SQL

Using Gluten and Intel CPUs may enhance Spark's performance.

The tools and platforms that companies use to analyze the ever-growing volumes of data flowing in from devices, customers, websites, and more are more important than ever. Because big data analytics delivers insights that are both time-sensitive and business-critical, efficiency and performance are essential.

Big data analytics workloads on Apache Spark SQL often run continuously and need top performance to shorten time to insight. Companies can therefore justify spending a little more overall if it delivers a better return on their investment. The previous blog post examined Spark SQL performance on Google Cloud instances.

Scalable Data Science Is Made Possible by Spark

Businesses use Apache Spark extensively for large-scale SQL, batch and stream processing, and machine learning and other AI applications. To support data science at scale, Spark takes a distributed approach, spreading data across many machines in a cluster. This distribution adds some overhead, because the data for each query has to be located. Query speed is a crucial aspect of any Spark workload because faster queries mean faster business decisions, and this is especially true for workloads that involve machine learning training.

Using Gluten to Accelerate Spark

Although Spark is a useful tool for speeding up and simplifying big data processing, businesses have been developing ways to enhance it. One such initiative is Gluten, the Spark SQL execution engine from Intel's Optimized Analytics Package (OAP), which offloads compute-intensive data processing to native accelerator libraries.

Gluten uses Velox, Meta's open-source C++ database acceleration library, which provides a vectorized SQL processing engine for enhancing query engines and data processing systems. Gluten itself is a Spark plugin, described as "a middle layer responsible for offloading the execution of JVM-based SQL engines to native engines." By using the Apache Gluten plugin together with Intel CPU accelerators, users can greatly improve the performance of their Spark applications.

Gluten works by transforming Spark query execution plans into Substrait, a cross-language standard for data processing plans, and then passing the converted plans to native libraries through a JNI call. The native engine builds and runs the execution pipeline, loads and processes the data, and manages native memory allocation, then returns the results to Gluten as a Columnar Batch. Gluten then sends the data back to the Spark JVM as an ArrowColumnarBatch.
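To make the term "columnar batch" concrete, here is a minimal pure-Python sketch. It is not Gluten, Velox, or Arrow code. It contrasts a row-oriented layout with a column-oriented batch and runs a filter-and-sum over the columns. The function names `rows_to_columnar_batch` and `sum_where` and the sample data are hypothetical and chosen only for illustration.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]
ColumnarBatch = Dict[str, List[Any]]


def rows_to_columnar_batch(rows: List[Row]) -> ColumnarBatch:
    """Pivot a list of row dicts into one list per column."""
    if not rows:
        return {}
    return {name: [row[name] for row in rows] for name in rows[0]}


def sum_where(batch: ColumnarBatch, value_col: str,
              filter_col: str, threshold: float) -> float:
    """Sum value_col over positions where filter_col exceeds threshold.

    Only the two referenced columns are read; the other columns in the
    batch are never touched.
    """
    values = batch[value_col]
    flags = batch[filter_col]
    return sum(v for v, f in zip(values, flags) if f > threshold)


# Hypothetical sample data.
rows = [
    {"order_id": 1, "quantity": 5, "price": 10.0},
    {"order_id": 2, "quantity": 1, "price": 99.0},
    {"order_id": 3, "quantity": 8, "price": 2.5},
]
batch = rows_to_columnar_batch(rows)
total = sum_where(batch, "price", "quantity", 2)
print(batch["quantity"])  # [5, 1, 8]
print(total)              # 12.5
```

Storing each column contiguously is what allows a native engine to process many values per operation, although a real engine works on typed memory buffers and not on Python lists.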

Gluten uses a fallback mechanism that runs unsupported operators on vanilla Spark, and a shim layer to support multiple Spark versions. It also collects metrics from the native engine and displays them in the Spark UI.
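The offload-with-fallback idea can be sketched as a toy planner. This is a conceptual illustration only: the operator names, the `NATIVE_SUPPORTED` set, and the functions `assign_engines` and `count_transitions` are hypothetical and do not reflect Gluten's actual classes or its real list of supported operators.

```python
from typing import List, Tuple

# Hypothetical set of operators that the native engine can run.
NATIVE_SUPPORTED = {"Scan", "Filter", "Project", "HashAggregate", "HashJoin"}


def assign_engines(plan: List[str]) -> List[Tuple[str, str]]:
    """Tag each operator 'native' when supported, else 'jvm' (fallback)."""
    return [(op, "native" if op in NATIVE_SUPPORTED else "jvm") for op in plan]


def count_transitions(assignment: List[Tuple[str, str]]) -> int:
    """Count how many times execution switches between engines."""
    engines = [engine for _, engine in assignment]
    return sum(1 for a, b in zip(engines, engines[1:]) if a != b)


plan = ["Scan", "Filter", "PythonUDF", "HashAggregate"]
assignment = assign_engines(plan)
print(assignment)
print(count_transitions(assignment))  # 2
```

In this toy model, every switch between engines stands for a point where data would need to move between the native and JVM sides, so a plan with fewer unsupported operators would likely keep more of the work in native code.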

The Gluten plugin reuses Spark's own architecture, control flow, and JVM code, but it offloads as much of the compute-intensive data processing as possible to native code. Because no changes are required on the query side, existing DataFrame APIs and applications continue to work as before, only faster.
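As a rough sketch of what enabling the plugin can look like, the snippet below assembles Spark configuration properties and renders them as `spark-submit` flags. The plugin class and property names are assumptions based on the Apache Gluten documentation for recent releases; older releases used a different plugin class name, so check the documentation for the version you deploy. The helper names and the off-heap size are placeholders, and the Gluten jar also has to be on the Spark classpath, which is not shown here.

```python
from typing import Dict, List


def gluten_spark_conf(off_heap_size: str = "20g") -> Dict[str, str]:
    """Return Spark properties commonly documented for enabling Gluten.

    Assumption: names follow recent Apache Gluten docs; verify them
    against the release you actually use.
    """
    return {
        "spark.plugins": "org.apache.gluten.GlutenPlugin",
        # Native engines allocate outside the JVM heap.
        "spark.memory.offHeap.enabled": "true",
        "spark.memory.offHeap.size": off_heap_size,
        "spark.shuffle.manager":
            "org.apache.spark.shuffle.sort.ColumnarShuffleManager",
    }


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render a property dict as repeated '--conf key=value' arguments."""
    args: List[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


conf = gluten_spark_conf("8g")
print(" ".join(to_submit_args(conf)))
```

Because the plugin is enabled through configuration alone, the application code and SQL text stay unchanged, which matches the point above about not modifying queries.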

Performance Improvements Were Noted

This section looks at test results that demonstrate how adding Gluten to your Spark applications can improve performance. Two workloads were tested. One, based on TPC-DS, simulates a general-purpose decision support system using 99 different database queries. The other, based on TPC-H, simulates a general-purpose decision support system using ten different database queries. On the Spark SQL cluster, the tests compared how long it took a single user to complete each query once.
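A single-user, run-each-query-once measurement can be sketched as follows. This is a minimal harness under stated assumptions, not the methodology used in these tests: the function names are hypothetical, the "queries" are stand-in Python callables, and computing the speedup as the ratio of total run times is an assumption, since the article does not say how results were aggregated.

```python
import time
from typing import Callable, Dict


def time_queries(queries: Dict[str, Callable[[], object]]) -> Dict[str, float]:
    """Run each query once and record its wall-clock time in seconds."""
    timings: Dict[str, float] = {}
    for name, run in queries.items():
        start = time.perf_counter()
        run()
        timings[name] = time.perf_counter() - start
    return timings


def speedup(baseline: Dict[str, float], accelerated: Dict[str, float]) -> float:
    """Ratio of total baseline time to total accelerated time."""
    return sum(baseline.values()) / sum(accelerated.values())


# Stand-in workloads; a real harness would submit SQL to a Spark session.
queries = {
    "q1": lambda: sum(range(10_000)),
    "q2": lambda: sorted(range(1_000)),
}
timings = time_queries(queries)
print(sorted(timings))  # ['q1', 'q2']

# Hypothetical timings in seconds, used only to show the arithmetic.
print(speedup({"q1": 30.0, "q2": 60.0}, {"q1": 10.0, "q2": 20.0}))  # 3.0
```

A speedup of 3.0 here means the accelerated run finished the same set of queries in one third of the baseline time.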

4th Generation Intel Xeon Scalable Processors

Let's start by analyzing the performance impact of adding Gluten to Spark SQL on servers equipped with 4th Generation Intel Xeon Scalable Processors. The chart below shows that with Gluten enabled, performance increased by as much as 3.12 times. On the TPC-H-like workload, Gluten allowed the system to complete the ten database queries more than three times faster. On the TPC-DS-like workload, Gluten more than doubled the speed at which all 99 database queries were finished. Improvements like these mean decision-makers get answers sooner, which demonstrates the value of integrating Gluten into your Spark SQL workflows.

5th Generation Intel Xeon Scalable Processors

Now, let's examine how Gluten accelerates Spark SQL applications running on servers with 5th Generation Intel Xeon Scalable Processors. As the accompanying chart shows, the gains were even greater than on the servers with older CPUs, with speeds up to 3.34 times as high when using Gluten. If your data center has servers of this generation, incorporating Gluten into your environment can help you get more out of your equipment and shorten time to insight.

Implications for the Cloud

Although these experiments were conducted on bare-metal hardware in a data center, they suggest that Gluten may improve performance in the cloud as well. When running Spark in the cloud, you might be able to use Gluten to gain further speed improvements.

To Sum Up

Whether your Spark SQL workloads run on servers with 4th or 5th Generation Intel Xeon Scalable Processors, completing analyses quickly is crucial to your company's profitability. By offloading JVM data processing to native libraries that are tailored to the processor's instruction sets, Gluten can take advantage of the speed boost that Intel processors offer.

These experiments show that incorporating the Gluten plugin into Spark SQL workloads can double or even triple the speed at which your servers execute database queries. With up to 3.34 times the performance, Gluten may help your business optimize its data analytics workloads.