Intel FPGAs speed up databases with oneAPI and SIMD instructions


A proven strategy for improving single-threaded CPU performance is Single Instruction, Multiple Data (SIMD) processing.

FPGAs are known for high-performance computing because their circuits can be customized to a given algorithm. This tailored, optimized hardware accelerates demanding computations.

SIMD and FPGAs may seem unrelated, yet this blog article demonstrates how well they fit together. By enabling data-parallel processing, FPGAs can use SIMD to boost processing performance. The combination of FPGA adaptability and SIMD efficiency is appealing for many computationally intensive tasks.

High-performance SIMDified programming


SIMD parallel processing applies a single instruction to multiple data elements. Dedicated hardware extensions execute the same instruction on several data elements simultaneously.

SIMDified processing exploits this data independence to boost the performance of software applications: application code is rewritten to make extensive use of SIMD instructions.
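
As a minimal illustration (the function names and the AVX2 target are chosen for this sketch, not taken from any particular codebase), compare a scalar loop with its SIMDified counterpart:

    #include <immintrin.h>  // AVX2 intrinsics

    // Scalar version: one addition per loop iteration.
    void add_scalar(const float* a, const float* b, float* out, int n) {
        for (int i = 0; i < n; ++i)
            out[i] = a[i] + b[i];
    }

    // SIMDified version: eight float additions per instruction using AVX2.
    // Assumes n is a multiple of 8; real code needs a scalar tail loop.
    void add_simd(const float* a, const float* b, float* out, int n) {
        for (int i = 0; i < n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);               // load 8 floats
            __m256 vb = _mm256_loadu_ps(b + i);
            _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb)); // add and store 8 floats
        }
    }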

Key advantages of SIMDified processing include:

  • Increased performance: SIMDified processing speeds up computationally intensive software applications.
  • Integrability: intrinsics and dedicated data types make SIMDified processing straightforward to integrate into existing code.

SIMDified processing is available on many current processors, making it a viable option for improving computational speed.

Despite these benefits, SIMDified processing is not a fit for every application: workloads with little data parallelism will not benefit from it. For data-intensive software applications, however, it is a compelling optimization method.

SIMD Portability Supports Heterogeneity

SIMD instruction sets consist of SIMD registers and the instructions that operate on them. For maximum performance, SIMD intrinsics in C/C++ are the preferred low-level programming method.

Low-level programming in heterogeneous environments is difficult, however: hardware platforms, operating systems, architectures, and technologies differ in their hardware capabilities, in the degree of data parallelism they offer, and in their naming conventions.

Because such specialized implementations limit portability between platforms, SIMD abstraction libraries provide a common interface of abstract SIMD functions. These libraries use C++ template metaprogramming and function template specializations to translate the abstract functions to SIMD intrinsics, and to provide compensating implementations for functions a platform is missing.

Such C/C++ libraries let developers write SIMD-hardware-oblivious application code while the hardware-specific SIMD extension code lives in the library, with minimal overhead. Separating the two concerns through a SIMD abstraction library simplifies both sides.
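
To sketch the mechanism (a simplified, hypothetical interface, not the API of any particular library), a set of abstract operations can be specialized per instruction set:

    #include <immintrin.h>

    // Hypothetical tag types naming the target SIMD extension.
    struct avx2 {};
    struct scalar {};

    // Primary template: the abstract SIMD interface.
    template <typename Extension> struct simd_ops;

    // Specialization mapping the abstract operations to AVX2 intrinsics.
    template <> struct simd_ops<avx2> {
        using reg_t = __m256;
        static reg_t load(const float* p)     { return _mm256_loadu_ps(p); }
        static reg_t add(reg_t a, reg_t b)    { return _mm256_add_ps(a, b); }
        static void  store(float* p, reg_t v) { _mm256_storeu_ps(p, v); }
    };

    // Fallback specialization compensating for a missing extension.
    template <> struct simd_ops<scalar> {
        using reg_t = float;
        static reg_t load(const float* p)     { return *p; }
        static reg_t add(reg_t a, reg_t b)    { return a + b; }
        static void  store(float* p, reg_t v) { *p = v; }
    };

Application code is written once against simd_ops<Extension>, and the compiler selects the hardware-specific mapping at compile time.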

This approach has produced a number of SIMD libraries and abstraction layers, for example:

  • Google Highway (open-source SIMD abstraction library)
  • xsimd (C++ wrappers for SIMD intrinsics)

Such libraries allow SIMDified code to be written once and then specialized for the target SIMD platform by the abstraction library, making SIMD instructions and abstractions usable across varied design environments.
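
With Google Highway, for instance, the element-wise addition from above can be written once and compiled for whatever SIMD extension the target offers (a sketch based on Highway's Load/Add/Store operations; the function name is illustrative):

    #include "hwy/highway.h"

    namespace hn = hwy::HWY_NAMESPACE;

    // Written once; Highway maps it to the SIMD extension of the target.
    void add_arrays(const float* a, const float* b, float* out, size_t n) {
        const hn::ScalableTag<float> d;     // descriptor for the native vector width
        const size_t lanes = hn::Lanes(d);  // number of float lanes on this target
        for (size_t i = 0; i + lanes <= n; i += lanes) {
            const auto va = hn::Load(d, a + i);
            const auto vb = hn::Load(d, b + i);
            hn::Store(hn::Add(va, vb), d, out + i);
        }
        // The remaining n % lanes elements would be handled in a tail loop.
    }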

Accelerating with FPGAs

FPGAs accelerate software at low cost and power consumption. Traditionally, FPGA development required a strong understanding of digital design concepts and dedicated hardware description languages such as VHDL or Verilog. This programming complexity and the limited portability of the code made FPGA-based solutions harder to access and more specialized than CPU- or GPU-based computing platforms. Intel oneAPI changes this.

Intel oneAPI is a software development kit that unifies CPU, GPU, and FPGA programming. It supports C++, Fortran, Python, and Data Parallel C++ (DPC++) for heterogeneous computing, improving performance and productivity while shortening development time.

Since Intel oneAPI can target FPGAs from SYCL/C++, software developers are increasingly interested in using FPGAs for data processing. An FPGA can serve SIMDified applications by being added as a backend to a SIMD abstraction library, which lets SIMD applications run on FPGAs.
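
A minimal sketch of what targeting an FPGA from oneAPI looks like (using the FPGA device selector from the oneAPI FPGA extensions; the emulator selector serves the same role during development):

    #include <sycl/sycl.hpp>
    #include <sycl/ext/intel/fpga_extensions.hpp>

    int main() {
        // Select the FPGA device (fpga_emulator_selector_v allows fast iteration).
        sycl::queue q{sycl::ext::intel::fpga_selector_v};

        // The kernel body is synthesized into an FPGA circuit by the DPC++ compiler.
        q.single_task([]() {
            // ... kernel logic ...
        }).wait();
    }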

SIMD and FPGAs go together
Annotations let the Intel DPC++ compiler synthesize C++ code into circuits and auto-vectorize data-parallel processing. Annotating code so that arrays are implemented as registers on the FPGA removes data-access constraints and allows parallel processing from source to sink. This makes SIMD performance acceleration with FPGAs straightforward and configurable.

SIMD abstraction libraries are a logical vehicle for FPGA SIMD processing. As noted, these libraries already support Intel and ARM SIMD instruction set extensions. The TSL abstraction library simplifies the implementation of SIMD instructions on FPGAs, as the following example shows. In the generic element-wise addition below, scalar code specifies how the registers are loaded, and the unroll pragma tells the DPC++ compiler to implement all lanes in parallel.
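
A sketch in the spirit of that example (the register type and names are illustrative, not TSL's actual implementation):

    // A SIMD register modeled as a plain array; on the FPGA the DPC++ compiler
    // can place it in registers (e.g., via the [[intel::fpga_register]] attribute).
    template <typename T, int N>
    struct fpga_reg {
        T data[N];
    };

    // Generic element-wise addition: scalar code plus an unroll pragma.
    template <typename T, int N>
    fpga_reg<T, N> add(const fpga_reg<T, N>& a, const fpga_reg<T, N>& b) {
        fpga_reg<T, N> result;
        #pragma unroll
        for (int i = 0; i < N; ++i)   // all N lanes are synthesized in parallel
            result.data[i] = a.data[i] + b.data[i];
        return result;
    }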

This simple element-wise example has no data dependencies, and comparable implementations work for SIMD instructions such as scatter, gather, and store. More complex instructions can be accelerated with additional optimization.

A horizontal reduction, for instance, requires a compile-time adder tree of depth ld(N), where N is the number of elements. Unroll pragmas over compile-time constants can implement such adder trees in a scalable manner, as shown in the following code example.
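
A sketch of such a reduction, reusing the illustrative fpga_reg type from above and assuming N is a power of two; because N is a compile-time constant, the fully unrolled loops synthesize into an adder tree of ld(N) stages:

    // Horizontal reduction (sum of all lanes) as a compile-time adder tree.
    template <typename T, int N>
    T hadd(const fpga_reg<T, N>& reg) {
        T tmp[N];
        #pragma unroll
        for (int i = 0; i < N; ++i)
            tmp[i] = reg.data[i];
        // Each stage halves the number of partial sums: ld(N) stages in total.
        #pragma unroll
        for (int stride = N / 2; stride > 0; stride /= 2) {
            #pragma unroll
            for (int i = 0; i < stride; ++i)
                tmp[i] += tmp[i + stride];
        }
        return tmp[0];
    }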

Software that calls a library of comparable SIMD components can thus accelerate its SIMD instructions on Intel FPGAs by adding implementations like the examples above.

The Intel FPGA Board Support Package (BSP) adds system-level benefits. Intel FPGAs use a BSP to describe the hardware interfaces of the board and to provide a shell around the kernel.

The BSP enables SYCL Unified Shared Memory (USM), which frees the CPU from managing data transfers by letting the accelerator exchange data directly. This allows FPGAs to act as coprocessors.
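
With USM, for example, one allocation is visible to both host and FPGA (a minimal sketch using SYCL's standard malloc_shared API; the kernel is illustrative):

    #include <sycl/sycl.hpp>

    void process(sycl::queue& q, size_t n) {
        // Shared allocation: host and FPGA access it without explicit copies.
        float* data = sycl::malloc_shared<float>(n, q);
        for (size_t i = 0; i < n; ++i)
            data[i] = float(i);              // host writes directly

        q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
            data[i] *= 2.0f;                 // device reads and writes directly
        }).wait();

        sycl::free(data, q);
    }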

Because the BSP is pre-compiled, only the kernel logic has to be generated at build time, which shortens compilation.

Thanks to their C++/SYCL compatibility, their offloading of data-transfer management from the CPU, and their pre-compiled BSPs, Intel FPGAs are ideal for SIMD and streaming applications such as modern composable databases.

SIMD/FPGA simplicity
At the SiMoD workshop at SIGMOD 2023 in Seattle, USA, Dirk Habich, Alexander Krause, Johannes Pietrzyk, and Wolfgang Lehner of TU Dresden presented their paper “Simplicity done right for SIMDified query processing on CPU and FPGA” on using FPGAs to accelerate SIMD instructions. The work, supported by Intel’s Christian Färber, illustrates how practical and efficient it is to develop a SIMDified kernel on an FPGA while achieving top performance.

The paper evaluated FPGA acceleration of SIMD instructions on a dual-socket 3rd-generation Intel Xeon Scalable processor (code-named “Ice Lake”) with 36 cores and a base frequency of 2.2 GHz, and a BittWare IA-840f acceleration card with an Intel Agilex 7 AGF027 FPGA and 4× 16 GB of DDR4 memory.

First, they gradually increased the register width of the SIMD instance to see how it affected the maximum achievable bandwidth. The first instance, a simple aggregation, showed that the FPGA accelerator’s bandwidth doubles with the data width until the global memory bandwidth saturates, the ideal acceleration case.

The second scenario, a filter-count kernel with a data dependency in the last stage of the adder tree, showed similar behavior but saturated earlier, at the bandwidth of the PCIe link. Both scenarios demonstrate the considerable acceleration gains of natively parallel instructions on a highly parallel architecture and suggest that wide memory accesses can sustain the benefits.

The final performance comparison pitted the FPGA against the CPU, with both running the same multi-threaded, AVX512-based filter-count kernel. As expected, per-core CPU bandwidth decreased as the thread count grew toward the CPU’s core count, while the FPGA delivered peak performance across all workloads.

Building on this work, the TU Dresden and Intel team investigated how to use TSL to turn an FPGA into a bespoke SIMD processor.
