SYCL Standouts
Intel has introduced what it describes as the first SYCL implementation of fully-fused Multi-Layer Perceptrons (MLPs) for Intel GPUs that support Intel Xe Matrix Extensions (XMX) instructions, and has open-sourced the repository. The implementation offers cross-platform use, multi-resolution hash encoding, support for a range of neural network architectures, compatibility with PyTorch, and high-performance training and inference.
In Intel’s tests, the implementation outperforms the CUDA PyTorch version running on Nvidia’s H100 GPU by up to a factor of 19, and the off-the-shelf Intel Extension for PyTorch (IPEX) implementation running on the same Intel GPU by up to a factor of 30.
Multi-Layer Perceptron
Multi-Layer Perceptrons (MLPs) serve as the primary neural network architecture for many modern Machine Learning (ML) applications, such as representing the solution operator of partial differential equations, determining the density or colour function of objects in Neural Radiance Fields (NeRFs), and replacing classical ray tracing with Neural Ray Tracing. MLPs are characterised by fully connected layers, in which every neuron is connected to every neuron in the preceding and following layers. Because each neuron’s output is independent of its neighbours in the same layer, MLPs are well suited to fully fusing their operations.
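The layer structure described above can be illustrated with a minimal pure-Python sketch. It is not Intel's implementation: the helper names (`dense_layer`, `mlp_forward`, `random_weights`), the ReLU activation, the input size of 3, and the omission of biases are all assumptions chosen for brevity. Only the width of 64 comes from the article.

```python
import math
import random

def relu(v):
    return max(0.0, v)

def identity(v):
    return v

def dense_layer(inputs, weights, activation):
    """Compute one fully connected layer.

    weights[j] holds the incoming weights of output neuron j, so each output
    depends only on the previous layer's values, never on its neighbours.
    """
    return [activation(sum(w * x for w, x in zip(row, inputs))) for row in weights]

def mlp_forward(x, layers):
    """Run x through a list of (weights, activation) pairs, layer by layer."""
    for weights, activation in layers:
        x = dense_layer(x, weights, activation)
    return x

def random_weights(n_out, n_in, rng):
    """Hypothetical initialisation, used only to make the example runnable."""
    scale = 1.0 / math.sqrt(n_in)
    return [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]

rng = random.Random(0)
width = 64  # hidden width used in the comparisons quoted in this article
layers = [
    (random_weights(width, 3, rng), relu),      # input layer: 3 -> 64
    (random_weights(width, width, rng), relu),  # hidden layer: 64 -> 64
    (random_weights(1, width, rng), identity),  # output layer: 64 -> 1
]
output = mlp_forward([0.1, 0.2, 0.3], layers)
print(len(output))
```

Because every entry of a layer's output is computed independently, the work within a layer can be spread across many GPU threads, and the intermediate list `x` is exactly the data a fused kernel would try to keep on chip between layers.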
Intel presents the first SYCL implementation of fully-fused MLPs for Intel GPUs that support Intel Xe Matrix Extensions (XMX) instructions, together with an open-source repository for the implementation. By fusing the operations in each layer of the MLP, the implementation minimises slow global memory accesses and maximises data reuse within the general register file and shared local memory. Using a roofline model, Intel shows that this significantly increases arithmetic intensity and improves performance, particularly for inference. The study also demonstrates the effectiveness of the SYCL implementation in three key domains: Neural Radiance Fields, Physics-Informed Machine Learning, and Image Compression.
Multi-Layer Perceptron
This work presents a SYCL implementation of Multi-Layer Perceptrons (MLPs) optimised for the Intel Data Center GPU Max 1550. By fusing the operations in each layer of the MLP, Intel’s approach maximises data reuse within the general register file and shared local memory, which minimises slow global memory accesses and increases efficiency. Using a simple roofline model, Intel shows that this significantly increases arithmetic intensity and improves performance, particularly for inference.
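To see why fusing raises arithmetic intensity, a rough roofline estimate can be sketched as follows. This is a simplified model written for illustration, not the model from Intel's study: it assumes square layers, 2-byte values, inference only, and that an unfused implementation writes every layer's activations to global memory and reads them back. The function names and the hardware figures in the usage note are hypothetical.

```python
def arithmetic_intensity(batch, width, n_layers, bytes_per_value=2, fused=False):
    """FLOPs per byte of global memory traffic for inference through
    n_layers square (width x width) layers."""
    flops = 2 * batch * width * width * n_layers  # one multiply and one add per weight
    weight_bytes = n_layers * width * width * bytes_per_value
    activation_bytes = batch * width * bytes_per_value
    if fused:
        # Input is read once and the final output written once;
        # intermediate activations stay in registers or shared local memory.
        traffic = weight_bytes + 2 * activation_bytes
    else:
        # Every layer reads its input from, and writes its output to, global memory.
        traffic = weight_bytes + 2 * n_layers * activation_bytes
    return flops / traffic

def attainable_flops(intensity, peak_flops, bandwidth):
    """Roofline bound: performance is capped by compute or by memory traffic."""
    return min(peak_flops, intensity * bandwidth)

unfused = arithmetic_intensity(batch=2**17, width=64, n_layers=4)
fused = arithmetic_intensity(batch=2**17, width=64, n_layers=4, fused=True)
print(f"unfused: {unfused:.1f} FLOPs/byte, fused: {fused:.1f} FLOPs/byte")
```

Under these assumptions the fused variant has an arithmetic intensity close to `n_layers` times higher, because activation traffic dominates weight traffic at large batch sizes. Passing either value to `attainable_flops` together with a device's peak throughput and memory bandwidth (placeholder numbers such as `peak_flops=1000` and `bandwidth=50` are enough to experiment) shows whether the kernel sits on the memory-bound or compute-bound side of the roofline.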
Intel Extension for PyTorch
Compared with a similar CUDA implementation for MLPs, Intel shows that its implementation on the Intel Data Center GPU outperforms the CUDA code on Nvidia’s H100 GPU by a factor of up to 2.84 in inference and 1.75 in training. The study also demonstrates the effectiveness of the SYCL implementation in three key domains: Neural Radiance Fields, Physics-Informed Machine Learning, and Image Compression. In all of these cases, Intel’s approach outperforms the CUDA PyTorch version on Nvidia’s H100 GPU by up to a factor of 19, and the off-the-shelf Intel Extension for PyTorch (IPEX) implementation on the same Intel GPU by up to a factor of 30.
SYCL Features
The first of several advantages of Intel’s approach is high-performance computation: the implementation runs efficiently on Intel Data Center GPUs, enabling high-throughput training and inference. It also provides Python bindings that integrate with the PyTorch ecosystem, so users can include GPU-accelerated MLPs in PyTorch applications. It is flexible, supporting a range of network configurations and networks with multiple hidden layers to meet different performance requirements and use cases. Finally, it incorporates Multi-Resolution Hash Encoding, which allows the network to handle high-frequency features efficiently, and it is designed to run on a variety of Intel GPUs, which broadens the framework’s usability across platforms.
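The idea behind Multi-Resolution Hash Encoding can be sketched in a few lines. The example below follows the formulas published for the original technique (Instant Neural Graphics Primitives): grid resolutions grow geometrically across levels, and integer grid coordinates are mapped to hash-table slots by XOR-ing them with per-dimension constants. It is an illustration of the general scheme under those assumptions, not code from Intel's library; the function names are hypothetical, and the feature lookup and interpolation steps are omitted.

```python
import math

# Per-dimension hashing constants from the Instant NGP paper.
PRIMES = (1, 2654435761, 805459861)

def level_resolutions(n_levels, n_min, n_max):
    """Grid resolution per level, growing geometrically from n_min towards n_max."""
    growth = math.exp((math.log(n_max) - math.log(n_min)) / (n_levels - 1))
    return [math.floor(n_min * growth ** level) for level in range(n_levels)]

def hash_index(coords, table_size):
    """Map integer grid coordinates (up to 3-D) to a slot in a hash table."""
    h = 0
    for c, p in zip(coords, PRIMES):
        h ^= c * p
    return h % table_size

def corner_indices(point, resolution, table_size):
    """Hash-table slots of the grid-cell corners surrounding a point in [0, 1)^d."""
    base = [math.floor(p * resolution) for p in point]
    dims = len(point)
    slots = []
    for mask in range(2 ** dims):
        # Bit i of mask selects the lower or upper corner along dimension i.
        corner = [base[i] + ((mask >> i) & 1) for i in range(dims)]
        slots.append(hash_index(corner, table_size))
    return slots

resolutions = level_resolutions(n_levels=16, n_min=16, n_max=512)
slots = corner_indices((0.3, 0.7, 0.1), resolutions[0], table_size=2**14)
print(resolutions[0], len(slots))
```

In a full encoder, each slot would hold a small trainable feature vector; the vectors at the corners are interpolated per level and the per-level results concatenated before being passed to the MLP. The fine levels are what let a small network represent high-frequency detail.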
SYCL Achievement
Intel’s fully-fused MLP implementation improves performance on a number of popular AI tasks. To illustrate these gains, Intel compared its SYCL implementation on an Intel Data Center GPU Max 1550 with the CUDA implementation on an Nvidia H100 GPU, and with PyTorch using both the CUDA backend and the Intel Extension for PyTorch (IPEX).
The results support Intel’s approach: in its tests, the implementation outperforms the PyTorch implementation by up to a factor of 30, and outperforms an equivalent CUDA implementation for MLPs of width 64 by a factor of up to 2.84 in inference and 1.75 in training.
Intel also demonstrated the effectiveness of its solution in three key domains: NeRF (Neural Radiance Fields), Image Compression, and Physics-Informed Machine Learning. The method showed significant speedups in all three, reaching up to 30 times over off-the-shelf PyTorch implementations and up to 2.84 times over highly optimised CUDA versions.
Considering the Future
Intel plans to optimise the implementation further, with a particular emphasis on using registers more efficiently to reduce stalls. In addition, by allowing several weight matrices to be loaded into shared local memory (SLM) and reducing SLM utilisation, Intel may be able to lower the number of barriers required. Other areas of focus are increasing occupancy for small batch sizes and optimising the fusion of the final matrix products into the backward pass.
Beyond further performance optimisation, Intel intends to generalise the library to other data types and larger network widths, and to investigate using Intel’s ESIMD SYCL extension for the implementation.