PyTorch 2.4 for Intel GPUs speeds AI tasks

PyTorch 2.4 adds initial support for the Intel Data Center GPU Max Series, integrating Intel GPUs and the SYCL software stack into the standard PyTorch stack to help accelerate AI workloads.

Advantages

With Intel GPU support, users have more GPU options and can use a consistent GPU programming model for both the front end and the back end. Workloads can now be deployed and run on Intel GPUs with minimal code changes. To support streaming devices, this version generalizes the PyTorch device and runtime abstractions (device, stream, event, generator, allocator, and guard). This generalization makes it easier to deploy PyTorch on widely available hardware and to integrate additional hardware back ends.
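As a minimal sketch of what a small code change can look like, the snippet below selects the xpu device when an Intel GPU is visible and falls back to the CPU otherwise. The helper name pick_device is hypothetical and not part of PyTorch. The sketch assumes a PyTorch build that ships the torch.xpu module, and the hasattr check guards against builds that do not.

```python
import torch


def pick_device() -> torch.device:
    """Return the Intel GPU ("xpu") if one is usable, else the CPU.

    `pick_device` is a hypothetical helper for this sketch, not a PyTorch API.
    """
    # hasattr guards against builds that do not ship the torch.xpu module.
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return torch.device("xpu")
    return torch.device("cpu")


device = pick_device()

# Everything below is device-agnostic: only the device object changes.
x = torch.ones(4, 3, device=device)
w = torch.full((3, 2), 0.5, device=device)
y = x @ w  # each entry is 3 * (1 * 0.5) = 1.5

print(device.type, tuple(y.shape))
```

Keeping the device choice in one place means the same script can run unchanged on machines with or without an Intel GPU.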

Integration into upstream PyTorch brings continuous software support, standardized software distribution, and consistent product release schedules, all of which should improve the experience for users of Intel GPUs.

An Overview of Support for Intel GPUs

Intel GPU support has been upstreamed into PyTorch, so the built-in front end supports both eager mode and graph mode. In eager mode, commonly used ATen operators are now implemented in SYCL. The most performance-critical graphs and operators are highly optimized using the oneAPI Deep Neural Network Library (oneDNN) and the oneAPI Math Kernel Library (oneMKL). Graph mode (torch.compile) now has an Intel GPU back end enabled, which performs optimizations for Intel GPUs and integrates Triton.
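The graph-mode entry point mentioned above is the regular torch.compile call. The following is a sketch only: compile_fn and run_on are hypothetical helpers, and the backend parameter is exposed so the example can also be tried with the "eager" debugging backend on machines without a compiler toolchain or an Intel GPU.

```python
import torch


def gelu_mlp(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # Matmul followed by pointwise ops: a typical target for graph-level fusion.
    return torch.nn.functional.gelu(x @ w).sum(dim=-1)


def compile_fn(fn, backend: str = "inductor"):
    """Hypothetical helper: wrap `fn` in torch.compile with the given backend.

    Compilation is lazy; it is triggered by the first call of the result.
    """
    return torch.compile(fn, backend=backend)


def run_on(device: torch.device, backend: str = "inductor") -> torch.Tensor:
    """Hypothetical helper: compile gelu_mlp and run it once on `device`."""
    x = torch.randn(8, 16, device=device)
    w = torch.randn(16, 4, device=device)
    return compile_fn(gelu_mlp, backend=backend)(x, w)
```

On a machine with a supported Intel GPU, run_on(torch.device("xpu")) would be the call that exercises the Intel GPU back end. Passing backend="eager" skips code generation and is useful only for checking that the function can be captured.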

PyTorch 2.4 now includes the essential components of Intel GPU support: ATen operators, oneDNN, Triton, Intel GPU source build, and integration of Intel GPU toolchains. Meanwhile, PyTorch Profiler, which is based on an integration of Kineto and oneMKL, is under active development ahead of the forthcoming PyTorch 2.5 release.

PyTorch 2.4 Features

In addition to providing the essential functionality for training and inference on the Intel Data Center GPU Max Series, the PyTorch 2.4 release for Linux keeps the same user experience as PyTorch on other supported hardware.

On Intel GPUs, PyTorch 2.4 features include:

Training and inference workflows.
Support for both torch.compile and the core eager functionality, with the Dynamo Hugging Face benchmark running fully in both eager and compile modes.
Data types such as FP32 and BF16, plus automatic mixed precision (AMP).
Support for Linux on the Intel Data Center GPU Max Series.
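To illustrate how the training workflow and AMP items fit together, here is a minimal sketch of one training step under BF16 autocast. The model, data, and the train_step helper are made up for illustration. The sketch assumes torch.autocast accepts "xpu" as a device type on builds with Intel GPU support, and it falls back to the CPU when no Intel GPU is available.

```python
import torch
from torch import nn


def train_step(model, optimizer, inputs, targets, device):
    """Hypothetical helper: one optimizer step under BF16 autocast."""
    model.train()
    inputs, targets = inputs.to(device), targets.to(device)
    optimizer.zero_grad()
    # Autocast runs eligible ops in BF16 while the parameters stay in FP32.
    with torch.autocast(device_type=device.type, dtype=torch.bfloat16):
        outputs = model(inputs)
    # Compute the loss in FP32 for numerical stability.
    loss = nn.functional.mse_loss(outputs.float(), targets)
    loss.backward()
    optimizer.step()
    return loss.item()


use_xpu = hasattr(torch, "xpu") and torch.xpu.is_available()
device = torch.device("xpu" if use_xpu else "cpu")

torch.manual_seed(0)
model = nn.Linear(8, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_value = train_step(
    model, optimizer, torch.randn(16, 8), torch.randn(16, 1), device
)
print(f"{device.type} loss: {loss_value:.4f}")
```

The only device-specific lines are the device selection and the device_type argument, which is the sense in which the user interface matches other supported hardware.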

PyTorch 2.5

The initial (prototype) Intel GPU support in PyTorch 2.4 makes the Intel Data Center GPU Max Series the first Intel GPU available in the PyTorch ecosystem for accelerating AI workloads.

To reach beta quality in the PyTorch 2.5 release, the developers are continuously improving the functionality and performance of the Intel GPU support. As the product matures, Intel Client GPUs will be added to the list of supported GPUs for AI PC use cases. More features are also being explored for PyTorch 2.5, such as:

Eager mode: Implement more ATen operators and fully run Dynamo Torchbench and TIMM in eager mode.
torch.compile: Fully run Dynamo Torchbench and TIMM benchmarks in compile mode and optimize performance.
Profiler and utilities: Enable torch.profiler to support Intel GPUs.
PyPI wheel distribution.
Support for Windows and the Intel Client GPU Series.

The community is invited to evaluate these latest additions to PyTorch’s Intel GPU support.

Intel Extension for PyTorch

Intel Extension for PyTorch extends PyTorch with the most recent performance optimizations for Intel hardware. The optimizations take advantage of Intel Advanced Vector Extensions 512 (Intel AVX-512) Vector Neural Network Instructions (VNNI) and Intel Advanced Matrix Extensions (Intel AMX) on Intel CPUs, and of Intel Xe Matrix Extensions (Intel XMX) AI engines on Intel discrete GPUs. In addition, Intel Extension for PyTorch provides easy GPU acceleration for Intel discrete GPUs through the PyTorch xpu device.

Generative AI (GenAI) workloads and models have become widespread in today’s technology landscape, and large language models (LLMs) drive most of these GenAI applications. Starting with version 2.1.0, Intel Extension for PyTorch includes specific optimizations for certain LLMs. See the Large Language Models (LLMs) section of the extension’s documentation for more details on LLM optimizations.

The extension can be loaded as a Python module for Python programs or linked as a C++ library for C++ programs. In Python scripts, it is enabled dynamically by importing intel_extension_for_pytorch.
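A minimal sketch of the Python usage described above follows. It assumes the extension is installed under its usual package name and uses the ipex.optimize entry point with default arguments. The small model is made up for illustration, and the try/except fallback exists only so the sketch still runs on stock PyTorch.

```python
import torch
from torch import nn

try:
    # Importing the package is what enables the extension at runtime.
    import intel_extension_for_pytorch as ipex
except ImportError:
    ipex = None  # fall back to stock PyTorch so the sketch still runs

# A small illustrative model, put in inference mode before optimization.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4)).eval()

if ipex is not None:
    # Apply the extension's optimizations to the model for inference.
    model = ipex.optimize(model)

with torch.no_grad():
    out = model(torch.randn(2, 16))

print(tuple(out.shape))
```

Because the extension is applied through a single import and one call, it can be added to or removed from an existing script without restructuring the model code.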

Architecture

Eager Mode: In eager mode, the PyTorch front end is extended with custom Python modules (such as fusion modules), optimal optimizers, and INT8 quantization APIs. Performance can be improved further by converting eager-mode models into graph mode using extended graph fusion passes.

Graph Mode: In graph mode, fusions reduce operator and kernel invocation overhead, which improves performance. Compared with eager mode, graph mode in PyTorch typically gains more from optimization techniques such as operation fusion.

Intel Extension for PyTorch amplifies these gains with more comprehensive graph optimizations. Both PyTorch TorchScript and TorchDynamo graph modes are supported. With TorchScript, torch.jit.trace() is recommended over torch.jit.script() because it generally supports a wider range of workloads. With TorchDynamo, the ipex backend can deliver strong performance.
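The TorchScript path recommended above can be sketched with stock PyTorch alone. The example traces a small, made-up model with torch.jit.trace(), freezes it, and checks that the traced graph matches eager execution. The extension itself is not required for this sketch; applying it before tracing is left out here and would be an additional step.

```python
import torch
from torch import nn

# A small illustrative model; eval() is required before freezing.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4)).eval()
example_input = torch.randn(2, 16)

with torch.no_grad():
    # trace() records the operators executed for this example input.
    traced = torch.jit.trace(model, example_input)
    # freeze() inlines parameters as constants, which helps later fusion passes.
    traced = torch.jit.freeze(traced)
    eager_out = model(example_input)
    traced_out = traced(example_input)

# The traced graph should reproduce the eager result for the same input.
print(torch.allclose(eager_out, traced_out, atol=1e-5))
```

Tracing records a single execution path, so it suits models without data-dependent control flow; models that branch on tensor values are the usual case where torch.jit.script() or TorchDynamo is considered instead.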

CPU Optimization: On the CPU, Intel Extension for PyTorch automatically dispatches operators to underlying kernels based on the detected instruction set architecture (ISA). The extension takes advantage of the vectorization and matrix acceleration units available on Intel hardware. The runtime extension offers finer-grained thread runtime control and weight sharing for better efficiency.

Intel GPU

GPU Optimization: On the GPU, optimized operators and kernels are implemented and registered through the PyTorch dispatching mechanism. These operators and kernels are accelerated by the native vectorization and matrix computation features of Intel GPU hardware. Intel Extension for PyTorch for GPU uses the DPC++ compiler, which supports the latest SYCL standard as well as a number of extensions to it. These extensions can be found in the sycl/doc/extensions directory.

Support

The team tracks bugs and enhancement requests using GitHub issues. Before submitting a suggestion or bug report, search the existing GitHub issues to see whether your issue has already been reported.