Transferring Calls from CUDA Math Library to oneAPI Math Kernel Library

 


oneMKL Overview

Science, finance, enterprise, and communications applications rely on advanced math libraries for linear algebra (BLAS, LAPACK, SPARSE), vector math, Fourier transforms, random number generation, and linear equation or analytical solvers. The oneAPI Specification describes oneMKL as a complete math kernel library and solver package, and it defines the SYCL/DPC++ interfaces for these performance math library functions. There are two implementations of the specification:

  • Intel implements the specification as the Intel oneAPI Math Kernel Library. This free, binary-only download is part of the Intel oneAPI Base Toolkit and is deeply optimised for Intel CPUs and GPUs.
  • The open-source oneAPI Math Kernel Library (oneMKL) Interfaces Project also implements the specification. The project aims to show how the oneMKL SYCL/DPC++ interfaces can be implemented for any math library and any target hardware, so they can be used on CPUs, GPUs, FPGAs, and other accelerators.

Both implementations follow the oneAPI vision: a shared codebase that can be configured for specific hardware platforms, giving source-code and performance portability across diverse accelerated computing platforms.

oneMKL Software Architecture

The oneMKL part of the oneAPI specification covers these domains of functionality:

  • Dense Linear Algebra (BLAS, LAPACK)
  • Sparse Linear Algebra
  • Discrete Fourier Transforms
  • Random Number Generators
  • Vector Math

oneAPI Math Kernel Library supports Linear Algebra, including BLAS and LAPACK operations, SPARSE, FFT, RNG, Data Fitting, Vector Math, and Summary Statistics.

Currently, oneMKL Interfaces supports BLAS, LAPACK, RNG, DFT, and SPARSE_BLAS.

How it Works

oneMKL GPU Offload Model

In the GPGPU compute model, a host is connected to one or more GPU compute devices. Each compute device contains many GPU Compute Engines (CE), also called Execution Units (EU) or Xe Vector Engines. A host program and a set of kernels execute within a context set up by the host, and the host interacts with these kernels through a command queue.

A kernel-enqueue command defines an N-dimensional index space for kernel execution. The kernel, its argument values, and the parameters that define the index space together make up a kernel-instance. When a compute device executes a kernel-instance, the kernel function runs once for each point in the index space (the N-dimensional range).
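The index-space idea can be sketched on the host in plain C++, without SYCL or a GPU. This is only an illustrative emulation: the names Index2, run_over_range, and fill_scaled_sum are hypothetical and are not part of SYCL or oneMKL. The loop visits the points sequentially, whereas a real device is free to run them in parallel and in any order.

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper type: a point in a 2-D index space.
using Index2 = std::array<std::size_t, 2>;

// Host-only emulation: call `kernel` once for every point of a
// rows x cols index space. A device would not guarantee this order.
void run_over_range(std::size_t rows, std::size_t cols,
                    const std::function<void(Index2)>& kernel) {
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            kernel(Index2{i, j});
}

// A toy "kernel-instance": kernel body + argument values + index space.
// Fills a row-major rows x cols matrix with out(i, j) = scale * (i + j).
std::vector<double> fill_scaled_sum(std::size_t rows, std::size_t cols,
                                    double scale) {
    std::vector<double> out(rows * cols, 0.0);
    run_over_range(rows, cols, [&](Index2 idx) {
        out[idx[0] * cols + idx[1]] =
            scale * static_cast<double>(idx[0] + idx[1]);
    });
    return out;
}
```

Each invocation writes only its own element, which is what allows the invocations to run independently on a real device.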

Synchronisation can also happen at the command level, between commands in host command queues. In this mode, one command can depend on execution points in one or more other commands.

Atomics and fences are further synchronisation mechanisms, based on memory-order constraints inside a program. They control how a memory operation performed by one work-item becomes visible to another, which provides fine-grained synchronisation points in the data-parallel computation model.

oneMKL uses this basic execution model as implemented in the Intel Graphics Compute Runtime for oneAPI Level Zero and OpenCL Driver.

Basic oneMKL SYCL API

oneAPI Math Kernel Library also supports automatic GPU offload dispatch through OpenMP pragmas, but this article focuses on SYCL queues.

SYCL is a royalty-free, cross-platform abstraction layer that lets code for heterogeneous and offload processors be written in standard ISO C++ (C++17 or newer for SYCL 2020). It provides APIs and abstractions to find devices such as CPUs, GPUs, and FPGAs and to manage data resources and code execution on them.

The oneMKL SYCL API, part of the oneAPI spec and open-source, is ideal for transitioning CUDA proprietary library function APIs to an open standard.

oneAPI Math Kernel Library organises routines by mathematical domain using C++ namespaces. The top-level oneapi::mkl namespace contains all oneMKL objects and routines, and each domain adds a second namespace layer, for example oneapi::mkl::blas, oneapi::mkl::lapack, oneapi::mkl::rng, and oneapi::mkl::dft.

Class-based oneMKL APIs, such as those in the RNG and DFT domains, take a sycl::queue as an argument to the constructor or to another setup routine. Computational class methods follow the same execution requirements as standalone computational routines.

To control which GPU device is used, assign a sycl::device instance. If the device architecture supports it, a device instance can be partitioned into sub-devices.

Computational routines in the oneAPI Math Kernel Library SYCL API are asynchronous, so multiple devices can be used at the same time. Each routine enqueues work for the selected device and may return before that work has finished.

sycl::buffer objects automatically synchronise kernel launches that are linked by data dependencies, so no explicit synchronisation is needed for sycl::buffer arguments passed to oneAPI Math Kernel Library routines.

When oneMKL routines use USM pointers as input or output, the calling program must manage the asynchronicity itself. To help with this, every oneMKL routine that takes a USM pointer parameter can optionally accept a list of input events (std::vector<sycl::event>) that must complete before the computation starts, and it returns a sycl::event that represents completion of the computation.
sycl::event mkl::domain::routine(…, const std::vector<sycl::event> &in_events = {});
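The event-chaining pattern above can be modelled on the host in plain C++. The sketch below is a toy model only, not SYCL and not oneMKL: ToyQueue, Event, submit, wait, and scale_then_add are hypothetical names, and the work is deferred until wait is called rather than run asynchronously on a device. It shows just the shape of the contract: a submission takes a list of input events and returns an event for its own completion.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Toy stand-in for an event: an index into the queue's task list.
using Event = std::size_t;

// Toy stand-in for a queue: submit() only records work, wait() runs it.
class ToyQueue {
    struct Task {
        std::function<void()> work;
        std::vector<Event> deps;
        bool done;
    };
    std::vector<Task> tasks_;

public:
    // Record a task that must not start before every event in in_events
    // has completed; return an event that represents this task.
    Event submit(std::function<void()> work,
                 const std::vector<Event>& in_events = {}) {
        tasks_.push_back(Task{std::move(work), in_events, false});
        return tasks_.size() - 1;
    }

    // Run the task behind `e`, after first running all its dependencies.
    void wait(Event e) {
        if (tasks_[e].done) return;
        for (Event d : tasks_[e].deps) wait(d);
        tasks_[e].work();
        tasks_[e].done = true;
    }
};

// Chain two dependent steps on the same data: scale first, then add.
double scale_then_add(double x0, double factor, double offset) {
    ToyQueue q;
    std::vector<double> x{x0};
    Event scaled = q.submit([&] { x[0] *= factor; });
    Event added = q.submit([&] { x[0] += offset; }, {scaled});
    q.wait(added);  // runs `scaled` first, then `added`
    return x[0];
}
```

In real SYCL code the two steps would be oneMKL calls on USM pointers, with the sycl::event returned by the first call passed in the dependency list of the second.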

All oneMKL functions are host thread-safe.

Third-Party GPU Support

The open backend architecture of the oneMKL Interfaces project lets the oneMKL SYCL API work with several offload devices, including NVIDIA and AMD GPUs.

To compile your software for AMD or NVIDIA GPUs, install the relevant GPU drivers or plug-ins:

  • The oneAPI for AMD GPUs plugin supports AMD GPUs.
  • The oneAPI for NVIDIA GPUs plugin supports NVIDIA GPUs.

CUDA Compatibility

Library Wrappers

To see how a oneAPI Math Kernel Library function maps to a given backend, inspect the {backend}_wrappers.cpp file in the relevant directory.

CUDA Support

After installing the CUDA-to-SYCL migration tool as described in the next section, you can check whether a CUDA function is supported and how it maps to SYCL:

Run dpct with the --query-api-mapping option, giving it the name of the CUDA function to look up.

Migrating CUDA to SYCL

Migrating CUDA code to SYCL simplifies heterogeneous computing for math functions: the code can still target CUDA-compatible backends and proprietary NVIDIA hardware, while also being free to run on multi-vendor hardware.

This section covers the tools, workflow, and options for migrating CUDA math library calls to SYCL.

Download Software

Intel and oneAPI offer efficient CUDA-to-SYCL migration tools.

  • For Intel-branded software and Intel hardware, the current Intel oneAPI Base Toolkit provides all the required components, including:
    • Intel DPC++ Compatibility Tool (the migration tool)
    • Intel oneAPI DPC++/C++ Compiler
    • Intel oneMKL library
    • Intel VTune Profiler, for optimising the code after migration
  • If you plan to use open-source software on NVIDIA or AMD GPUs, deploy the appropriate tools:
    • The SYCLomatic migration tool: get it from the GitHub repository and follow the build steps.
    • oneAPI Math Kernel Library Interfaces: follow the build instructions on the Get Started page.
    • A compiler that supports the DPC++/SYCL code extensions, which is required to compile the SYCL output. Consult the oneMKL Specification when choosing a compiler for your application.
  • To compile your software for AMD or NVIDIA GPUs, install the plug-ins:
    • Install the oneAPI for AMD GPUs plugin to use AMD GPUs.
    • Install the oneAPI for NVIDIA GPUs plugin to use NVIDIA GPUs.

Try Code Samples

To help you understand the SYCL conversion procedure, the CUDA-to-SYCL migration course starts with basic CUDA examples and progresses to more complicated projects that use different features and libraries.

The Migrate from CUDA to C++ with SYCL portal provides technical information on automatic migration tools, including several guided CUDA-to-SYCL code samples. For understanding CUDA library code migration to oneMKL, the following are most interesting:

  • Migrating MonteCarloMultiGPU from CUDA to SYCL (Source Code) walks through cuRAND to oneMKL and SYCL conversion.
  • Guided cuBLAS Examples for SYCL Migration gives a complete reference for linear algebra and GEMM migration.
  • CUDA-to-SYCL Migration Training Jupyter Notebooks (GitHub).

Sample cuBLAS migration code

This section looks at the migration of the cuBLAS Library - APIs Examples from the NVIDIA/CUDALibrarySamples GitHub repository.

The sample source code (SYCL) has been migrated from CUDA so that the computations can be offloaded to a GPU or CPU. The samples show how to:

  • Migrate code to SYCL and optimise the migration steps.
  • Speed up processing.

The source files for each cuBLAS sample show the corresponding oneAPI Math Kernel Library routines in use. Each sample is a simple program built around a single function.
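When checking a migrated GEMM sample, it can help to compare the device result against a small host reference. The sketch below is plain C++ and is not taken from the samples; gemm_reference is a hypothetical helper name. It assumes column-major storage with leading dimensions, which is the layout cuBLAS uses and which oneMKL also offers through its column-major BLAS entry points, and it handles only the non-transposed case.

```cpp
#include <cstddef>
#include <vector>

// Host reference for C = alpha * A * B + beta * C, no transposition.
// All matrices are column-major: element (i, j) of A is A[i + j * lda].
// A is m x k, B is k x n, C is m x n.
void gemm_reference(std::size_t m, std::size_t n, std::size_t k,
                    double alpha,
                    const std::vector<double>& A, std::size_t lda,
                    const std::vector<double>& B, std::size_t ldb,
                    double beta,
                    std::vector<double>& C, std::size_t ldc) {
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = 0; i < m; ++i) {
            double acc = 0.0;
            for (std::size_t p = 0; p < k; ++p)
                acc += A[i + p * lda] * B[p + j * ldb];
            C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc];
        }
    }
}
```

For larger matrices or single precision, compare the device output against this reference with a tolerance rather than exact equality, because the device may accumulate the sums in a different order.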

