An Introduction To NumPy and SciPy in Python for Multithreading

An Easy Guide to Multithreading in Python

Python is a strong language, particularly for developing AI and machine learning applications. However, CPython, the programming language’s original, reference implementation and byte-code interpreter, lacks multithreading functionality; multithreading and parallel processing need to be enabled from the kernel. Some of the desired multi-core processing is made possible by libraries Python NumPy and SciPy such as NumPy, SciPy, and PyTorch, which use C-based implementations. However, there is a problem known as the Global Interpreter Lock (GIL), which literally “locks” the CPython interpreter to only working on one thread at a time, regardless of whether the interpreter is in a single or multi-threaded environment.

Let’s take a different approach to Python.

The robust libraries and tools that support Intel Distribution of Python, a collection of high-performance packages that optimize underlying instruction sets for Intel architectures, are designed to do this.

For compute-intensive, core Python numerical and scientific packages like NumPy, SciPy, and Numba, the Intel distribution helps developers achieve performance levels that are comparable to those of a C++ program by accelerating math and threading operations using oneAPI libraries while maintaining low Python overheads. This enables fast scaling over a cluster and assists developers in providing highly efficient multithreading, vectorization, and memory management for their applications.

Let’s examine Intel’s strategy for enhancing Python parallelism and composability in more detail, as well as how it might speed up your AI/ML workflows.

Parallelism in Nests: Python NumPy and SciPy

Python libraries called Python NumPy and SciPy were created especially for scientific computing and numerical processing, respectively.

Exposing parallelism on all conceivable levels of a program for example, by parallelizing the outermost loops or by utilizing various functional or pipeline sorts of parallelism on the application level is one workaround to enable multithreading/parallelism in Python scripts. This parallelism can be accomplished with the use of libraries like Dask, Joblib, and the included multiprocessing module mproc (with its ThreadPool class).

Data-parallelism can be performed with Python modules like Python NumPy and SciPy, which can then be accelerated with an efficient math library like the Intel oneAPI Math Kernel Library (oneMKL). This is because massive data processing requires a lot of processing. Using various threading runtimes, oneMKL is multi-threaded. An environment variable called MKL_THREADING_LAYER can be used to adjust the threading layer.

As a result, a code structure known as nested parallelism is created, in which a parallel section calls a function that in turn calls another parallel region. Since serial sections that is, regions that cannot execute in parallel and synchronization latencies are typically inevitable in Python NumPy and SciPy based systems, this parallelism-within-parallelism is an effective technique to minimize or hide them.

Going One Step Further: Numba

Despite offering extensive mathematical and data-focused accelerations through C-extensions, Python NumPy and SciPy remain a fixed set of mathematical tools accelerated through C-extensions. If non-standard math is required, a developer should not expect it to operate at the same speed as C-extensions. Here’s where Numba can work really well.

OneTBB

Based on LLVM, Numba functions as a “Just-In-Time” (JIT) compiler. It aims to reduce the performance difference between Python and compiled, statically typed languages such as C and C++. Additionally, it supports a variety of threading runtimes, including workqueue, OpenMP, and Intel oneAPI Threading Building Blocks (oneTBB). To match these three runtimes, there are three integrated threading layers. The only threading layer installed by default is workqueue; however, other threading layers can be added with ease using conda commands (e.g., $ conda install tbb).

The environment variable NUMBA_THREADING_LAYER can be used to set the threading layer. It is vital to know that there are two ways to choose this threading layer: either choose a layer that is generally safe under different types of parallel processing, or specify the desired threading layer name (e.g., tbb) explicitly.

Composability of Threading

The efficiency or efficacy of co-existing multi-threaded components depends on an application’s or component’s threading composability. A component that is “perfectly composable” would operate without compromising the effectiveness of other components in the system or its own efficiency.

In order to achieve a completely composable threading system, care must be taken to prevent over-subscription, which means making sure that no parallel region of code or component can require a certain number of threads to run (this is known as “mandatory” parallelism).

An alternative would be to implement a type of “optional” parallelism in which a work scheduler determines at the user level which thread(s) the components should be mapped to while automating the coordination of tasks among components and parallel regions. Naturally, the efficiency of the scheduler’s threading model must be better than the high-performance libraries’ integrated scheme since it is sharing a single thread-pool to arrange the program’s components and libraries around. The efficiency is lost otherwise.

Intel’s Strategy for Parallelism and Composability

Threading composability is more readily attained when oneTBB is used as the work scheduler. OneTBB is an open-source, cross-platform C++ library that was created with threading composability and optional/nested parallelism in mind. It allows for multi-core parallel processing.

An experimental module that enables threading composability across several libraries unlocks the potential for multi-threaded speed benefits in Python and was included in the oneTBB version released at the time of writing. As was previously mentioned, the scheduler’s improved threads allocation is what causes the acceleration.

The ThreadPool for Python standard is replaced by the Pool class in oneTBB. Additionally, the thread pool is activated across modules without requiring any code modifications thanks to the use of monkey patching, which allows an object to be dynamically replaced or updated during runtime. Additionally, oneTBB replaces oneMKL by turning on its own threading layer, which allows it to automatically provide composable parallelism when using calls from the Python NumPy and SciPy libraries.

See the code samples from the following composability demo, which is conducted on a system with MKL-enabled NumPy, TBB, and symmetric multiprocessing (SMP) modules and their accompanying IPython kernels installed, to examine the extent to which nested parallelism can enhance performance. Python is a feature-rich command-shell interface that supports a variety of programming languages and interactive computing. To get a quantifiable performance comparison, the demonstration was executed using the Jupyter Notebook extension.

import NumPy as np
from multiprocessing.pool import ThreadPool
pool = ThreadPool(10)

The aforementioned cell must be executed again each time the kernel in the Jupyter menu is changed in order to build the ThreadPool and provide the runtime outcomes listed below.

The following code, which runs the identical line for each of the three trials, is used with the default Python kernel:

%timeit pool.map(np.linalg.qr, [np.random.random((256, 256)) for i in range(10)])

This approach can be used to get the eigenvalues of a matrix using the standard Python kernel. Runtime is significantly improved up to an order of magnitude when the Python-m SMP kernel is enabled. Applying the Python-m TBB kernel yields even more improvements.

OneTBB’s dynamic task scheduler, which most effectively manages code where the innermost parallel sections cannot fully utilize the system’s CPU and where there may be a variable amount of work to be done, yields the best performance for this composability example. Although the SMP technique is still quite effective, it usually performs best in situations when workloads are more evenly distributed and the loads of all workers in the outermost regions are generally identical.

In summary, utilizing multithreading can speed up AI/ML workflows

The effectiveness of Python programs with an AI and machine learning focus can be increased in a variety of ways. Using multithreading and multiprocessing effectively will remain one of the most important ways to push AI/ML software development workflows to their limits.