Open-Source Intel Neural Compressor Joins ONNX for AI

 

Intel Neural Compressor

Intel Neural Compressor aims to provide popular model compression techniques such as quantization, distillation, pruning (sparsity), and neural architecture search on mainstream frameworks such as TensorFlow, PyTorch, ONNX Runtime, and MXNet, as well as through Intel extensions such as Intel Extension for PyTorch and Intel Extension for TensorFlow. Specifically, the tool offers the following key functions, typical examples, and open collaborations:

  1. Support a wide range of Intel hardware, including Intel Xeon Scalable Processors, Intel Xeon CPU Max Series, Intel Data Center GPU Flex Series, and Intel Data Center GPU Max Series, with substantial testing; AMD, Arm, and NVIDIA hardware receive limited testing via ONNX Runtime.
  2. Validate popular LLMs such as LLama2, Falcon, GPT-J, Bloom, and OPT, and more than 10,000 broad models such as ResNet50, BERT-Large, and Stable Diffusion from popular model hubs such as Hugging Face, Torch Vision, and ONNX Model Zoo, using zero-code optimisation solutions and automatic accuracy-driven quantization strategies (a quantization sketch follows this list).
  3. Collaborate with open AI ecosystems such as Hugging Face, PyTorch, ONNX, ONNX Runtime, and Lightning AI; cloud marketplaces such as Google Cloud Platform, Amazon Web Services, and Azure; and software platforms such as Alibaba Cloud, Tencent TACO, and Microsoft Olive.
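To make the accuracy-driven workflow concrete, the snippet below sketches a post-training quantization call in the style of the Intel Neural Compressor 2.x Python API. The model path is a placeholder and the evaluation function is a stub; exact module paths and argument names can vary between releases, so treat this as a sketch rather than canonical usage.

```python
# Sketch only: quantize an ONNX model with an accuracy-driven tuning loop,
# assuming the Intel Neural Compressor 2.x API (neural_compressor package).
from neural_compressor.config import PostTrainingQuantConfig
from neural_compressor.quantization import fit

def eval_func(model):
    # Placeholder metric: return your model's accuracy on a validation set.
    return 0.76

q_model = fit(
    model="resnet50_fp32.onnx",                        # illustrative model path
    conf=PostTrainingQuantConfig(approach="dynamic"),  # dynamic PTQ needs no calibration data
    eval_func=eval_func,                               # the tuner targets this metric
)
q_model.save("resnet50_int8")
```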

AI models

In the era of the AI PC, AI-enhanced apps will become the norm, as developers increasingly replace conventional code fragments with AI models. This fast-moving trend is opening up new and exciting user experiences, improving productivity, giving creators new tools, and enabling fluid, natural collaboration experiences.

By combining the CPU, GPU (graphics processing unit), and NPU (neural processing unit), AI PCs provide the fundamental compute building blocks needed to power these AI experiences. But to give users the best possible experience across all of these compute engines on the AI PC, developers must compress these AI models, which is a difficult task. To help address this challenge, Intel is pleased to announce that it has embraced the open-source community and released the Neural Compressor tool under the ONNX project.

ONNX

Open Neural Network Exchange (ONNX) is an open ecosystem that gives AI developers the freedom to choose the right tools as their projects evolve. ONNX provides an open-source format for AI models, covering both deep learning and traditional ML. It defines an extensible computation graph model as well as built-in operators and standard data types. The current focus is on the capabilities needed for inferencing (scoring).
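To make the format concrete, here is a minimal example of the typical ONNX workflow: export a small PyTorch model to an .onnx file and run it with ONNX Runtime. The model and shapes are arbitrary and purely illustrative.

```python
# Export a toy PyTorch model to ONNX, then score it with ONNX Runtime.
import numpy as np
import torch
import onnxruntime as ort

model = torch.nn.Sequential(torch.nn.Linear(16, 8), torch.nn.ReLU())
dummy_input = torch.randn(1, 16)
torch.onnx.export(model, dummy_input, "tiny.onnx",
                  input_names=["x"], output_names=["y"])

session = ort.InferenceSession("tiny.onnx")
outputs = session.run(["y"], {"x": np.random.rand(1, 16).astype(np.float32)})
print(outputs[0].shape)  # (1, 8)
```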

ONNX is widely supported and can be found across many frameworks, tools, and hardware. Enabling interoperability between different frameworks and streamlining the path from research to production help increase the pace of innovation in the AI community. Intel invites the community to join in advancing ONNX.

How Does Neural Compressor Work?

Neural Compressor, which builds on Intel Neural Compressor, aims to provide widely used model compression techniques. It is a simple yet smart tool designed to optimise neural network models described in the Open Neural Network Exchange (ONNX) standard. ONNX models are the industry-leading open standard for AI model representation, enabling smooth interchange across platforms and frameworks. With Neural Compressor, Intel now takes ONNX to the next level.

Neural Compressor

Neural Compressor aims to provide popular model compression techniques inherited from Intel Neural Compressor, with a focus on ONNX model quantization, including SmoothQuant and weight-only quantization via ONNX Runtime. Specifically, the tool offers the following key functions, typical examples, and open collaborations:

  • Support a broad range of Intel hardware, including the AI PC and Intel Xeon Scalable Processors.
  • Validate popular LLMs such as LLama2 and broad models such as BERT-base and ResNet50 from popular model hubs such as Hugging Face and ONNX Model Zoo, using automatic accuracy-driven quantization strategies (a related baseline sketch follows this list).
  • Collaborate with open AI ecosystems such as Hugging Face, ONNX, and ONNX Runtime, as well as software platforms such as Microsoft Olive.
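For comparison, ONNX Runtime itself ships a basic quantization helper; Neural Compressor layers techniques such as SmoothQuant, weight-only quantization, and accuracy-aware tuning on top of this tooling. The snippet below uses ONNX Runtime's built-in dynamic quantization (not the Neural Compressor API), with an illustrative model path.

```python
# Baseline: ONNX Runtime's own dynamic INT8 quantization of an ONNX model.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="model_fp32.onnx",    # illustrative input path
    model_output="model_int8.onnx",   # quantized model written here
    weight_type=QuantType.QInt8,      # 8-bit signed integer weights
)
```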

Why Is It Important?

Efficiency becomes increasingly important as AI makes its way into everyday life. Whether you are building computer vision apps, natural language processing systems, or recommendation engines, making the most of your hardware resources is essential. How does Neural Compressor accomplish this?

Minimising Model Footprint

Smaller models translate into quicker deployment, lower memory usage, and faster inference. These qualities are essential for maintaining performance when running your AI-powered application on the AI PC. In server and cloud environments, smaller models mean lower latency, higher throughput, and less data transfer, all of which save money.

Quicker Inference

Neural Compressor optimises model weights, quantizes parameters, and removes redundant connections. With AI acceleration features such as those built into Intel Core Ultra CPUs (Intel DL Boost), GPUs (Intel XMX), and NPUs (Intel AI Boost), this results in very fast inference.
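One way to take advantage of those accelerators from application code is to pick an ONNX Runtime execution provider when loading the compressed model. The sketch below assumes an onnxruntime build that exposes a hardware-accelerated provider (for example, the DirectML provider from the onnxruntime-directml package on Windows); the model path is illustrative.

```python
# Sketch: load a quantized ONNX model with the best available execution provider.
import onnxruntime as ort

available = ort.get_available_providers()
print(available)  # e.g. ['DmlExecutionProvider', 'CPUExecutionProvider']

# Prefer a hardware-accelerated provider when present, otherwise fall back to CPU.
preferred = [p for p in ("DmlExecutionProvider", "CPUExecutionProvider") if p in available]

session = ort.InferenceSession("model_int8.onnx", providers=preferred)
```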

AI PC Developer Benefits

Quicker Prototyping

Model compression and quantization are challenging! Through developer-friendly APIs, Neural Compressor lets developers iterate quickly on model architectures and easily apply cutting-edge quantization approaches such as 4-bit weight-only quantization and SmoothQuant.
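To give a flavour of what SmoothQuant does under the hood, the NumPy sketch below reproduces its core idea: migrating quantization difficulty from activations to weights with a per-channel scale, so the matrix product is unchanged. This is a conceptual illustration of the published technique, not the Neural Compressor API.

```python
# SmoothQuant idea (conceptual): rescale so Y = (X / s) @ (s * W) == X @ W,
# shrinking activation outliers that make per-tensor quantization hard.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
X[:, -1] *= 50.0                        # simulate an outlier activation channel
W = rng.normal(size=(8, 3))

alpha = 0.5                             # migration-strength hyperparameter
s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)

X_smooth = X / s                        # activations become easier to quantize
W_smooth = W * s[:, None]               # scales are folded into weights offline

assert np.allclose(X @ W, X_smooth @ W_smooth)
```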

Better User Experience

Your AI-driven apps will respond quickly and delight users with smooth interactions.

Deployment is simple with ONNX-compliant models, which come with native Windows API support for running on CPU, GPU, and NPU out of the box.

What Comes Next?

Intel Neural Compressor GitHub

Intel looks forward to working with the developer community as part of the ONNX initiative and enhancing synergies in the ONNX ecosystem.

