Announcing Azure HBv5 and ND GB200 V6 Virtual Machines

 

Presenting Azure HBv5 Virtual Machines: A Further Development in HPC Memory Bandwidth

Today at Microsoft Ignite, Satya Nadella unveiled Azure HBv5, the newest CPU-based virtual machine (VM) for HPC customers and their applications. The new VM is well suited to the most memory-intensive HPC applications, such as computational fluid dynamics, automotive and aerospace simulation, weather modeling, energy research, molecular dynamics, computer-aided engineering, and more.

For many HPC customers, memory performance on conventional server architectures is the largest barrier to achieving the workload performance (time to insight) and cost-effectiveness they need. To overcome this bottleneck, Microsoft and AMD have worked together to develop a custom 4th Generation EPYC processor with high-bandwidth memory (HBM). In an Azure HBv5 VM, four of these processors work together to deliver nearly 7 TB/s of memory bandwidth.

For comparison, this is up to 8 times more memory bandwidth than the newest bare-metal and cloud alternatives, almost 20 times more than Azure HBv3 and Azure HBv2 (3rd Gen EPYC with 3D V-cache “Milan-X” and 2nd Gen EPYC “Rome”, respectively), and up to 35 times more than an HPC server that is 4–5 years old and nearing the end of its hardware lifecycle.
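To make these ratios concrete, the short sketch below divides the 6.9 TB/s STREAM Triad figure quoted for HBv5 by each comparison factor to estimate the memory bandwidth each multiple implies for the other platform. The function and variable names are illustrative, and the results are rough implied values derived from "up to" and "almost" figures, not measured numbers for any specific system.

```python
# Figures quoted in the announcement (approximate, "up to" values).
HBV5_BANDWIDTH_TBS = 6.9  # STREAM Triad memory bandwidth of one HBv5 VM

# Bandwidth multiples claimed relative to each comparison platform.
BANDWIDTH_MULTIPLES = {
    "newest bare-metal and cloud alternatives": 8,
    "Azure HBv3 / HBv2": 20,
    "4-5 year old HPC server": 35,
}


def implied_baseline_gbs(hbv5_tbs: float, multiple: float) -> float:
    """Return the comparison platform's implied bandwidth in GB/s."""
    return hbv5_tbs * 1000 / multiple


for platform, multiple in BANDWIDTH_MULTIPLES.items():
    baseline = implied_baseline_gbs(HBV5_BANDWIDTH_TBS, multiple)
    print(f"{platform}: roughly {baseline:.0f} GB/s implied")
```

Read this way, the comparison says that the other platforms deliver memory bandwidth in the hundreds of GB/s, while one HBv5 VM delivers several TB/s.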

Improvements and developments in HPC throughout the technology stack

Memory bandwidth is the headline feature of Azure HBv5, but Microsoft and AMD have also jointly engineered improvements across the platform to give customers a VM that is secure, balanced, customer-configurable, and exceptionally performant across a wide range of HPC workloads.

Each Azure HBv5 VM will feature:

  • 6.9 TB/s of memory bandwidth (STREAM Triad) across 400–450 GB of RAM (HBM3)
  • Up to 9 GB of memory per core (customer configurable)
  • Up to 352 AMD EPYC "Zen4" CPU cores, with peak frequencies of up to 4 GHz (customer configurable)
  • Higher Infinity Fabric bandwidth between CPUs than previous AMD EPYC server platforms
  • SMT disabled and a single-tenant design (1 VM per server)
  • 800 Gb/s of NVIDIA Quantum-2 InfiniBand, balanced as 200 Gb/s per CPU SoC
  • Azure VMSS Flex support, so MPI workloads can scale to hundreds of thousands of HBM-powered CPU cores
  • Second-generation Azure Boost NIC with 160 Gbps of Azure Accelerated Networking
  • 14 TB of local NVMe SSD with up to 50 GB/s of read and 30 GB/s of write bandwidth
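The figures in this list can be combined into a few balance ratios that HPC users often look at. The sketch below is simple arithmetic on the quoted numbers, using the upper end of each range. The constant names are illustrative, and the last calculation assumes that 9 GB per core is reached by exposing fewer cores over the same memory pool, which the announcement does not spell out.

```python
# Per-VM figures from the specification list (upper bounds where a range is given).
MEMORY_BANDWIDTH_GBS = 6900  # 6.9 TB/s STREAM Triad
MEMORY_GB = 450              # upper end of the 400-450 GB HBM3 range
MAX_CORES = 352
CPU_SOCS = 4                 # four custom EPYC processors per VM
INFINIBAND_GBPS = 800
MAX_GB_PER_CORE = 9

# Memory bandwidth available to each core when all 352 cores are active.
bandwidth_per_core_gbs = MEMORY_BANDWIDTH_GBS / MAX_CORES

# Memory capacity per core when all 352 cores are active.
memory_per_core_gb = MEMORY_GB / MAX_CORES

# InfiniBand bandwidth per CPU SoC (the list states 200 Gb/s).
infiniband_per_soc_gbps = INFINIBAND_GBPS / CPU_SOCS

# ASSUMPTION: 9 GB per core is obtained by enabling fewer cores over the
# same memory pool; this is the core count at which that ratio is reached.
cores_at_max_memory_per_core = MEMORY_GB // MAX_GB_PER_CORE

print(f"Bandwidth per core: {bandwidth_per_core_gbs:.1f} GB/s")
print(f"Memory per core (all cores): {memory_per_core_gb:.2f} GB")
print(f"InfiniBand per SoC: {infiniband_per_soc_gbps:.0f} Gb/s")
print(f"Cores at 9 GB per core: {cores_at_max_memory_per_core}")
```

With all cores active, each core has a little over 1 GB of HBM3 and close to 20 GB/s of memory bandwidth, which illustrates why the platform targets bandwidth-bound codes over capacity-bound ones.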

Sign up for the preview of the Azure HBv5 virtual machine

Sign-up is now open for the Azure HBv5 preview, which will begin in the first half of 2025. To see Azure HBv5 and other Azure supercomputing products, visit Microsoft Azure at booth #1905 at Supercomputing 2024, taking place November 19–22 in Atlanta, Georgia. Experts will be on hand to discuss how this VM can support your HPC workloads.

Microsoft uses NVIDIA Blackwell to power the next level of AI supercomputing

I'm pleased to announce the first cloud private preview of the Azure ND GB200 V6 VM series, which is based on the NVIDIA accelerated computing architecture. This latest VM is powered by the NVIDIA GB200 Grace Blackwell Superchip, which combines NVIDIA Grace CPUs with NVIDIA Blackwell GPUs to deliver the AI supercomputing capability needed to train state-of-the-art frontier models and accelerate generative inferencing.

The Azure ND GB200 V6 VM series is built on Microsoft's custom-designed server with NVIDIA Blackwell, which houses two GB200 Grace Blackwell Superchips. Each GB200 Superchip connects one Grace CPU to two high-performance Blackwell GPUs through the NVIDIA NVLink-C2C interface. NVLink-C2C gives applications fast, coherent access to a unified memory space, which simplifies programming and meets the high-speed memory requirements of next-generation trillion-parameter large language models (LLMs).

Microsoft's ND GB200 V6 VMs scale up to 18 compute servers through NVIDIA NVLink Switch trays, placing up to 72 Blackwell GPUs in a single NVLink domain. Connected by the latest NVIDIA Quantum InfiniBand, these VMs can then scale out to tens of thousands of GPUs for unprecedented AI supercomputing performance.
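The 72-GPU figure follows directly from the server layout described here: 18 servers, two GB200 Superchips per server, and two Blackwell GPUs plus one Grace CPU per Superchip. The sketch below models that hierarchy and estimates how many NVLink domains a larger cluster would need. The class and function names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NVLinkDomain:
    """One NVLink domain, using the counts stated in the announcement."""

    servers: int = 18
    superchips_per_server: int = 2
    gpus_per_superchip: int = 2   # Blackwell GPUs
    cpus_per_superchip: int = 1   # Grace CPU

    @property
    def superchips(self) -> int:
        return self.servers * self.superchips_per_server

    @property
    def gpus(self) -> int:
        return self.superchips * self.gpus_per_superchip

    @property
    def cpus(self) -> int:
        return self.superchips * self.cpus_per_superchip


def domains_needed(total_gpus: int, domain: NVLinkDomain) -> int:
    """Return how many NVLink domains are needed to host total_gpus."""
    return -(-total_gpus // domain.gpus)  # ceiling division


domain = NVLinkDomain()
print(f"GPUs per NVLink domain: {domain.gpus}")
print(f"Grace CPUs per NVLink domain: {domain.cpus}")
print(f"Domains for 10,000 GPUs: {domains_needed(10_000, domain)}")
```

Within a domain, GPUs communicate over NVLink; the scale-out to tens of thousands of GPUs corresponds to linking many such domains over InfiniBand.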

Every Microsoft server with NVIDIA Blackwell will ship with the latest version of Azure Boost, a purpose-built solution that enhances the server virtualization stack for greater resilience, manageability, and security. Azure Boost improves storage performance, supports network rates of 200 Gbps, and delivers strong I/O performance for both CPUs and GPUs.

Because security is so important, every new Microsoft server will include an Azure Integrated Hardware Security Module (HSM). This module strengthens key management, without compromising security or performance, by ensuring that encryption and signing keys always remain inside the hardware security module. It is designed to meet the requirements for FIPS (Federal Information Processing Standards) 140-3 Level 3 certification.

Through improved reliability, scalability, and performance, these capabilities deliver exceptional value to customers. The ability to build and deploy sophisticated AI models faster and more efficiently will help commercial enterprises stay ahead of the curve and improve business outcomes. With Microsoft's optimized AI software stack and the state-of-the-art architecture of the latest Azure VM series with NVIDIA GB200 Superchips, Azure customers can confidently take on their most ambitious AI projects, whether they are developing complex neural networks or adapting existing models with their own datasets to make them more relevant to their business.

To facilitate co-validation and co-optimization, a limited private preview of Azure ND GB200 V6 virtual machines will be made available to chosen partners.
