AMD EPYC 9754 Compared with the NVIDIA Grace CPU Superchip

 

AMD EPYC 9754 Benchmark

AMD EPYC 4th Gen Outperforms NVIDIA Grace Superchip in Performance and Efficiency
Industry benchmarks indicate that single- and dual-socket 4th Gen AMD EPYC systems outperform 2P NVIDIA Grace CPU Superchip systems by ~2.00x to ~3.70x and have more than double the energy efficiency.

My previous blogs showed that 4th Gen AMD EPYC processors beat 5th Gen Intel Xeon Platinum and Ampere Altra Max M128-30 CPUs for critical applications. This one evaluates the performance and energy efficiency of 4th Gen AMD EPYC CPUs against the NVIDIA Grace CPU Superchip.

Continuous innovation drives 4th Gen AMD EPYC CPUs to create new standards in datacenter performance, power efficiency, security, and TCO. The 4th Gen AMD EPYC processor range provides cutting-edge on-premises and cloud-based solutions for today’s demanding, diverse workloads.

Over 250 server architectures and 800 cloud instances are supported by AMD EPYC. AMD EPYC processors have over 300 world records in commercial applications, technical computing, data management, data analytics, digital services, media and entertainment, and infrastructure solutions.

The Grace CPU Superchip from NVIDIA was released with eye-opening performance comparisons. These claims must be assessed carefully because the published benchmark results are sparse and lack key system configuration details. With comprehensive industry-standard benchmark publications, AMD continually shows superior performance and power efficiency.

AMD EPYC processors have 5927 official SPEC CPU 2017 publications, while NVIDIA Grace has none. As you will see, 4th Gen AMD EPYC processors outperform NVIDIA Grace CPUs in both energy efficiency and performance.

Please note that this blog covers only a portion of the tested workloads. The ARM-based NVIDIA Grace could run only a limited number of workloads because of compatibility issues with software built for the x86 architecture, which underpins enterprise, cloud-native, and HPC applications.

AMD tested several systems with single-socket and dual-socket AMD EPYC 9754 (code name “Bergamo,” with 128 cores and 256 threads/vCPUs), dual-socket AMD EPYC 9654 (code name “Genoa,” with 96 cores and 192 threads/vCPUs), and NVIDIA Grace processors. Unless otherwise stated, each AMD EPYC system had 12 × 64GB DDR5-4800 memory per socket. The NVIDIA system used 480 GB of LPDDR5X-8532 memory, the highest capacity the server supports.

Power Efficiency

Modern data centers must handle rising demand while optimizing electricity use to cut costs and promote sustainability. The Standard Performance Evaluation Corporation (SPEC) SPECpower_ssj 2008 benchmark compares the energy efficiency of volume server-class computers by measuring both the power and the performance of the System Under Test.

Both single- and dual-socket AMD EPYC 9754 systems beat NVIDIA Grace systems by ~2.50x and ~2.75x, respectively (Figure 1). Moreover, a dual-socket AMD EPYC 9654 system outperformed the NVIDIA system by ~2.27x in the same tests.
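As a rough illustration of what an overall efficiency score of this kind represents, the Python sketch below sums throughput over the graduated target loads and divides by the summed average power, including the active-idle interval. The load-level figures are made-up placeholders chosen for easy arithmetic, not measurements from any system in this blog, and the function name is hypothetical.

```python
def overall_ops_per_watt(levels, active_idle_watts):
    """Overall efficiency: total ops across load levels / total average watts.

    levels: list of (ssj_ops, average_watts) tuples, one per target load.
    active_idle_watts: average power with no load; it adds power but no ops.
    """
    total_ops = sum(ops for ops, _ in levels)
    total_watts = sum(watts for _, watts in levels) + active_idle_watts
    return total_ops / total_watts


# Placeholder data (assumption): ten target loads from 100% down to 10%.
levels = [(1_000_000 * pct // 100, 200 + 3 * pct) for pct in range(100, 0, -10)]
active_idle = 150

score = overall_ops_per_watt(levels, active_idle)
print(f"overall ops per watt: {score:.1f}")
```

Because idle and low-load power count in the denominator, a system that draws little power when lightly loaded scores better than its peak throughput alone would suggest.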

General-purpose computing

SPEC created the SPEC CPU 2017 benchmark suite to test computer system performance. A leading industry standard for evaluating general-purpose computing infrastructure, the SPECrate 2017_int_base rating measures integer throughput.

Single- and dual-socket AMD EPYC 9754 systems outscored NVIDIA Grace systems by ~1.33x and ~2.64x, respectively. Additionally, a dual-socket AMD EPYC 9654 system outperformed the same NVIDIA system by ~2.43x in the same tests.
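Because both ratios above are relative to the same NVIDIA Grace baseline, dividing one by the other gives a rough sense of how the EPYC 9754 scales from one socket to two. The short Python sketch below does that arithmetic; the function name is hypothetical, and the result is only approximate because the published ratios are themselves rounded.

```python
def socket_scaling(uplift_1p, uplift_2p):
    """Return 2P-over-1P scaling when both uplifts share one baseline."""
    if uplift_1p <= 0:
        raise ValueError("uplift_1p must be positive")
    return uplift_2p / uplift_1p


# Ratios quoted above for SPECrate 2017_int_base (EPYC 9754 vs. NVIDIA Grace).
specrate_1p, specrate_2p = 1.33, 2.64

scaling = socket_scaling(specrate_1p, specrate_2p)
print(f"2P vs. 1P EPYC 9754 scaling: ~{scaling:.2f}x")
```

The same division can be applied to the single- and dual-socket ratios reported for the other workloads in this blog.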

Server-side Java

4th Gen AMD EPYC CPUs run cloud-native workloads without compromise or costly architectural changes. Java is used everywhere in enterprise and cloud contexts. The SPECjbb 2015 benchmark evaluates Java-based application performance on server-class hardware by modeling an e-commerce company whose IT infrastructure handles point-of-sale requests, online purchases, and data-mining operations.

Single- and dual-socket AMD EPYC 9754 computers outscored NVIDIA Grace by ~1.81x and ~3.58x, respectively. Additionally, the dual-socket AMD EPYC 9654 system outperformed the NVIDIA system by ~3.36x in SPECjbb2015-MultiJVM max-jOPS tests.
The AMD EPYC systems ran SUSE Linux Enterprise Server 15 SP4 (kernel v5.14.21) with Java SE 21.0 on the 9654 and Java SE 17.0 LTS on the 9754. The NVIDIA Grace system ran Ubuntu 22.04.4 (kernel v5.15.0-105-generic) and Java SE 22.0.

Transactional Databases

MySQL is a popular open-source relational database in enterprise and cloud environments. AMD assessed online transaction processing using HammerDB TPROC-C. The HammerDB TPROC-C workload is derived from the TPC-C Benchmark Standard, but it does not comply with that standard, so its results are not comparable to published TPC-C results.

Single- and dual-socket AMD EPYC 9754 computers outscored NVIDIA Grace by ~1.58x and ~2.16x, respectively. Additionally, the dual-socket AMD EPYC 9654 system outperformed the NVIDIA system by ~2.17x in the same tests.

Ubuntu 22.04, MySQL 8.0.37, and HammerDB 4.4 were installed on the test systems. Multiple 16-core VMs were on each system. Three test runs’ medians were compared.

Decision Support Systems

Decision Support System deployments use MySQL extensively. AMD evaluated Decision Support System performance with HammerDB TPROC-H. HammerDB TPROC-H workload results do not comply with the TPC-H Benchmark Standard, so they cannot be compared to published TPC-H results.

Single- and dual-socket AMD EPYC 9754 systems outscored NVIDIA Grace systems by ~1.42x and ~2.98x, respectively. Additionally, the dual-socket AMD EPYC 9654 system outperformed the same NVIDIA system by ~2.62x in the same tests.

Ubuntu 22.04, MySQL 8.0.37, and HammerDB 4.4 were installed on the test systems. Multiple 16-core VMs were on each system. Three test runs’ medians were compared.

Web Server

NGINX is a flexible web server that can also act as a reverse proxy, load balancer, mail proxy, and HTTP cache, and it can be tailored to the content being served and the clients requesting it. For performance and security, NGINX can run as a standalone web server or as a reverse proxy. AMD used the WRK HTTP benchmarking tool to generate heavy HTTP loads.

In Figure 6, single- and dual-socket AMD EPYC 9754 computers outscored NVIDIA Grace by ~1.27x and ~2.56x, respectively. Additionally, the dual-socket AMD EPYC 9654 system outperformed the NVIDIA system by ~1.89x in the same tests.

The server and client ran on the same system in this benchmark to remove network latency and isolate CPU processing power. The systems ran Ubuntu 22.04 and NGINX 1.18.0. Multiple 8-core instances ran on each system. Each run exercised the workload for 90 seconds, and the median requests per second (rps) from 3 runs per platform was used to compare performance.
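The comparison method described above can be sketched in a few lines of Python. This is one plausible reading of the procedure, stated as an assumption: per-instance rps values are summed within each run, and the median of the three run totals becomes the platform score. All numbers and names below are hypothetical placeholders, not results from the tested systems.

```python
from statistics import median


def platform_score(runs):
    """Median over runs of the summed per-instance requests per second.

    runs: list of runs, each a list with one rps value per instance.
    """
    run_totals = [sum(instance_rps) for instance_rps in runs]
    return median(run_totals)


# Hypothetical data: three runs, four instances per run.
system_a_runs = [
    [100.0, 102.0, 98.0, 101.0],
    [99.0, 101.0, 100.0, 100.0],
    [103.0, 100.0, 99.0, 102.0],
]
system_b_runs = [
    [50.0, 51.0, 49.0, 50.0],
    [52.0, 50.0, 50.0, 50.0],
    [49.0, 50.0, 50.0, 50.0],
]

uplift = platform_score(system_a_runs) / platform_score(system_b_runs)
print(f"relative performance: ~{uplift:.2f}x")
```

Using the median rather than the mean keeps a single unusually fast or slow run from skewing the comparison.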

In-memory analytics

Redis is a powerful in-memory distributed key-value database, cache, and message broker with optional persistence. AMD used redis-benchmark to test Redis servers.

Single- and dual-socket AMD EPYC 9754 beat NVIDIA Grace by ~1.15x and ~2.29x, respectively (Figure 7). Additionally, the dual-socket AMD EPYC 9654 system outperformed the NVIDIA system by ~1.54x in the same tests.

Ubuntu 22.04, Redis 7.0.11, and redis-benchmark 7.2.3 were installed. Each client established 512 GET/SET connections with 1000-byte keys to its Redis server. Multiple 8-core instances ran on each system. Three runs of the workload test ran 10 million requests on each system, and the median requests per second (rps) statistics were aggregated to compare performance.
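For readers who want to try a similar load, the Python sketch below assembles one redis-benchmark command line per server instance using the parameters listed above. It is a sketch under assumptions, not the exact invocation AMD used: the host, the port numbers, and the instance count are hypothetical, and mapping the 1000-byte figure to the `-d` payload-size option is a guess.

```python
import shlex


def redis_benchmark_cmd(port, clients=512, requests=10_000_000, data_bytes=1000):
    """Build the argument list for one redis-benchmark client."""
    return [
        "redis-benchmark",
        "-h", "127.0.0.1",        # assumption: client and server share a host
        "-p", str(port),          # hypothetical port, one per Redis instance
        "-c", str(clients),       # parallel connections
        "-n", str(requests),      # total requests
        "-d", str(data_bytes),    # payload size in bytes for SET/GET
        "-t", "get,set",          # run only the GET and SET tests
        "-q",                     # quiet output: one summary line per test
    ]


# Hypothetical layout: two Redis instances on consecutive ports.
cmds = [redis_benchmark_cmd(6379 + i) for i in range(2)]
for cmd in cmds:
    print(shlex.join(cmd))
```

Each printed line can be pasted into a shell, or the argument list can be handed to `subprocess.run` once a Redis server is listening on the matching port.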

Cache Tier

Memcached is a high-performance, distributed in-memory caching system that stores key-value pairs for small chunks of arbitrary data such as strings or objects. It is typically used to cache rendered pages and the results of database or API calls. AMD used the popular memtier benchmarking tool to evaluate latency and throughput improvements.

Single- and dual-socket AMD EPYC 9754 systems outscored NVIDIA Grace by ~1.16x and ~2.26x, respectively. Additionally, the dual-socket AMD EPYC 9654 system outperformed the NVIDIA system by ~1.97x in the same tests.

The computers ran Ubuntu v22.04, Memcached v1.6.14, and memtier v1.4.0. Each memtier client had 10 connections, 8 pipelines, and a 1:10 SET/GET ratio with its Memcached server. Multiple instances with 8 cores ran on each system. To compare performance, the workload test processed 10 million requests on each system and averaged the median requests per second (rps) from three runs per platform.
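The client settings above map onto memtier_benchmark options roughly as shown in this Python sketch. The server address, port, thread count, and the choice of the text protocol are assumptions for illustration; only the connection count, pipeline depth, and SET/GET ratio come from the description above.

```python
import shlex


def memtier_cmd(port, threads=8):
    """Build the argument list for one memtier_benchmark client.

    threads is an assumed value; the text above does not state it.
    """
    return [
        "memtier_benchmark",
        "--server=127.0.0.1",         # assumption: local Memcached instance
        f"--port={port}",             # hypothetical port for this instance
        "--protocol=memcache_text",   # assumption: text protocol
        f"--threads={threads}",
        "--clients=10",               # connections per thread
        "--pipeline=8",               # requests in flight per connection
        "--ratio=1:10",               # SET:GET ratio
    ]


cmd = memtier_cmd(11211)
print(shlex.join(cmd))
```

The request count is left at the tool default here because the text does not say how the 10 million requests were divided among clients.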

High-performance computing

HPC impacts every sector of our lives where performance is critical, from manufacturing to life sciences. ARM processors excel at lighter workloads but struggle with HPC and crucial data-centric tasks. Some applications have been ported to ARM, but they lack the advanced features of x86 processors that increase HPC performance. Another factor is memory capacity: 4th Gen AMD EPYC CPUs can support 3 TB, whereas ARM-based NVIDIA Grace can only support 480 GB.

Compiling HPC apps for ARM processors and diagnosing runtime errors is difficult. When testing common open-source HPC workloads for fast turnaround times, AMD engineers encountered the following issues:

  • Runtime issues occur despite source code updates in NAMD.
  • GROMACS compiles incorrectly and requires source adjustment.
  • OpenRadioss fails to compile and needs Pull Requests and adjustments for ARM instructions.
  • WRF and OpenFOAM dependencies do not compile.

AMD engineers easily compiled and tested these open-source HPC workloads:

  • HPL needs modest cmake adjustments to compile and run with the ARM performance math library.
  • ScaLAPACK is not included in the ARM performance math library, yet Quantum ESPRESSO compiles and runs.

These issues highlight the importance of the x86 architecture compatibility that AMD EPYC processors maintain across generations. The following sections compare performance on these two workloads.

High-performance Linpack

High Performance Linpack (HPL) measures the floating-point performance of HPC clusters. This portable implementation tests a system’s capacity to solve a dense system of linear equations of a given order using Gaussian elimination. AMD ran the HPLinpack 2.3 benchmark.
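To show the kind of computation being timed, here is a minimal pure-Python sketch of Gaussian elimination with partial pivoting on a tiny system, together with the operation count conventionally credited for a solve of order n (2/3 n^3 + 3/2 n^2). It is a teaching example only: the real benchmark uses a blocked, distributed LU factorization on top of an optimized math library, and the function names here are hypothetical.

```python
def solve_dense(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    # Work on an augmented copy so the inputs stay untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: pick the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        # Eliminate the entries below the pivot.
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def linpack_gflops(n, seconds):
    """GFLOPS for an order-n solve using the 2/3 n^3 + 3/2 n^2 count."""
    return (2.0 / 3.0 * n**3 + 1.5 * n**2) / seconds / 1e9


a = [[2.0, 1.0, -1.0], [-3.0, -1.0, 2.0], [-2.0, 1.0, 2.0]]
b = [8.0, -11.0, -3.0]
x = solve_dense(a, b)
print([round(v, 6) for v in x])
```

Since the work grows with the cube of the problem order while the data grows with its square, larger memory allows larger problems that keep the floating-point units busy for longer.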

Figure 9 shows that the dual-socket AMD EPYC 9754 machine outperformed NVIDIA Grace by ~2.34x. Additionally, a dual-socket AMD EPYC 9654 system outperformed the NVIDIA system by ~1.97x in the same tests.

NVIDIA Grace runs Red Hat Enterprise Linux 9.4 with kernel 5.14.0-427.18.1.el9_4.aarch64+64k and has 480GB of LPDDR5X-8532 memory, while AMD EPYC 9754 and 9654 have 1.5 TB of DDR5-4800.

Quantum ESPRESSO

The open-source Quantum ESPRESSO suite performs electronic-structure calculations and materials modeling at the nanoscale using density-functional theory, plane waves, and pseudopotentials. AMD compared system performance using the Quantum ESPRESSO 7.0 ausurf benchmark.

The dual-socket AMD EPYC 9754 computer outscored NVIDIA Grace by ~4.08x. Additionally, a dual-socket AMD EPYC 9654 machine outperformed the NVIDIA system by ~3.46x in the same tests.

NVIDIA Grace runs Red Hat Enterprise Linux 9.4 with kernel 5.14.0-427.18.1.el9_4.aarch64+64k and has 480GB of LPDDR5X-8532 memory, while AMD EPYC 9754 and 9654 have 1.5 TB of DDR5-4800.

Video Encoding

From classic to modern, FFmpeg can encode, decode, convert, stream, filter, and play most video formats. Its format conversion, video resizing, editing, and seamless streaming make it popular.

Single- and dual-socket AMD EPYC 9754 systems outpaced NVIDIA Grace systems by ~1.60x and ~2.90x, respectively. Additionally, the dual-socket AMD EPYC 9654 system outperformed the NVIDIA system by ~2.38x in the same tests.


The systems ran Ubuntu 22.04 and FFmpeg 4.4.2. Each FFmpeg instance transcoded a 4K raw-video input file using the VP9 codec. Multiple 8-core instances ran on each system. Each system was evaluated by its median total frames processed per hour over three test runs.

AMD EPYC 9754 Price

The AMD EPYC 9754 processor cost around $11,900 at launch. Its pricing reflects its high-end server capabilities and its position at the top of the EPYC lineup (sources: CPU / processor comparison; Club386).

Conclusion

General-purpose AMD EPYC 9654 and, most critically, cloud-native AMD EPYC 9754 processors outperformed NVIDIA Grace in eleven foundational, database, and HPC workloads. These results and the statistics in my June blog show that 4th Gen AMD EPYC processors remain market leaders in performance and power efficiency.

With the upcoming release of 5th Gen AMD EPYC processors (codenamed “Turin”) with up to 192 cores per processor and a fully compatible x86 architecture that runs your current and future workloads, AMD will extend its performance and energy efficiency leads. Let’s keep this a secret until then.
