The AWS Parallel Computing Service for HPC workloads

PCS AWS

AWS launching AWS Parallel Computing Service (AWS PCS), a managed service that allows clients build up and maintain HPC clusters to execute simulations at nearly any scale on AWS. The Slurm scheduler lets them work in a familiar HPC environment without worrying about infrastructure, accelerating outcomes.

AWS Parallel Computing

Run HPC workloads effortlessly at any scale.

Why AWS PCS?

AWS Parallel Computing Service (AWS PCS) is a managed service that simplifies HPC workloads and Slurm-based scientific and engineering model development on AWS. PCS AWS lets you create elastic computing, storage, networking, and visualization environments. Managed updates and built-in observability features make cluster management easier with AWS PCS. You may focus on research and innovation in a comfortable environment without worrying about infrastructure.

Benefits

Focus on labor, not infrastructure

Give users comprehensive HPC environments that scale to run simulations and scientific and engineering modeling without code or script changes to boost productivity.

Manage, secure, and scale HPC clusters

Build and deploy scalable, dependable, and secure HPC clusters via the AWS Management Console, CLI, or SDK.

HPC solutions using flexible building blocks

Build and maintain end-to-end HPC applications on AWS using highly available cluster APIs and infrastructure as code.

Use cases

Tightly connected tasks

At almost any scale, run concurrent MPI applications like CAE, weather and climate modeling, and seismic and reservoir simulation efficiently.

Faster computing

GPUs, FPGAs, and Amazon-custom silicon like AWS Trainium and AWS Inferentia can speed up varied workloads like creating scientific and engineering models, protein structure prediction, and Cryo-EM.

Computing at high speed and loosely linked workloads

Distributed applications like Monte Carlo simulations, image processing, and genomics research can run on AWS at any scale.

Workflows that interact

Use human-in-the-loop operations to prepare inputs, run simulations, visualize and evaluate results in real time, and modify additional trials.

AWS ParallelCluster

In November 2018, AWS launched AWS ParallelCluster, an AWS-supported open-source cluster management tool for AWS Cloud HPC cluster deployment and maintenance. Customers can quickly design and deploy proof of concept and production HPC computation systems with AWS ParallelCluster. Open-source AWS ParallelCluster Command-Line interface, API, Python library, and user interface are available. Updates may include cluster removal and reinstallation. To eliminate HPC environment building and operation chores, many clients have requested a completely managed AWS solution.

AWS Parallel Computing Service (AWS PCS)

PCS AWS simplifies AWS-managed HPC setups via the AWS Management Console, SDK, and CLI. Your system administrators can establish managed Slurm clusters using their computing, storage, identity, and job allocation preferences. AWS PCS schedules and orchestrates simulations using Slurm, a scalable, fault-tolerant work scheduler utilized by many HPC clients. Scientists, researchers, and engineers can log into AWS PCS clusters to conduct HPC jobs, use interactive software on virtual desktops, and access data. Their workloads can be swiftly moved to PCS AWS without code porting.

Fully controlled NICE DCV remote desktops allow specialists to manage HPC operations in one place by accessing task telemetry or application logs and remote visualization.

PCS AWS uses familiar methods for preparing, executing, and analyzing simulations and computations for a wide range of traditional and emerging, compute or data-intensive engineering and scientific workloads in computational reservoir simulations, electronic design automation, finite element analysis, fluid dynamics, and weather modeling.

Starting AWS Parallel Computing Service

AWS documentation article for constructing a basic cluster lets you try AWS PCS. First, construct a VPC with an AWS CloudFormation template and shared storage in Amazon EFS in your account for the AWS Region where you will try PCS AWS. AWS literature explains how to create a VPC and shared storage.

Cluster

Select Create cluster in the PCS AWS console to manage resources and run workloads.

Name your cluster and select your Slurm scheduler controller size. Cluster workload limits are Small (32 nodes, 256 jobs), Medium (512 nodes, 8,192 tasks), and Large (2,048 nodes, 16,384 jobs). Select your VPC, cluster launch subnet, and cluster security group in Networking.

A resource selection method parameter, an idle duration before compute nodes scale down, and a Prolog and Epilog scripts directory on launched compute nodes are optional Slurm configurations.

Create cluster. Provisioning the cluster takes time.

Form compute node groupings

After constructing your cluster, you can create compute node groups, a virtual grouping of Amazon EC2 instances used by PCS AWS to enable interactive access to a cluster or perform processes in it. You define EC2 instance types, minimum and maximum instance counts, target VPC subnets, Amazon Machine Image (AMI), purchasing option, and custom launch settings when defining a compute node group. Compute node groups need an instance profile to pass an AWS IAM role to an EC2 instance and an EC2 launch template for AWS PCS to configure EC2 instances.

Select the Compute node groups tab and the Create button in your cluster to create a compute node group in the console.

End users can login to a compute node group, and HPC jobs run on a job node group.

Use a compute node name and a previously prepared EC2 launch template, IAM instance profile, and subnets to launch compute nodes in your cluster VPC for HPC jobs.

Next, select your chosen EC2 instance types for compute node launches and the scaling minimum and maximum instance count.

Select Create. Provisioning the computing node group takes time.

Build and run HPC jobs

After building compute node groups, queue a job to run. Job queued until PCS AWS schedules it on a compute node group based on provisioned capacity. Each queue has one or more computing node groups that supply EC2 instances for processing.

Visit your cluster, select Queues, and click Create queue to create a queue in the console.

Select Create and wait for queue creation.

AWS Systems Manager can connect to the EC2 instance it creates when the login compute node group is active. Select your login compute node group EC2 instance in the Amazon EC2 console. The AWS manual describes how to create a queue to submit and manage jobs and connect to your cluster.

Create a submission script with job requirements and submit it to a queue with the sbatch command to perform a Slurm job. This is usually done from a shared directory so login and compute nodes can access files together.

Slurm may perform MPI jobs in PCS AWS. See AWS documents Run a single-node job with Slurm or Run a multi-node MPI task with Slurm for details.

Visualize with a fully managed NICE DCV remote desktop. Start with the HPC Recipes for AWS GitHub CloudFormation template.

After HPC jobs using your cluster and node groups, erase your resources to minimize needless expenses. See AWS documentation Delete your AWS resources for details.

Know something

Some things to know about this feature:

Slurm versions – AWS PCS initially supports Slurm 23.11 and enables tools to upgrade major versions when new versions are added. AWS PCS also automatically patches the Slurm controller.

On-Demand Capacity Reservations let you reserve EC2 capacity in a certain Availability Zone and duration to ensure you have compute capacity when you need it.

Network file systems—Amazon FSx for NetApp ONTAP, OpenZFS, File Cache, EFS, and Lustre can be attached to write and access data and files. Self-managed volumes like NFS servers are possible.

Now available

US East (N. Virginia), US East (Ohio), US West (Oregon), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Frankfurt), Europe (Ireland), and Europe (Stockholm) now provide AWS Parallel Computing Service.