Migrating HPC Workloads to the Cloud

While public clouds have been used for HPC for at least the past decade, several key challenges have traditionally slowed the adoption of cloud computing for HPC:

Slow, high-latency interconnects
High virtualization overhead and poor isolation from other workloads
Data staging challenges and costs (getting large data sets into and out of the cloud)
Moving from always-on, multi-user batch environments to on-demand cloud resources

Recently, the HPC offerings of public cloud vendors have begun to mature into viable platforms that can support virtually any workload, whether traditional MPI-based HPC applications or machine learning/AI workflows. A few key developments mark a sea change in cloud HPC capabilities:

Low-latency interconnects: The availability of InfiniBand or accelerated Ethernet options is a game-changer for running latency-sensitive distributed applications.

Whole-node virtualization: Cloud providers now offer virtual machine instances that have exclusive use of the entire physical machine and incur less virtualization overhead. In some cases, special CPUs offload virtualization onto dedicated cores, leaving the full core count to the operating system and user workloads. Current offerings also include compute nodes with GPU resources, which are frequently used for machine learning.

Node affinity: For multi-node distributed applications, node affinity, or colocation of the compute nodes both topologically (on the same network subnet) and physically (e.g., in the same rack), is important for achieving low-latency communication. Isolation from other workloads and network traffic is also desirable, as one slow connection can have a dramatic impact on application performance. Major cloud vendors have HPC-friendly virtual cluster deployment features that help enforce node affinity.

High-performance storage options: The availability of high-performance compute, disk, and interconnects in the cloud enables roll-your-own HPC storage as an option. However, several vendors offer preconfigured high-performance storage using Lustre, IBM Spectrum Scale, BeeGFS, or other parallel file systems. Some vendors offer “storage as a service” options that can dynamically create a given amount of high-performance storage on demand, while enabling Hierarchical Storage Management (HSM) features to back it up to a less expensive persistent file or object storage tier in the cloud.

Easy deployment: Several vendors have templates for deploying virtual clusters with a preconfigured workload manager (such as Slurm) in a single command or click. This approach deploys an environment familiar to HPC users, easing adoption through familiarity and portability. Coordination between the workload manager and the virtualization back-end can automatically provision or deprovision resources based on demand.

Flexible job submission options: Vendors offer SSH access to a “head node” in the cloud, APIs to launch and interact with jobs from one’s laptop or workstation, or “cloudbursting,” where users submit jobs to a local compute cluster that forwards the job to cloud resources as needed.

These developments broaden the array of workloads that may successfully be run in the cloud, and make cloud resources much easier to utilize.
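
To make the demand-based provisioning mentioned under “Easy deployment” more concrete, here is a minimal Python sketch of the kind of scaling decision such an integration might make. It is not any vendor's API or Slurm's own logic; the function name `nodes_to_change`, its parameters, and the policy itself are hypothetical and chosen only for illustration.

```python
import math


def nodes_to_change(pending_cores, cores_per_node, idle_nodes, max_new_nodes):
    """Return a positive count of nodes to provision, a negative count of
    idle nodes to deprovision, or 0 to leave the cluster unchanged.

    Hypothetical policy for illustration; real integrations also weigh
    boot time, minimum billing periods, and job priorities.
    """
    # Nodes needed to run everything currently waiting in the queue.
    needed = math.ceil(pending_cores / cores_per_node)

    if needed > idle_nodes:
        # Demand exceeds idle capacity: grow, but respect the cap.
        return min(needed - idle_nodes, max_new_nodes)

    # Idle capacity meets or exceeds demand: release any surplus nodes.
    return -(idle_nodes - needed)


# Queue wants 200 cores, nodes have 64 cores each, 1 node is idle.
print(nodes_to_change(pending_cores=200, cores_per_node=64,
                      idle_nodes=1, max_new_nodes=8))
# Queue is empty and 3 nodes are idle.
print(nodes_to_change(pending_cores=0, cores_per_node=64,
                      idle_nodes=3, max_new_nodes=8))
```

The cap on new nodes stands in for a budget or quota limit, which is usually what keeps automatic provisioning from running up unexpected charges.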

Cloud options for HPC

Several models exist for utilizing cloud resources for HPC:

Long-term deployment: Instantiate compute and storage resources and leave them available for users over the course of a project or period of peak use. While this simple approach is most like a traditional cluster, it may not be cost-efficient, as in many scenarios one pays by the hour for resources, whether they are used or not.

Per-job deployment: Submit a job to run in the cloud through an API or Web interface, and the required compute, storage, and networking resources will automatically be provisioned and configured for running it. While this approach helps to reduce cloud charges for idle periods, it is more complicated to set up so that it works seamlessly and efficiently, and it adds significant startup and tear-down time to each job (time that also incurs cloud charges).

Cloudbursting: In a traditional HPC batch environment, one can “burst” workload to cloud resources when demand requires additional compute capacity. In this case, the local workload manager makes an API call to dynamically provision virtual instances and storage in the cloud, runs the job in the cloud, and retrieves results back to the local cluster. Virtual resources that are no longer needed can be automatically deprovisioned to reduce costs.
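
The cost trade-off between the long-term and per-job models can be roughed out with simple arithmetic. The sketch below assumes flat per-node-hour billing and uses made-up rates and job counts; the function names and every number are hypothetical, and real pricing (reserved or spot discounts, storage, data egress) would change the result.

```python
def long_term_cost(hourly_rate, nodes, hours_deployed):
    """Cost of leaving nodes running for the whole period, used or not."""
    return hourly_rate * nodes * hours_deployed


def per_job_cost(hourly_rate, nodes, job_hours, n_jobs, overhead_hours):
    """Cost when nodes exist only while a job runs, including the billed
    startup and tear-down time (overhead_hours) added to every job."""
    return hourly_rate * nodes * n_jobs * (job_hours + overhead_hours)


# Hypothetical inputs: 4 nodes at 3.00 per node-hour over a 30-day month,
# running 20 jobs of 6 hours each with 0.5 hours of setup and tear-down.
always_on = long_term_cost(3.00, 4, 30 * 24)
on_demand = per_job_cost(3.00, 4, 6, 20, 0.5)
print(f"long-term: {always_on:.2f}, per-job: {on_demand:.2f}")
```

With these assumed inputs the cluster is busy for only a fraction of the month, so per-job provisioning comes out far cheaper. As utilization approaches 100%, the per-job overhead makes the long-term model the cheaper of the two.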

Advantages and challenges of the cloud for HPC

Cloud computing’s advantages for HPC are similar to those of other use cases:

Avoid impact of test and development work on production, mission-critical workloads.
Provide temporary capacity for peak workload periods that do not justify a capital outlay.
Always have access to the “latest and greatest” processor models.
Test and debug HPC applications on specific hardware and software configurations without needing to own them or make a permanent investment.
Test out new processors, GPUs, or storage options before buying.
Facilitate collaboration with outside development resources without providing or requiring the security credentials necessary to access internal resources.

Although much progress has been made in supporting HPC in the cloud, key challenges remain, particularly the availability of low-latency interconnects. In some cases, custom and updated software stacks are required to utilize a vendor’s high-performance networking option. Also, users may be accustomed to always-available physical infrastructure and data storage, so job submission, workflows, and data staging may need to be rethought in a cloud environment.

Considerations

To determine how public cloud providers may best be used to support your HPC mission, a number of questions and considerations must be taken into account, including:

What applications can or should be moved to the cloud?
Which utilization model best fits your use case and needs?
What level of storage and compute interconnect performance is required?
Data staging considerations: how will input data be made available to jobs? What high-performance storage option is best for your needs?
Storage and data management: what data should you keep in the cloud, and how do you store it economically?

Also, to what extent can cloud providers partially or completely replace private “on-prem” resources? In general, it is often cheaper to own than to rent over the long term, but there can also be significant lead time, facilities, and budgeting issues involved when deploying on-prem solutions. Deciding how much of your workload can move to the cloud, when this is appropriate, and how it will be achieved depends on a complex mixture of cost, performance, and feasibility considerations.
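
As a rough illustration of the own-versus-rent question, the following Python sketch estimates a break-even point. It is a deliberately simplified model with hypothetical names and figures: it ignores lead time, facilities, depreciation, and changes in utilization, all of which matter in a real comparison.

```python
def break_even_months(purchase_cost, monthly_operating_cost, monthly_cloud_cost):
    """Months after which owning becomes cheaper than renting.

    Returns None when renting costs no more per month than operating
    owned hardware, because the purchase is then never recovered.
    """
    monthly_saving = monthly_cloud_cost - monthly_operating_cost
    if monthly_saving <= 0:
        return None
    return purchase_cost / monthly_saving


# Hypothetical figures: 500,000 to buy, 5,000 per month to operate,
# versus 25,000 per month for equivalent cloud capacity.
print(break_even_months(500_000, 5_000, 25_000))
```

If the expected lifetime of the hardware or the project is shorter than the break-even period, renting is the cheaper option under these assumptions.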

When developing your strategy for a user-ready HPC deployment to the cloud, it is helpful to have someone with deep HPC and cloud experience to guide your decisions. If you are interested in transitioning HPC workloads to the cloud, reach out to us to see how we can help.