When the UK’s national weather forecaster, the Met Office, announced that its next supercomputer purchase would actually be a cloud subscription, the decision was controversial: the organization had recently settled out of court with its previous IT supplier over claims that an earlier procurement decision was unfair. What it chose was a traditional supercomputer from a non-traditional supplier: four Cray EX systems on Azure, so it could “spend less time buying supercomputers and more time utilizing them,” as Met Office IT Fellow Richard Lawrence phrased it.
As well as skipping the usual multi-year purchasing cycle, it’s buying the flexibility to make different choices about HPC workloads in the future far more quickly than if it were acquiring its own hardware. Moving HPC to the cloud is also an opportunity to break down internal silos in the way data is handled for analytics, modernizing practices that often date back two or more decades, Gartner suggests.
Pros and Cons of Cloud HPC
Offering access to Cray hardware gave public cloud providers credibility by demonstrating they understand what the most demanding subset of HPC users need. It also provided a stepping stone to the benefits of cloud HPC: running demanding scale-out workloads on clusters of Linux servers, with the more familiar advantages of cloud (flexibility, agility, and a low barrier to entry).
“The biggest thing is the ability to choose heterogeneous versus homogenous infrastructure,” Tracy Woo, senior analyst at Forrester, told Data Center Knowledge. The high cost of HPC infrastructure means most buyers pick a single brand so they can negotiate a deal, and every workload has to run on that hardware, whether the configuration is a good fit or not. “You use what you have even if it’s not what you specifically need. With public cloud, you have every single infrastructure [option] you could possibly want or need for your specific use case.”
Cloud HPC lets you specify exactly what the application needs, offering a mix of familiar Intel and AMD processors alongside less expensive Arm processors, with fast CPUs and GPUs, dense core counts and high memory per core. You also get access to hardware accelerators most organizations wouldn't have the budget or expertise for but can now experiment with easily.
“I'm no longer spending hundreds of thousands of dollars – or even millions in some cases – on infrastructure equipment,” Woo noted. “It just requires a credit card to run calculations or specific analytics that require high performance compute for only a few hours.” You can pick the right infrastructure for each workload or even each job and benchmark new hardware as it arrives on the market rather than waiting for your next refresh cycle.
But that flexibility can also be confusing, Woo warned, with ‘analysis paralysis’ induced by the multiplicity of choices – and an industry of tools and platforms springing up to try to help organizations make those choices.
“Administrators have enormous freedom of choice but must also have a deep understanding of the unique architecture of their cloud provider of choice,” agreed Timothy Costa, director for HPC and quantum products at NVIDIA. “For instance, they can combine a wide range of compute hardware on high-speed networks to optimize their infrastructure designs, but not all types of hardware are available in all regions.”
Determining Which HPC Workloads Are Better Suited to the Cloud
The proportion of HPC workloads running in the cloud doubled in 2019, from 10% to 20%, according to Hyperion Research; Gartner rates cloud HPC as a high-benefit option that’s only two to five years away from mainstream adoption.
Manufacturing and life sciences were the first to move HPC to the cloud and remain the fastest growing segments. These workloads tend to be “highly parallelized codes or job ensembles with a high tolerance for individual job failure and little concern for execution locality” Costa said, noting that finance, weather, aerospace, and government labs are increasing their cloud HPC use, as is higher education.
Cloud HPC is a particularly good option for “long-tail HPC workloads where performance versus cost is more critical than absolute runtime,” where you can take the time to set up cloud infrastructure that minimizes cost, or for code that “dramatically benefits from hardware not available on-prem,” Costa added.
Hyperscale data centers built for cloud IaaS prioritize different optimizations than HPC supercomputers: spreading VMs across the data center allows for resilience and failover, but HPC packs VMs closely together to get the fastest possible network connections for performance. As a result, cloud HPC has been best suited to loosely coupled, highly parallel workloads, and cloud networks will “easily satisfy the needs of ensemble or parameter sweep HPC workloads,” Costa said.
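The ensemble or parameter-sweep pattern Costa describes can be sketched in a few lines of Python: every job is independent, and the ensemble tolerates individual failures. The `simulate` function and its parameters are illustrative stand-ins, not a real HPC code, and in practice each job would fan out to its own node or VM rather than a local thread pool.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def simulate(params):
    """Illustrative stand-in for one simulation job; real ensembles run
    far heavier codes, but each job is still independent of the others."""
    viscosity, velocity = params
    if viscosity == 0:
        raise ValueError("degenerate parameter combination")
    return {"params": params, "reynolds": velocity / viscosity}

def sweep(grid, workers=8):
    """Run every parameter combination; tolerate individual job failures,
    which is what makes the ensemble loosely coupled and cloud-friendly."""
    results, failures = [], []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(simulate, p): p for p in grid}
        for fut in as_completed(futures):
            try:
                results.append(fut.result())
            except Exception:
                failures.append(futures[fut])  # record the failure, move on

    return results, failures

# A toy 3x2 grid; the combinations with viscosity 0.0 fail by design.
grid = [(v, u) for v in (0.0, 0.5, 1.0) for u in (1.0, 2.0)]
ok, failed = sweep(grid)
print(f"{len(ok)} jobs succeeded, {len(failed)} failed")  # 4 jobs succeeded, 2 failed
```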
Virtualized performance can be unfamiliar to those used to bare metal HPC, but a virtual supercomputer built on Azure was among the top ten fastest machines in the world in the November 2021 Top500 list, using entirely Hyper-V-based VMs. “Compute-optimized VMs [in cloud] provide near bare-metal performance with low jitter, cloud networks provide 200 Gbps+ bandwidth and <10µs latency and parallel file systems deliver data with TB/s speed,” Bill Magro, chief HPC technologist at Google Cloud, told us.
Common workloads include:
- Computer-aided engineering (such as fluid dynamics, combustion, crash safety, and structural mechanics)
- Electronic design automation
- Computational physics and chemistry
- Special effects rendering
- Quantitative analysis
- Risk analytics
Exploring Cloud Fabric Options
Some workloads require the consistently low latency of high-performance interconnects, which have been rare in the cloud. If your cloud provider doesn’t offer them, those workloads are better suited to your own infrastructure, Woo suggested. However, AI and cloud gaming workloads also benefit from high-speed interconnects, so high-speed fabrics are starting to arrive in the cloud. Azure offers the familiar HPC InfiniBand interconnect on all its H-series clusters (for CPU-based HPC) and most N-series clusters (for GPU-based HPC) available globally, while the AWS Elastic Fabric Adapter works with the Lustre parallel filesystem.
AWS recently extended its proprietary Elastic Network Adapters with a new network transport protocol designed to run on its custom Nitro network adapters as an alternative to InfiniBand: Elastic Network Adapter Express uses Scalable Reliable Datagram (SRD) instead of TCP in an attempt to turn the many network paths in multitenant data centers into an advantage rather than a limitation.
“The networking infrastructure has really held [cloud HPC] back and been something of a bottleneck, so this is something hyperscalers are really focused on right now,” Woo told us.
Keys to Understanding HPC Cloud Costs
With cloud HPC, you’re only using – and paying for – what you need, but what you pay per unit may be higher. While some estimates put cloud costs at as much as five times those of running your own infrastructure, the gap narrows to something much closer to parity if you use reserved or spot instances.
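The arithmetic behind that comparison is easy to sketch. The rates below are hypothetical placeholders (real prices vary by provider, region, instance type, and negotiated discounts), but they show how a typical spot discount moves the bill from several times on-prem cost toward parity:

```python
# All rates are hypothetical; check real provider price lists.
ON_PREM_PER_CORE_HOUR = 0.02    # amortized hardware, power, and staff
ON_DEMAND_PER_CORE_HOUR = 0.10  # ~5x on-prem, per the estimates above
SPOT_DISCOUNT = 0.75            # spot/preemptible capacity: often 60-90% off

def job_cost(core_hours, rate_per_core_hour):
    """Cost of a job at a flat per-core-hour rate."""
    return core_hours * rate_per_core_hour

campaign = 10_000  # core-hours for one simulation campaign
for label, rate in [("on-prem", ON_PREM_PER_CORE_HOUR),
                    ("on-demand", ON_DEMAND_PER_CORE_HOUR),
                    ("spot", ON_DEMAND_PER_CORE_HOUR * (1 - SPOT_DISCOUNT))]:
    print(f"{label}: ${job_cost(campaign, rate):,.0f}")
# on-prem: $200
# on-demand: $1,000
# spot: $250
```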
“While cloud flexibility helps minimize cost overall, the absolute per-unit cost of cloud-hosted resources is higher than on-prem,” Costa agreed. That means it makes sense to keep long-running HPC workloads that fully utilize on-prem resources where they are. On the other hand, “infrequently executed workloads requiring a large capacity of resources for a small duration can be far less expensive in the cloud, as opposed to setting up an on-prem environment,” Dori Exterman, CTO at Incredibuild, argued.
HPC cloud automation platform Rescale suggests many organizations can improve performance and cut costs for their workloads by using its benchmarks to pick the most suitable cloud hardware. The best fit for your workload can change quickly in cloud, but you also have the flexibility to switch – as long as you can stay on top of the options.
Unless you’re already implementing chargebacks or have clear policies on resource use, HPC users may assume on-prem infrastructure is effectively free. They may also struggle to estimate how long workloads will take to run or how many instances they will need. Those would be expensive habits to take to cloud HPC, so you need clear policies and guidelines on how to budget for workloads.
Cloud HPC can be particularly helpful for offloading smaller HPC jobs that are often stuck in resource management queues for a disproportionately long time, because HPC infrastructure is typically oversubscribed by large, long-running jobs taking up many resources. That’s good for utilization and ROI but frustrating for teams waiting for access: cloud HPC can help them meet deadlines or run larger and more complex simulations for better results.
If you’re adopting a hybrid HPC model that bursts to cloud, develop a framework to decide which jobs should move to the cloud and when.
For example, Hyperion Research’s cloud application assessment tool scores different workloads as more or less suitable to running in the cloud.
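A home-grown version of such a framework can start as a simple scoring function. The sketch below is purely illustrative: the job attributes and weights are assumptions for demonstration, not Hyperion Research’s actual methodology or calibrated values.

```python
from dataclasses import dataclass

@dataclass
class Job:
    core_hours: float        # total compute demand
    data_gb: float           # data moved to the cloud plus results moved back
    deadline_hours: float    # how soon results are needed
    queue_wait_hours: float  # estimated wait in the on-prem queue

def burst_score(job, w_urgency=2.0, w_size=0.01, w_gravity=0.05):
    """Higher score = better cloud-burst candidate. Weights are
    illustrative placeholders, not a calibrated model."""
    urgency = max(0.0, job.queue_wait_hours - job.deadline_hours)
    return (w_urgency * urgency
            - w_size * job.core_hours   # big steady jobs favor on-prem rates
            - w_gravity * job.data_gb)  # data gravity penalizes heavy movers

small_urgent = Job(core_hours=100, data_gb=10,
                   deadline_hours=2, queue_wait_hours=24)
big_heavy = Job(core_hours=100_000, data_gb=10_000,
                deadline_hours=100, queue_wait_hours=4)
print(burst_score(small_urgent) > burst_score(big_heavy))  # True
```

A small, deadline-driven job stuck behind a long queue scores well; a huge, data-heavy job with slack in its deadline stays on-prem.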
Considering Costs of Data Gravity and Data Egress
You also need to think about data gravity and data egress costs. If data is generated locally, plan how to move it to the cloud. If your HPC job generates terabytes of data, look to post-process or analyze that in the cloud to avoid paying extra to get the results back. “Storage costs can be a surprisingly significant fraction of the cloud bill,” Costa noted.
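A quick back-of-the-envelope calculation shows why post-processing in place matters. The egress rate below is a hypothetical placeholder (check your provider’s published pricing); the point is the gap between downloading raw output and downloading only a summary:

```python
EGRESS_PER_GB = 0.09  # hypothetical rate; real pricing is tiered per provider

def egress_cost(gigabytes):
    """Cost of moving data out of the cloud at a flat per-GB rate."""
    return gigabytes * EGRESS_PER_GB

raw_output_gb = 20_000  # 20 TB of raw simulation output
summary_gb = 50         # post-processed results actually needed on-prem

print(f"download everything: ${egress_cost(raw_output_gb):,.2f}")
print(f"post-process in cloud, download summary: ${egress_cost(summary_gb):,.2f}")
# download everything: $1,800.00
# post-process in cloud, download summary: $4.50
```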
Using any cloud resources demands good cost control and FinOps tools. That’s even more important for cloud HPC because the bill can potentially be very large and small changes in the infrastructure selected can save significant amounts.
But cloud isn’t primarily about saving money, Woo noted: “It’s about your ability to pivot, your ability to be agile, your ability to use all these different services.” That can mean getting results faster, getting better results by running more simulations in the same amount of time, or just improving the productivity of both IT teams and HPC users.
“Often HPC is the primary tool for IP development, so it can’t go offline,” Costa pointed out. “With cloud, backup, migration and regional failover can be built in.”
A Word of Caution on Cloud Licenses
To manage cloud HPC, you can use familiar HPC software such as:
- Compilers, job submission tools, and schedulers (like Altair PBS Professional, SchedMD Slurm, IBM Platform LSF, Altair Grid Engine, and HTCondor)
- Management and monitoring tools
- Operating systems, applications, message-passing, and math libraries
- Complete solutions like NVIDIA Bright Cluster Manager or tools like OpenHPC in cloud
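Job scripts for schedulers like Slurm look the same whether the cluster is on-prem or in the cloud. As a minimal sketch, the snippet below generates a Slurm batch script in Python; the job name, sizes, and solver command are hypothetical examples, and the `#SBATCH` directives are standard sbatch options.

```python
import textwrap

def slurm_script(job_name, nodes, tasks_per_node, walltime, command):
    """Build a minimal Slurm batch script; these #SBATCH directives
    behave the same on-prem or on a cloud-hosted cluster."""
    return textwrap.dedent(f"""\
        #!/bin/bash
        #SBATCH --job-name={job_name}
        #SBATCH --nodes={nodes}
        #SBATCH --ntasks-per-node={tasks_per_node}
        #SBATCH --time={walltime}
        srun {command}
        """)

# Hypothetical job: a CFD solver run on 4 nodes, 32 tasks each.
script = slurm_script("cfd-sweep", nodes=4, tasks_per_node=32,
                      walltime="02:00:00", command="./solver input.cfg")
print(script)
# On a real cluster, submit with:
#   subprocess.run(["sbatch"], input=script, text=True)
```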
“HPC users - often scientists, engineers, quants, and artists - can access cloud HPC systems through the same applications and interfaces they use on premise,” Magro pointed out. “Management tools that rely on lower-level physical platform interfaces, such as IPMI, Redfish, or vPro are generally not compatible with cloud resources unless specifically enabled by their authors,” he warned, but noted that alternatives like Nagios work for cloud.
One area where FinOps tools may let you down is handling software licenses you already have and want to use in cloud – and ITAM teams who handle on-premises licensing often lack cloud expertise.
‘Bring your own license’ might save you money in cloud HPC, or you might find that your software vendor has different licensing models for the cloud. This is a tricky area, Woo warned: Oracle, for example, is “famous for making it very difficult for people to operate outside of their public cloud.”
You also need to consider the skills gap. “Just hiring someone who understands public cloud is hard: hiring someone who understands high performance computing and public cloud is even more difficult.”
Exploring Less-Traditional HPC
But HPC in the cloud can be an opportunity to move up the stack a little.
Cloud HPC services like Google’s Cloud HPC Toolkit offer blueprints for common workloads with infrastructure defined by familiar cloud tools like Terraform, Ansible, and Packer.
Simulation is a classic HPC workload, but cloud services like AWS SimSpace Weaver, Siemens Simcenter Cloud HPC (which runs traditional HPC software on AWS as a service) and Microsoft’s Project AirSim (for building, training and testing autonomous aircraft) make it easier to run simulations at sufficient scale without provisioning and managing infrastructure directly.
Another option is replacing or supplementing HPC with native cloud offerings, whether that’s calling an API or distributing and orchestrating compute using containers or serverless platforms. The San Diego Supercomputer Center is using GPU sharing on spot VMs in Google Kubernetes Engine to speed up the photon simulation code for the IceCube Neutrino Observatory at the South Pole.
For AI workloads like prediction and advanced analytics, you may get similar levels of insights without the HPC infrastructure by using pre-built but customizable options like Azure Cognitive Services (which include the new OpenAI models) that you can call as an API.
AI workloads used for data-driven decisions at scale have complex integration needs and are often deployed alongside enterprise applications, James Read, principal solution architect at Red Hat, noted: “This is prompting a shift away from traditional bare metal deployments towards container-based, Kubernetes-orchestrated hybrid cloud platforms that enable HPC solutions to be deployed at the edge and in the cloud.”
If you’re supplementing an existing HPC solution with these cloud services, moving that workload to the cloud can simplify integration.