Benchmarking: a vital step in cloud migration for HPC

Dr Rosemary Francis
Feb 2, 2023

--

Most high-performance computing (HPC) centres are now looking to the cloud to boost their on-premises resources. The advantages are obvious: the cloud offers short-term increased capacity as well as additional capabilities such as GPUs or InfiniBand, which may not be available on-premises or may have long lead times for ordering. But the cloud is expensive, at up to five times the total cost of running on-premises. So it is critical to both minimise costs and maximise return in the cloud, to ensure that it is a tool that drives business and not just an increase in company spend.

Choosing what to migrate to the cloud

Choosing the right workloads is an organisation-specific journey. The right workloads to run in the cloud may be embarrassingly parallel and benefit from a large number of machines available at short notice. Monte Carlo simulations are a good example. With limited HPC resources they can take a long time to complete on a small corner of the HPC cluster, meaning that by the time the results are ready they are no longer needed. In the cloud, a large number of machines can be brought up, ensuring that the results are ready within a useful time frame.
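To make "embarrassingly parallel" concrete, here is a minimal Python sketch of a Monte Carlo estimate of pi split into independent chunks. The function names (`count_hits`, `estimate_pi`) and the `map_fn` parameter are illustrative assumptions, not part of any Altair product. Each chunk has its own seed and shares no state with the others, which is what allows the work to be spread over as many machines as are available.

```python
import random


def count_hits(seed: int, samples: int) -> int:
    """Run one independent chunk: count random points inside the unit quarter circle."""
    rng = random.Random(seed)  # private generator, so chunks share no state
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(n_tasks: int, samples_per_task: int, map_fn=map) -> float:
    """Combine independent chunks into one estimate.

    map_fn is a hypothetical hook: the built-in map runs chunks serially,
    while a pool's map (or one scheduler job per chunk) runs them in parallel.
    """
    seeds = range(n_tasks)
    sizes = [samples_per_task] * n_tasks
    total_hits = sum(map_fn(count_hits, seeds, sizes))
    return 4.0 * total_hits / (n_tasks * samples_per_task)
```

Calling `estimate_pi(8, 20_000)` runs the chunks one after another. Passing the `map` method of a `concurrent.futures.ProcessPoolExecutor` as `map_fn` would run the same chunks in parallel without changing the result, because only the final sum depends on all of them.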

Other workloads that are suitable for cloud migration may be those with simpler data requirements, meaning that a comprehensive cloud-native data strategy is easier to construct. Or workloads may be moved to the cloud to use specific capabilities such as hardware accelerators like GPUs or proximate hosted services like AWS Elasticsearch.

Preparing workloads for the cloud

Preparing workloads for the cloud and setting up HPC cloud bursting is not the journey into the unknown that it once was. Altair® Control™ has comprehensive budget controls to make sure that you spend within the allocated budget in the cloud, and that the budget is spent by the right users on the right projects with the expected results. Altair workload managers can spin up and spin down cloud resources with cloud connectors for all the major vendors. HPC in the cloud is no longer something you must do alone.

Benchmarking cloud resources

Regardless of the business driver behind cloud migration, it is vital to benchmark workloads in the cloud to ensure that you have selected the right resources. It would be madness to procure hardware for use on-premises without testing it first, and the cloud is no exception. There are many types of instances available on public cloud, each optimised for a different use case. It is important to select the right machine for the job.

Only in recent years have cloud vendors invested heavily in machines with HPC capabilities, and even then, HPC can mean a variety of different workloads. The high-throughput jobs in the semiconductor industry are very different from the wide geophysical workloads of the oil and gas industry, and those are different again from the genome pipelines run in the life sciences. By benchmarking your workloads and selecting the right machines you will ensure that you get the results you need.

Altair offers two solutions for benchmarking applications prior to migration and in the cloud. Altair® Breeze™ profiles resource utilisation in detail and detects application dependencies for migration. Altair® Mistral™ is an I/O monitoring solution that can characterise the jobs on HPC clusters at scale, delivering live monitoring and historical accounting of the way applications use storage. Together with Altair cloud bursting technology, they complete the solution for cloud migration and resource tuning.

Significant savings through right-sizing machines and storage

The Sanger Institute is part of a number of large projects such as the Tree of Life programme and the Cancer, Ageing and Somatic Mutation Programme. They use Breeze and Mistral to monitor I/O patterns and to tune workloads for different compute platforms. By profiling genomics pipelines with Mistral they were able to save 10% of cloud costs with just a few hours of effort. This was starting from a machine well suited to HPC jobs, but we’ve seen much larger savings from customers who were using machines that were very ill suited to the workloads they were running. It is common for inexperienced users to choose large machines, such as those designed to serve websites or microservices, thinking that they will accelerate their HPC workloads, when in fact they are expensive and slow for scientific simulation.
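The arithmetic behind right-sizing is simple once you have benchmark runtimes: what matters is cost per completed job, not the hourly price or the size of the machine. The sketch below is a minimal illustration in Python. The instance names, prices and runtimes are invented for the example and are not measurements from the Sanger Institute or from any cloud vendor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkResult:
    name: str              # hypothetical instance label
    price_per_hour: float  # illustrative hourly price
    runtime_hours: float   # measured runtime of one benchmark job

    @property
    def cost_per_job(self) -> float:
        # Cost of one job is the hourly price multiplied by how long the job ran.
        return self.price_per_hour * self.runtime_hours


def cheapest(results):
    """Return the benchmark result with the lowest cost per completed job."""
    return min(results, key=lambda r: r.cost_per_job)


# Made-up numbers: a large general-purpose machine, an HPC-optimised one, a small one.
results = [
    BenchmarkResult("general-large", price_per_hour=4.00, runtime_hours=3.0),
    BenchmarkResult("hpc-optimised", price_per_hour=3.00, runtime_hours=1.5),
    BenchmarkResult("small", price_per_hour=0.50, runtime_hours=10.0),
]
best = cheapest(results)
```

With these invented figures the large general-purpose machine costs 12.00 per job, the small one 5.00 and the HPC-optimised one 4.50, so the biggest machine is also the most expensive way to get a result. Only a benchmark on your own workload can supply the real runtimes.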

Migrating applications to the cloud

Altair Breeze is typically where customers start when migrating an application because it can characterise the needs of that application in terms of CPU, memory and storage. It also records application dependencies, so when containerising applications for migration or when mirroring data in the cloud, Breeze tells you which files, scripts and programs you need to move. Once the application has been migrated, if there are any performance issues, Breeze lets you dig into the bottlenecks to resolve them.

Customers often then switch to Altair Mistral to run benchmarking tests at scale and to continuously monitor cloud usage. Mistral collects per-job CPU, memory and I/O information. The I/O data is broken down by file system and into reads, writes and metadata operations such as open() or delete(). This overview means that you can be sure that your applications are running efficiently and that you have made the right choices in your storage and compute infrastructure.
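As an illustration of that kind of breakdown, the Python sketch below groups I/O operation counts by file system and by category (read, write, metadata). The record layout, the `CATEGORY` table and the function names are assumptions made for this example. They are not Mistral's data format or API.

```python
from collections import defaultdict

# Hypothetical mapping from operation name to the three categories in the text.
CATEGORY = {
    "read": "read",
    "write": "write",
    "open": "metadata",
    "stat": "metadata",
    "delete": "metadata",
}


def summarise_io(records):
    """Sum operation counts per mount point and per category.

    records is an iterable of (mount_point, operation, count) tuples.
    """
    summary = defaultdict(lambda: {"read": 0, "write": 0, "metadata": 0})
    for mount, op, count in records:
        summary[mount][CATEGORY[op]] += count
    return dict(summary)


def metadata_share(counts):
    """Fraction of all operations on one file system that are metadata operations."""
    total = sum(counts.values())
    return counts["metadata"] / total if total else 0.0


# Invented sample data for one job.
records = [
    ("/scratch", "read", 1200),
    ("/scratch", "write", 300),
    ("/scratch", "open", 4500),
    ("/home", "stat", 50),
    ("/home", "read", 10),
]
summary = summarise_io(records)
```

In this made-up job, three quarters of the operations on `/scratch` are metadata operations. That kind of pattern suggests the job is opening many small files, and it can influence which storage tier is the right choice.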

Once applications have been migrated to the cloud and suitable instances selected, you should then set the right policies to ensure that users cannot run their jobs on the wrong machine types. Control makes this easy with configurable policies that lock down machine types, regions and other options. It also lets you set budgets on a user or project level to ensure that cloud spend is used as intended. By templating cloud use in this way you can keep a good handle not only on the overall budget, but also on subtleties, such as rate of spend, that make it harder for users to waste their allocation.
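The logic of such a policy can be sketched in a few lines of Python. Everything here (the `CloudPolicy` class, its fields, the `max_rate_factor` threshold and the `violations` function) is hypothetical and only illustrates the idea. It is not Altair Control's configuration format or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CloudPolicy:
    allowed_instance_types: frozenset
    allowed_regions: frozenset
    budget: float                  # total budget for the period
    period_days: int               # length of the budget period in days
    max_rate_factor: float = 1.5   # tolerated multiple of the planned daily spend


def violations(policy, instance_type, region, spent, days_elapsed):
    """Return a list of reasons why a request breaks the policy (empty if it is fine)."""
    problems = []
    if instance_type not in policy.allowed_instance_types:
        problems.append(f"instance type {instance_type!r} is not allowed")
    if region not in policy.allowed_regions:
        problems.append(f"region {region!r} is not allowed")
    if spent >= policy.budget:
        problems.append("budget exhausted")
    elif days_elapsed > 0:
        # Rate of spend: compare the actual daily spend with the planned daily spend.
        planned_daily = policy.budget / policy.period_days
        actual_daily = spent / days_elapsed
        if actual_daily > policy.max_rate_factor * planned_daily:
            problems.append("spending faster than planned")
    return problems


policy = CloudPolicy(
    allowed_instance_types=frozenset({"hpc-optimised"}),
    allowed_regions=frozenset({"eu-west"}),
    budget=3000.0,
    period_days=30,
)
```

With this example policy the planned spend is 100 per day. A project that has spent 900 after five days is flagged for its rate of spend even though it is still well within the overall budget, which is the subtlety that a single budget cap would miss.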

I don’t think that the journey to the cloud will be a push-button experience for a long time, but by using these tried and tested methodologies and tools, it doesn’t have to be a journey you undertake alone.

--

Dr Rosemary Francis

Chief Scientist for HPC at Altair. Fellow of the Royal Academy of Engineering. Member of the Raspberry Pi Foundation. Entrepreneur. Mum. Windsurfer.