What to expect in HPC for 2023
The high-performance computing industry has never looked stronger than it does going into 2023. We are seeing technology challenges in all directions as the demands on our systems become ever larger and more diverse. Let’s take a look at what we can expect to see in the next 12 months.
Exascale in 2023
The Aurora machine at Argonne National Laboratory and its testbed system, Polaris, show that exascale is really here. Polaris is already live, and Aurora aims to deliver 2 exaflops of computing resources in 2023, a truly amazing human achievement given that the birth of modern computing is still just about within living memory. At Altair we’ve been working towards this point for many years, ensuring that our workload managers can cope with systems of this scale and collaborating with large industrial customers such as Saudi Aramco and Johnson & Johnson, as well as public organisations; that track record is part of why Argonne chose Altair PBS Professional over its own in-house workload manager. When so much has been spent on exascale computing, it is more important than ever to ensure high utilisation, which can only be achieved through efficient scheduling, flexible policies and commercial support for a stable product.
HPC Administrators Expect More
As systems continue to scale, customers are more demanding of the tool ecosystem that makes it practical to administer large HPC systems. Systems for cloud bursting, budgeting, and system monitoring, as well as tools for troubleshooting problem applications, are all now part of the toolkit needed to run efficiently and to deliver the best experience to HPC users. After all, it is not possible to make decisions about the system without good visibility into the jobs and infrastructure.
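As a small illustration of the kind of visibility that matters here, the sketch below polls a PBS Professional server for its job list and summarises queued work per user. It assumes a qstat that supports JSON output (`qstat -f -F json`) and the usual attribute names; treat it as a starting point rather than a finished monitoring tool.

```python
import json
import subprocess
from collections import Counter

def queued_jobs_by_owner():
    """Summarise queued jobs per owner from a PBS Professional server.

    Assumes `qstat -f -F json` is available (recent PBS Professional
    releases) and that job attributes use the usual names; adapt the
    key names if your site reports something different.
    """
    result = subprocess.run(
        ["qstat", "-f", "-F", "json"],
        capture_output=True, text=True, check=True,
    )
    data = json.loads(result.stdout)
    counts = Counter()
    for job_id, attrs in data.get("Jobs", {}).items():
        if attrs.get("job_state") == "Q":          # queued, not yet running
            owner = attrs.get("Job_Owner", "unknown").split("@")[0]
            counts[owner] += 1
    return counts

if __name__ == "__main__":
    for owner, count in queued_jobs_by_owner().most_common():
        print(f"{owner}: {count} queued job(s)")
```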
Machine Learning and Deep Learning Workloads
We’ve talked a lot about machine learning as HPC shifts to adopt this emerging field. I say emerging because we are still getting to grips with what the technology can and cannot do, and the impact it can have.
AI so far has mostly been a mix of supervised machine learning and data analytics. Deep learning is different in that models can also learn from unlabelled data through unsupervised and self-supervised training. As deep learning becomes more prevalent, we are going to see a further shift in workloads.
Initially, a lot of ML workloads were run on Kubernetes or other container orchestration frameworks. Some years ago there wasn’t much support for containers in HPC, and many organisations let the data scientists operate their own systems, as their compute needs were small. As ML workloads have scaled, it has become clear that most dedicated container orchestration systems are designed for microservices, not for bursty, compute-intensive machine learning workloads. Now that comprehensive container support has been integrated with commercial HPC workload managers such as PBS Professional and Altair Grid Engine, it is becoming practical for organisations to pool their compute and take advantage of batch scheduling, cloud bursting and fairshare, which have long been key aspects of efficient HPC.
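To make that concrete, here is a minimal sketch of handing a containerised training step to a batch scheduler rather than a container orchestrator. The queue name, the `ngpus` resource, the container image and the training script are all placeholders; the exact directives and container integration will depend on how your PBS Professional or Grid Engine site is configured.

```python
import subprocess
import tempfile

# Hypothetical PBS job script: one node, a few cores, one GPU (assuming the
# site defines an "ngpus" resource), running a containerised training step
# via Apptainer/Singularity. The image and script names are placeholders.
JOB_SCRIPT = """#!/bin/bash
#PBS -N ml-train
#PBS -l select=1:ncpus=8:mem=32gb:ngpus=1
#PBS -l walltime=02:00:00
#PBS -q workq
cd $PBS_O_WORKDIR
apptainer exec --nv training.sif python train.py --epochs 10
"""

def submit():
    """Write the job script to a temporary file and submit it with qsub."""
    with tempfile.NamedTemporaryFile("w", suffix=".pbs", delete=False) as f:
        f.write(JOB_SCRIPT)
        path = f.name
    # qsub prints the job ID on success; from here the workload manager
    # applies its scheduling, fairshare and bursting policies as usual.
    job_id = subprocess.run(
        ["qsub", path], capture_output=True, text=True, check=True
    ).stdout.strip()
    print(f"Submitted {job_id}")

if __name__ == "__main__":
    submit()
```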
Hybrid Cloud Will Continue to Become the Norm
In fact, it is often ML workloads that are driving organisations to set up hybrid cloud environments to augment their existing HPC. ML workloads are often highly parallel and short-lived, with a deadline after which the result becomes less valuable. This is exactly where cloud bursting provides the most value. A few years ago hybrid cloud meant a lot of scripting and significant in-house expertise, but there are now comprehensive tools that make it easy to configure budgets and automations that extend your HPC environment into the cloud without having to code it all yourself.
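As a sketch of the kind of policy such tools encapsulate, the toy function below decides whether to burst based on queue backlog and a monthly budget. Every threshold, cost figure and callback here is an invented placeholder; commercial tooling applies far richer policies, but the basic logic is the same.

```python
from dataclasses import dataclass

@dataclass
class BurstDecision:
    burst: bool
    nodes: int
    reason: str

def plan_burst(queued_jobs: int, idle_cores: int,
               spend_this_month: float, monthly_budget: float,
               jobs_per_node: int = 4, cost_per_node_hour: float = 3.0,
               backlog_threshold: int = 50) -> BurstDecision:
    """Toy cloud-bursting policy: burst only when the on-prem backlog is
    large, there is no idle local capacity, and budget remains.

    All numbers are illustrative placeholders, not recommendations.
    """
    if idle_cores > 0:
        return BurstDecision(False, 0, "idle on-prem capacity available")
    if queued_jobs < backlog_threshold:
        return BurstDecision(False, 0, "backlog below bursting threshold")
    remaining = monthly_budget - spend_this_month
    if remaining <= 0:
        return BurstDecision(False, 0, "monthly cloud budget exhausted")
    # Size the burst to the backlog, capped by what the remaining budget
    # covers for one hour of cloud nodes.
    wanted = -(-queued_jobs // jobs_per_node)          # ceiling division
    affordable = int(remaining // cost_per_node_hour)
    nodes = min(wanted, affordable)
    return BurstDecision(nodes > 0, nodes, "bursting to clear backlog")

# Example: 120 queued jobs, no idle cores, $400 of a $1,000 budget spent.
print(plan_burst(queued_jobs=120, idle_cores=0,
                 spend_this_month=400.0, monthly_budget=1000.0))
```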
IoT and Streaming Data Workloads
It is tempting to think of IoT as a really irritating kettle that thinks it can make you tea before you know you want some. Something like the toaster in Red Dwarf, for those of you old enough and British enough to remember that.
However, IoT also encompasses our increasingly smart cities, transport systems and factories, where digital twins are paving the way to a more sustainable future, using less energy and making machinery last longer. At Altair we sit at the convergence of simulation, machine learning and HPC, with digital twins spanning all those areas.
Workflows, Not Jobs, in Multidimensional Scheduling
Streaming workloads also include many life sciences applications, as well as the occasional particle accelerator like Diamond Light Source. These big data, big HPC applications are fuelling much of the move from jobs to workflows, and as a result we are seeing an explosion in workflow tools, with workflows often having many stages, each with its own data and acceleration requirements. It is my opinion that this transformation into multidimensional scheduling will be the biggest driver of change within HPC as we move to modernise the industry and adapt to these more exotic and connected applications.
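To illustrate what “multidimensional” means in practice, the sketch below describes a small workflow whose stages each carry their own resource dimensions: cores, GPUs, memory and data locality. The stage names and requirements are invented; a real workflow engine would feed this kind of description to the scheduler stage by stage.

```python
from dataclasses import dataclass, field

@dataclass
class Stage:
    name: str
    cpus: int
    gpus: int = 0
    memory_gb: int = 8
    data_location: str = "on-prem"     # where the stage's input data lives
    depends_on: list = field(default_factory=list)

# A hypothetical detector-to-insight pipeline: each stage has different
# resource dimensions, so scheduling it well means more than packing cores.
workflow = [
    Stage("ingest",      cpus=4,  memory_gb=16, data_location="beamline"),
    Stage("reconstruct", cpus=64, memory_gb=256, depends_on=["ingest"]),
    Stage("train-model", cpus=8,  gpus=4, memory_gb=64,
          data_location="cloud", depends_on=["reconstruct"]),
    Stage("report",      cpus=2,  depends_on=["train-model"]),
]

for stage in workflow:
    deps = ", ".join(stage.depends_on) or "none"
    print(f"{stage.name}: {stage.cpus} cpus, {stage.gpus} gpus, "
          f"{stage.memory_gb} GB, data at {stage.data_location}, after: {deps}")
```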