As a professional with access to a high-performance computing (HPC) environment, you already know it's a scarce, precious, and expensive resource. You may have to schedule time in advance, and you may only get to work with the cluster for a limited window. You want to make the best possible use of HPC when it's available to you, because there's usually a deadline attached: a forecast or analysis that's due, a product or service release, or results that simply have to be ready in time for this evening's news.
Tuning code that runs in an HPC environment means improving its efficiency and execution time. That could translate into completing more tasks during your available window on an HPC system, or into running an analysis with a larger data set, or at higher resolution, in the same time you used to spend on something smaller or less detailed. Overall, the real name of the optimization game is productivity: getting more done, or producing more and better results, with the same resources you've used before.
The payoffs in an HPC environment can be tremendous, because even a small improvement in efficiency or throughput, multiplied across hundreds or thousands of nodes and repeated run after run, can produce substantial increases in productivity.
Approaching Optimization from a Productivity Perspective
As you examine the code for your HPC application(s), there are numerous techniques you can employ to make better use of the massive processing power and parallelism that such computing environments provide. Here are a few ideas to consider as you review your code base for optimization opportunities:
- Multi-threading and many processors speed job completion times. Even if your code can't keep one physical processor/core constantly busy, distributing that code over two or more processors (physical or virtual) lets you complete more work at a faster pace. The arithmetic is forgiving: as long as parallel overhead doesn't make the distributed job run twice as slow on two processors, or n times as slow on n processors, running more tasks/threads at the same time instead of leaving them waiting in a queue comes out ahead. You'll move more tasks through the system in the same amount of time and earn the benefits of increased processing and higher completion rates. (A minimal threading sketch appears after this list.)
- Profiling code to remove/hide communication and I/O bottlenecks. This is a well-known principle of performance optimization in general, and one that's especially relevant to HPC. Given that communication and I/O bandwidth are finite, isolating and minimizing I/O and parallel communications – both profound sources of potential delay in HPC applications – can boost overall throughput and productivity. Try to orchestrate your code so that I/O and communications requests are issued while calculations are still queued up for processor time. Using asynchronous or non-blocking communications or I/O, you can “hide” the time required to pass data from sender to receiver (transmission latency) or to flush a data write to disk (I/O latency) behind useful computation; see the non-blocking exchange sketch below. In the same vein, designating a single processor or a small group of processors to handle all I/O requests asynchronously frees the other computational nodes to keep calculating, rather than blocking while they wait for I/O requests to be serviced.
- Managing memory to maximize performance. Careful examination of a code's memory usage profile can provide insights into performance. If certain code is memory bound, or limited in scalability by memory constraints, the situation may improve if you divide or partition that code so it can run in parallel on multiple processors or threads. Large data arrays, for example, can be distributed among multiple (or many) processors and collectively updated or exchanged using MPI communications. Such problems can only be subdivided so far, but spreading those components across nodes with multiple cores and shared memory can help, particularly if the code can also be optimized using OpenMP threading (see the hybrid MPI/OpenMP sketch below). You may have to experiment with various approaches to get your “divide and conquer” designs working at their best, but these kinds of optimizations can produce incredible results.
- Overcoming communications limitations. All-to-all communications are expensive yet sometimes unavoidable, such as when setting up shared parameters and data sets at the beginning of a job run. Examining the underlying algorithms for ways to reduce or replace all-to-all exchanges is a tried-and-true way to improve performance (see the broadcast sketch below). MPI performance is also particularly responsive to tuning through implementation-specific environment variables and runtime settings that govern the size and number of buffers per processor.
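To make the multi-threading point concrete, here's a minimal OpenMP sketch in C. The pi-by-quadrature loop is just a stand-in for whatever compute kernel your application runs; the point is that a single `parallel for` directive spreads the iterations across every available thread.

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    const long n = 100000000;
    double sum = 0.0;

    /* Each thread takes a private slice of the iteration space;
       reduction(+:sum) merges the per-thread partial sums safely. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++) {
        double x = (i + 0.5) / n;     /* midpoint of subinterval i */
        sum += 4.0 / (1.0 + x * x);   /* integrand whose integral is pi */
    }

    printf("pi ~= %.10f using up to %d threads\n",
           sum / n, omp_get_max_threads());
    return 0;
}
```

Compiled with OpenMP enabled (e.g., `cc -fopenmp`), the same binary scales from one core to a full node just by changing `OMP_NUM_THREADS`.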
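The latency-hiding idea from the second bullet usually shows up as non-blocking point-to-point calls. This sketch assumes a ring-style halo exchange, a common pattern in stencil codes, purely for illustration: the exchange is posted first, interior work proceeds while the messages are in flight, and the code only waits when the boundary data is actually needed.

```c
#include <mpi.h>

#define HALO 1024

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double send_halo[HALO], recv_halo[HALO];
    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;

    for (int i = 0; i < HALO; i++) send_halo[i] = (double)rank;

    /* Post the exchange first, so the transfer proceeds while
       this rank keeps computing -- the latency is hidden. */
    MPI_Request reqs[2];
    MPI_Irecv(recv_halo, HALO, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(send_halo, HALO, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... update interior points that don't depend on recv_halo ... */

    /* Block only once the boundary values are actually needed. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    /* ... update boundary points using recv_halo ... */

    MPI_Finalize();
    return 0;
}
```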
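The third bullet's “divide and conquer” approach often takes a hybrid form: MPI ranks own disjoint partitions of a large array, shrinking each node's memory footprint, while OpenMP threads share each rank's partition through shared memory. Here's a sketch, assuming a global array size evenly divisible by the rank count:

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* A global array of n elements, partitioned so each rank holds
       only n/size of it -- per-node memory shrinks as ranks are added. */
    const long n = 1L << 24;
    long local_n = n / size;
    double *local = malloc(local_n * sizeof *local);

    /* Within a rank, OpenMP threads share the slab; across ranks,
       MPI keeps the partitions disjoint. */
    double local_sum = 0.0;
    #pragma omp parallel for reduction(+:local_sum)
    for (long i = 0; i < local_n; i++) {
        local[i] = (double)(rank * local_n + i);
        local_sum += local[i];
    }

    /* Collectively combine the partial results across all ranks. */
    double global_sum = 0.0;
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE,
                  MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %.0f\n", global_sum);

    free(local);
    MPI_Finalize();
    return 0;
}
```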
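Finally, one common way to eliminate an all-to-all (or N separate point-to-point) exchange during job setup is to let a single rank read the shared parameters and broadcast them once. A minimal sketch follows; the `params_t` fields are hypothetical placeholders for your own run configuration.

```c
#include <mpi.h>
#include <stdio.h>

typedef struct {
    int    grid_size;     /* hypothetical run parameters */
    double timestep;
    int    n_iterations;
} params_t;

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    params_t p;
    if (rank == 0) {
        /* Only one rank touches the input file or command line... */
        p.grid_size    = 4096;
        p.timestep     = 1.0e-3;
        p.n_iterations = 10000;
    }

    /* ...then a single collective broadcast replaces every rank
       exchanging the same setup data with every other rank. */
    MPI_Bcast(&p, (int)sizeof p, MPI_BYTE, 0, MPI_COMM_WORLD);

    printf("rank %d: grid=%d dt=%g iters=%d\n",
           rank, p.grid_size, p.timestep, p.n_iterations);

    MPI_Finalize();
    return 0;
}
```

As for buffer tuning, the knobs are implementation-specific: Open MPI exposes MCA parameters, for example, while Intel MPI uses `I_MPI_*` environment variables, so consult your library's documentation rather than assuming portable names.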
Overall, optimization is always a trade-off. The more time and effort you spend adjusting and tuning, the better you'd expect the processing outcomes to be, and if you give developers clear guidance on the best options for overcoming specific constraints – memory-bound, communications-bound, or I/O-bound – you can help them boost performance considerably. But if you spend too much time preparing relative to the time you spend processing, you'll start hurting productivity in other ways. Strike the right balance and you'll see productivity improve; keep up your optimizations, and those improvements should continue.
We’d be happy to talk about how code optimization could improve the productivity of your own HPC system. Just reach out.