Scaling HPC applications on your hardware is a vital task in supercomputing. If an application can’t use all of the hardware devoted to it, then you have expensive equipment sitting idle and not doing anyone any good.
When looking at scalability, it is important to first ensure that the problem itself can scale. In other words, is the problem big enough to be broken into many component parts that can be solved in parallel? Assuming the problem will scale, the application itself will have a scalability bottleneck of one sort or another, and more likely a series of them. If there is a preponderance of serial code in your application, for example, or other dependencies that force serial execution, that serial fraction caps the achievable speedup no matter how many processors you add (Amdahl's law), so the code may need to be refactored to either eliminate the serial portions or hide them by performing other work at the same time.
The classic approach to application scalability is domain decomposition: dividing the problem into many sub-domains for MPI-based parallelization, so that each MPI process gets its own chunk of the problem to work on and the work is distributed across the MPI domain.
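To make that concrete, here is a minimal sketch of a one-dimensional decomposition in MPI and C: a hypothetical problem of N elements is split as evenly as possible across the available ranks, and each rank works only on its own slice. The problem size and the printed ownership ranges are purely illustrative.

```c
/* A minimal sketch of 1-D domain decomposition with MPI.
 * Each rank computes the slice of a hypothetical global problem it owns;
 * N and the print statement are illustrative only. */
#include <mpi.h>
#include <stdio.h>

#define N 1000000  /* hypothetical global problem size */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Split N elements as evenly as possible across nprocs ranks. */
    int base    = N / nprocs;
    int extra   = N % nprocs;
    int local_n = base + (rank < extra ? 1 : 0);
    int start   = rank * base + (rank < extra ? rank : extra);

    printf("Rank %d owns elements [%d, %d)\n", rank, start, start + local_n);

    /* ... each rank now allocates and works on only its local_n elements ... */

    MPI_Finalize();
    return 0;
}
```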
It is also possible to divide the problem into compact, largely independent kernels that can be off-loaded to an accelerator, such as a GPU or FPGA, to crunch. Accelerators have many more compute cores than CPUs and can run large numbers of calculations much faster than a traditional CPU, but to achieve the best results they need single instruction, multiple data (SIMD) types of problems, which is not always the type of problem being solved. SIMD operations are also often referred to as vectorization. Most CPUs today have at least some vector calculation ability, but accelerators provide data parallelism on a much larger scale.
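As a small illustration of what a SIMD-friendly computation looks like, the sketch below shows the classic axpy loop in C: every iteration is independent, so a compiler can map it onto the CPU's vector units and an accelerator can spread it across thousands of threads. The OpenMP simd pragma is only a hint and is optional; the function name is illustrative.

```c
/* A minimal sketch of a SIMD-friendly (vectorizable) loop: the classic
 * axpy operation y = a*x + y. Each iteration is independent of the others,
 * so the loop can be mapped onto vector instructions or a massively
 * parallel accelerator. The pragma below is a hint, not a requirement. */
#include <stddef.h>

void axpy(size_t n, double a, const double *x, double *y)
{
    #pragma omp simd
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```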
The bottleneck(s) in a given application may already be known, but if it’s not obvious, you can use a profiling tool such as TAU to identify the bottleneck or bottlenecks in the code.
Let’s look at each common scaling barrier in turn and discuss how to conquer (or at least work around) them:
Memory-bound
- Symptom: The application requires large global arrays that scale in size with the problem, and each of these arrays must be kept in its entirety on the initial process.
- Possible solution: Redesign your algorithm to break up the global arrays across processes and use MPI communication to retrieve data that is stored on other processes.
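One way to do this (though not the only one) is MPI's one-sided communication: each rank keeps only its slice of the former global array in an MPI window, and any rank can fetch remote elements on demand with MPI_Get. The slice size, fill pattern, and "read from my neighbor" access below are illustrative.

```c
/* A minimal sketch of replacing a replicated global array with a
 * distributed one: each rank stores its slice in an MPI window, and
 * any rank can fetch remote elements with one-sided MPI_Get. */
#include <mpi.h>
#include <stdio.h>

#define LOCAL_N 1000   /* illustrative slice size per rank */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each rank exposes its slice of the (conceptual) global array. */
    double *slice;
    MPI_Win win;
    MPI_Win_allocate(LOCAL_N * sizeof(double), sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &slice, &win);
    for (int i = 0; i < LOCAL_N; ++i)
        slice[i] = rank * LOCAL_N + i;      /* recognizable fill pattern */

    /* Make sure every rank has filled its slice before anyone reads it. */
    MPI_Barrier(MPI_COMM_WORLD);

    /* Fetch one remote element that used to live in the global array. */
    double value = 0.0;
    int owner = (rank + 1) % nprocs;        /* illustrative: read from neighbor */
    MPI_Win_lock(MPI_LOCK_SHARED, owner, 0, win);
    MPI_Get(&value, 1, MPI_DOUBLE, owner, 0, 1, MPI_DOUBLE, win);
    MPI_Win_unlock(owner, win);

    printf("Rank %d read %.1f from rank %d\n", rank, value, owner);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```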
Communications-bound
- Symptom: Interprocess communication takes more and more wall time as the problem gets larger or the application scales to more processes.
- Possible solutions:
- Reduce all-to-all and synchronous collective communications.
- Rework the algorithm to remove any MPI_Barrier calls.
- Replace synchronous communications with asynchronous communications to hide communication latency.
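The sketch below shows the non-blocking pattern in C: a rank starts a halo exchange with MPI_Irecv/MPI_Isend, does local work that needs no remote data while the messages are in flight, and only calls MPI_Waitall when the received value is actually needed. The array size, halo width, and periodic neighbors are illustrative.

```c
/* A minimal sketch of hiding communication latency with non-blocking MPI:
 * start the halo exchange, overlap it with local work, and wait only
 * when the remote value is actually needed. */
#include <mpi.h>
#include <stdio.h>

#define N 1024   /* illustrative local array size */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int left  = (rank - 1 + nprocs) % nprocs;   /* periodic neighbors */
    int right = (rank + 1) % nprocs;

    double u[N], halo_from_left = 0.0;
    for (int i = 0; i < N; ++i) u[i] = rank;

    /* 1. Start the halo exchange without blocking. */
    MPI_Request reqs[2];
    MPI_Irecv(&halo_from_left, 1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&u[N - 1],       1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* 2. Overlap: do local work that does not need the remote value. */
    double local_sum = 0.0;
    for (int i = 0; i < N; ++i) local_sum += u[i];

    /* 3. Wait only when the communicated value is needed. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    double total = local_sum + halo_from_left;   /* boundary contribution */

    printf("Rank %d: total = %.1f\n", rank, total);
    MPI_Finalize();
    return 0;
}
```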
Compute-bound
- Symptom: The solution simply can't be computed quickly enough, such as a weather forecast that takes longer than real time to compute.
- Possible solutions:
- Port the most heavily used sections of the code (the hot spots identified by profiling) to an accelerator such as a GPU or FPGA (see the sketch after this list).
- Investigate using a different method to solve key parts of your algorithm, such as replacing an explicit method with an implicit method, or using a fast multipole method.
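For the porting option, the sketch below uses OpenMP target directives in C as one of several possible offload routes (CUDA, HIP, OpenACC, and SYCL are alternatives). The loop, problem size, and scaling factor are illustrative, and the code simply falls back to the host if no device is available.

```c
/* A minimal sketch of offloading a hot loop to an accelerator with
 * OpenMP target directives. Requires an offload-capable compiler,
 * e.g. clang or gcc built with GPU offload support and -fopenmp. */
#include <stdio.h>

#define N 1000000   /* illustrative problem size */

int main(void)
{
    static double x[N], y[N];
    for (int i = 0; i < N; ++i) { x[i] = 1.0; y[i] = 2.0; }

    /* Map the arrays to the device, run the loop there, copy y back. */
    #pragma omp target teams distribute parallel for map(to: x) map(tofrom: y)
    for (int i = 0; i < N; ++i)
        y[i] = 2.5 * x[i] + y[i];

    printf("y[0] = %.1f\n", y[0]);   /* expect 4.5 */
    return 0;
}
```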
I/O-bound
- Symptom: The code is spending increasing amounts of wall time either reading or writing files.
- Possible solutions:
- Upgrade to a parallel filesystem such as Spectrum Scale (GPFS) or Lustre.
- Parallelize the reads and writes so that I/O doesn't all have to funnel through a single process; this can be done using MPI-I/O (see the sketch after this list) or by writing many individual files that are combined in post-processing.
- Use burst buffers to stage data to and from the filesystem.
- Implement asynchronous writes, either with MPI-I/O or by setting up a dedicated MPI communicator (and a subset of ranks) to handle I/O.
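As an example of parallelizing the writes, the sketch below uses MPI-I/O so that every rank writes its own slice of the results to a shared file at a non-overlapping offset, and the collective call lets the MPI library aggregate and optimize the file accesses. The file name and slice size are illustrative.

```c
/* A minimal sketch of parallel output with MPI-I/O: each rank writes its
 * slice of the data to a shared file at its own byte offset, so output
 * no longer funnels through a single rank. */
#include <mpi.h>

#define LOCAL_N 1000   /* illustrative number of doubles per rank */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[LOCAL_N];
    for (int i = 0; i < LOCAL_N; ++i)
        buf[i] = rank;   /* fill with something recognizable */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes at a non-overlapping offset; the collective call
     * allows the MPI library to aggregate the accesses. */
    MPI_Offset offset = (MPI_Offset)rank * LOCAL_N * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, LOCAL_N, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```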
Making an application truly scalable will require multiple iterations through this process because as soon as one barrier is overcome, another will appear. However, each iteration will result in faster, more efficient code and will boost performance, throughput, and hardware utilization. You will be pushing your hardware closer and closer to its limits and thus getting your money’s worth out of it.
For a deeper discussion on scaling your own HPC system, reach out.