As HPC architectures continue to evolve and offer ever-increasing performance, it has become imperative to adapt existing software in order to fully harness that power. As discussed in an earlier post, the architectural approach to parallelism has come full circle in many ways. Nonetheless, MPI remains a fixture in HPC software design, and finding ways to combine it with OpenMP, with accelerator programming models such as CUDA and OpenCL, and with hardware like the emerging Intel MIC coprocessors is vital.
One of the key issues in combining MPI with OpenMP is the granularity of the data and the algorithms being implemented. Generally, codes that utilize MPI divide a problem into separate chunks of memory and distribute those chunks to individual nodes to do the work. In most cases, the nodes also need to pass information back and forth to keep the full solution in sync. MPI excels at these sorts of tasks, particularly if the communication can be done asynchronously, hiding the latency it inherently incurs.
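As a rough illustration of that asynchronous pattern, the sketch below posts nonblocking receives and sends for a halo exchange, does the interior work while the messages are in flight, and only waits on the requests before touching the boundary data. The neighbor ranks and buffer names are hypothetical.

```c
/* Minimal sketch of overlapping computation with MPI communication.
   "left" and "right" are the ranks of the neighboring subdomains. */
#include <mpi.h>

void exchange_and_compute(double *field, double *halo_in, double *halo_out,
                          int halo_len, int left, int right, MPI_Comm comm)
{
    MPI_Request reqs[4];

    /* Start the halo exchange with both neighbors (nonblocking). */
    MPI_Irecv(halo_in,            halo_len, MPI_DOUBLE, left,  0, comm, &reqs[0]);
    MPI_Irecv(halo_in + halo_len, halo_len, MPI_DOUBLE, right, 1, comm, &reqs[1]);
    MPI_Isend(halo_out,            halo_len, MPI_DOUBLE, left,  1, comm, &reqs[2]);
    MPI_Isend(halo_out + halo_len, halo_len, MPI_DOUBLE, right, 0, comm, &reqs[3]);

    /* Overlap: update interior cells of "field" that do not need the halos. */
    /* ... interior computation goes here ... */

    /* Make sure the halo data has arrived before the boundary update. */
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);

    /* ... boundary computation using "halo_in" goes here ... */
    (void)field;
}
```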
However, many algorithms are either memory-bound or can be subdivided only into a relatively small number of partitions. Global atmospheric models are a good example of the latter case, since they are generally decomposed in two ways: horizontally into a latitude/longitude grid covering the globe, and vertically into (typically fewer than 256) layers stretching from the surface to the top of the atmosphere. The horizontal grid contains a very large number of cells, so it can be divided among thousands of processors using MPI.
Some operations require work to be done on an entire level of the vertical column, however. Since there are usually 256 layers or fewer, this work can only be spread across 256 (or fewer) nodes without major refactoring of the code. Utilizing the threading and vectorization constructs provided by OpenMP in these situations can significantly improve performance, because each node will have anywhere from 16 to 44 CPU cores. Each of those cores could be addressed via MPI, but the more efficient solution is to create an MPI communicator that includes only one rank on each node, which then drives all the other CPU cores on the node via OpenMP threading.
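A minimal sketch of that one-rank-per-node arrangement is shown below. It uses MPI_Comm_split_type (available since MPI-3) to group the ranks that share a node, then lets the first rank on each node drive the OpenMP threads; the printed message is purely for illustration.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Group all ranks that share a node into one communicator. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    /* Only the first rank on each node does the threaded work. */
    if (node_rank == 0) {
        #pragma omp parallel
        {
            printf("node leader (world rank %d) running thread %d of %d\n",
                   world_rank, omp_get_thread_num(), omp_get_num_threads());
        }
    }

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```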
By way of background, OpenMP programming involves compiler directives that are placed into existing code. These directives generally tell the compiler to split sections of the code into individual threads, which the main thread farms out to the other cores it controls. Each thread has access to the same shared memory on the node, but certain variables and memory regions can also be marked private to a single thread.
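Here is a minimal example of that directive style, with hypothetical array names and purely illustrative arithmetic: the outer loop is split across threads, the scratch variable t is declared private to each thread, and the arrays remain shared.

```c
#include <omp.h>

#define NCOLS 10000
#define NLEVS 128

void compute_pressure(double temp[NCOLS][NLEVS],
                      double pressure[NCOLS][NLEVS])
{
    double t;  /* per-thread scratch value */

    /* Split the columns across threads; each thread gets its own copy of t. */
    #pragma omp parallel for private(t) shared(temp, pressure)
    for (int i = 0; i < NCOLS; i++) {
        for (int k = 0; k < NLEVS; k++) {
            t = temp[i][k] - 273.15;               /* illustrative conversion */
            pressure[i][k] = 287.0 * (t + 273.15); /* illustrative arithmetic */
        }
    }
}
```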
Not every operation is well suited to OpenMP, but it generally adapts very well to loop-based sections of code, which are often the meat of any HPC application. The key to ensuring proper vectorization using OpenMP is identifying and removing any dependences between elements of the “vector” being operated upon. This can usually (though not always) be done by tweaking the order of some of the operations and identifying appropriate private variables within the vectorized loops. Fortunately, since OpenMP works best at the loop level, the sections of code being vectorized are relatively self-contained and much easier for a programmer to understand than the full parallelization paradigm of the overall code.
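The sketch below shows the simplest such dependence: a running total carried across loop iterations. Declaring it as a reduction variable tells OpenMP to give each thread its own partial sum and combine them at the end; the function and variable names are hypothetical.

```c
/* The accumulation into "sum" would otherwise be a loop-carried dependence;
   the reduction clause resolves it by combining per-thread partial sums. */
double column_total(const double *levels, int n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += levels[i];
    }
    return sum;
}
```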
Perhaps the most exciting aspect of utilizing OpenMP within existing MPI codes is the emerging convergence between OpenMP’s off-loading support and the very similar OpenACC specification. Off-loading refers to the process of employing an accelerator such as a GPU (graphics processing unit) or Intel’s MIC (Many Integrated Core) chip to crunch the data. Both GPUs and MIC chips contain large numbers of computational cores that are ideally suited to vector operations. In the case of GPUs, the compute cores can number in the thousands (there are 3,072 cores on the Tesla M40, for example), while the Intel MIC chips generally support hundreds of hardware threads and run OpenMP natively.
In order to use either type of accelerator, however, the data must be off-loaded from the CPU to the device and the results copied back after they are computed. Explicitly managing this data movement between CPU and accelerator can be a daunting task, but newer versions of OpenMP (4.0 and later) support offloading through target directives in a much simpler manner. Similarly, the OpenACC standard, which closely parallels OpenMP, offers an excellent way to quickly adapt code to utilize GPUs.
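As a rough sketch of what those target directives look like, the example below moves a hypothetical input array to the accelerator with a map clause, runs the loop there, and copies the result back when the region finishes.

```c
/* Offload a simple scaling loop to an accelerator using OpenMP target
   directives; "a" is copied to the device and "b" is copied back. */
void scale_field(const double *a, double *b, int n, double factor)
{
    #pragma omp target teams distribute parallel for \
            map(to: a[0:n]) map(from: b[0:n])
    for (int i = 0; i < n; i++) {
        b[i] = factor * a[i];
    }
}
```

The roughly equivalent OpenACC version replaces the directive with `#pragma acc parallel loop copyin(a[0:n]) copyout(b[0:n])`, which is why adapting a code to one standard goes a long way toward supporting the other.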
Since both GPUs and MIC coprocessors only seem to be gaining steam in the HPC arena, this convergence of hardware and emerging software standards offers a tremendous opportunity to not only increase performance on existing platforms, but also easily harness the power of newer architectures that will begin to dominate the marketplace. For a deeper conversation on how the trend will likely play out, get in touch.