For high-performance computing professionals, a deep understanding of how programs execute is vital. Often, that knowledge makes the difference between using computing resources effectively and meeting deadlines, and falling short of both. The keys to optimizing HPC resources are profiling program elements such as functions and subroutines, then identifying and alleviating potential bottlenecks.
Profiling refers to the process of observing a program while it executes and recording how often each function or subroutine is called and how much time is spent in it. This data can then be sorted or ranked to determine where the most computing cycles are spent. The most heavily used program elements make the best candidates for optimization, because improving their performance provides the greatest return on the time and effort invested.
Profiling can be performed in either serial or parallel fashion. Serial profiling produces a set of timings for each function or subroutine in a program. Amdahl's Law limits the speedup that can be achieved by optimizing any single serial function or subroutine, and it extends to parallel programs (and their functions and subroutines) as well.
For example, code that is 90 percent parallel and 10 percent serial can be accelerated at most 10x through parallelization. Parallelization also means that profiling produces significantly more data to analyze, because timings must be measured for each subroutine or function as it runs on each processor used (for MPI-based parallelization) or each execution thread employed (for OpenMP-based parallelization).
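That 10x figure follows directly from Amdahl's Law. Writing s for the serial fraction of a program and N for the number of processors, the achievable speedup is bounded by:

    speedup(N) = 1 / (s + (1 - s)/N)  <=  1/s

With s = 0.10, the limit is 1/0.10 = 10x, no matter how many processors are applied.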
Furthermore, when a program is parallelized, it becomes necessary to measure and rank communication latency as part of the profiling process. This introduces additional factors to consider when looking for bottlenecks, and it demands further time and effort to analyze and address them.
The Tuning and Analysis Utilities (TAU) Can Help
TAU is a freely available, open source set of tools that assists with profiling parallel programs written in languages including Fortran, C, C++, UPC, Java, and Python. These utilities gather information by instrumenting functions, subroutines, statements, methods, and other basic building blocks of code, and they support event-based sampling as well. TAU works by automatically inserting timing calls into each function or subroutine in a program at compile time. To do so, it supplies a set of wrapper scripts for the base compilers, including those from GNU, Intel, PGI, and others. It is designed to automate the work of adding instrumentation to code and to simplify the collection and processing of the data that instrumentation produces.
TAU includes many options and capabilities. It is highly flexible both in its configuration (which may be customized if and when that proves helpful) and in its application when profiling code. For example, many of its options may be specified dynamically at runtime using environment variables, as sketched below.
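As an illustration, a run might toggle a few of these variables before launching the instrumented program. The variable names below are standard TAU runtime settings, though the exact set available to you depends on how your TAU installation was configured:

    # Collect profiles, adding call-path data two levels deep
    export TAU_PROFILE=1
    export TAU_CALLPATH=1
    export TAU_CALLPATH_DEPTH=2

    # Optionally emit event traces as well
    export TAU_TRACE=1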
Building a TAU environment depends on the kinds of parallel code your programs include. A program may contain serial elements alongside parallel functions or subroutines that invoke either or both of the MPI and OpenMP parallel programming interfaces. Making use of TAU means configuring a wrapper for one or more of the supported compilers and then accounting for MPI, OpenMP, or both. For MPI, that means specifying the locations of its libraries and include files; for OpenMP, it means specifying the thread instrumentation package (Opari, PAPI, and so forth) to use in constructing a TAU-instrumented executable.
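As an illustrative sketch only (install locations and package choices vary from site to site, and every path below is a placeholder), configuring a TAU build with MPI and Opari-based OpenMP support might look like this:

    # Configure TAU with MPI support, OpenMP instrumentation via Opari,
    # and the PDT source analyzer; substitute your site's actual paths.
    ./configure -prefix=/opt/tau \
                -mpiinc=/opt/mpi/include -mpilib=/opt/mpi/lib \
                -openmp -opari \
                -pdt=/opt/pdtoolkit
    make install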
After that, you can compile your programs using the TAU-wrapped compilers so that the instrumentation is inserted. This requires setting environment variables to guide compilation (CC=tau_cc.sh, CXX=tau_cxx.sh, FC=tau_f90.sh); pointing TAU at the stub Makefile to use from its install directory (there may be several); and providing a list of TAU options to be applied during compilation. Next, you execute your program through the tau_exec launcher, either directly for serial programs (tau_exec myprog) or under the MPI launcher for parallel ones (for example, mpirun -np X tau_exec myprog for an MPI-based parallel program).
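Putting those pieces together, a minimal build-and-run session might resemble the following. The stub Makefile name and the /opt/tau path are placeholders; list the contents of your TAU installation's lib directory to see which stub Makefiles it actually provides:

    # Select a stub Makefile matching the build, and pass compile-time
    # options through TAU_OPTIONS.
    export TAU_MAKEFILE=/opt/tau/x86_64/lib/Makefile.tau-mpi-pdt
    export TAU_OPTIONS=-optVerbose

    # Compile with the TAU wrapper scripts in place of the base compilers.
    make CC=tau_cc.sh CXX=tau_cxx.sh FC=tau_f90.sh

    # Run a serial program under the tau_exec launcher...
    tau_exec ./myprog

    # ...or an MPI program across 8 ranks.
    mpirun -np 8 tau_exec ./myprog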
Making Use of Program Data from TAU
The TAU environment also features a profile visualization tool named paraprof, which presents the performance analysis results that TAU obtains via code instrumentation in a variety of graphical forms. With this information, users can quickly identify performance bottlenecks in their functions and subroutines. Furthermore, because TAU can generate event traces that work with other open source trace visualization tools, it can also help address latency issues related to communications and input/output activities.
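As a quick sketch, once an instrumented run has written its profile files to the working directory, the data can be examined from the command line or in the paraprof GUI:

    # Text-mode summary of the profile data in the current directory
    pprof

    # Launch the paraprof GUI on the same data
    paraprof

    # Pack the profiles into a single file for archiving or sharing
    paraprof --pack myprog.ppk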
As key functions or subroutines are altered to improve performance and reduce or eliminate bottlenecks, TAU can provide metrics that show how things are improving. It can also help guide ongoing optimization efforts, because no sooner is one bottleneck addressed or mitigated than another emerges to take its place. This makes TAU and paraprof invaluable tools in the HPC developer's arsenal.
For a deeper conversation about best practices for deploying TAU and paraprof in your environment, just reach out.