HPC systems are frustratingly complex beasts to tame. Competing interests from hardware and software support teams, coupled with demands from IT security compliance, can make “consistency” in a cluster tough to achieve.
Baselining is the natural answer. With baselining, we set a series of “custom” benchmarks. By custom, I mean they are a given set of benchmarks that come together to provide a quantitative analysis of the performance of an HPC system. Are the numbers used for the individual benchmarks industry leading? Perhaps not. But they quantify how your system needs to operate in order to run its applications properly.
Consider incorporating baselines into your normal preventative maintenance schedule. Work with your change management coordinator to ensure application benchmarks are performed prior to releasing the system to systems administrators for maintenance. Allow time for system hardware baselines to be executed during maintenance and make sure that application baselines are completed and the results are analyzed prior to exiting maintenance.
The importance of pre-maintenance baselines can’t be overstated: they show everyone that the system is running as expected before maintenance begins. With a pre-maintenance baseline, system operations teams and application development teams have assurance that the system is operating consistently at the point it is handed over. Executing the same baselines upon completing maintenance provides a similar assurance that unintended or unknown changes were not introduced through the maintenance event.
Should you encounter a failure of the pre-maintenance baseline, it may be unwise to turn over the system to the systems administration team for them to introduce more change to a system that already has an unknown error. Rescheduling the maintenance, if possible, to permit the troubleshooting of the existing problem without introducing new change, may be a better course of action.
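To make the pre- and post-maintenance check concrete, here is a minimal sketch of how a baseline run might be compared against reference values. The benchmark names, the numbers, and the tolerance thresholds are all hypothetical, chosen for illustration only; your own references would come from your system’s established baseline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    """Expected value for one benchmark, with an allowed relative deviation."""
    expected: float
    tolerance: float  # e.g. 0.05 allows +/- 5 percent


def check_baseline(results, references):
    """Return the benchmarks whose result is missing or outside tolerance."""
    failures = {}
    for name, ref in references.items():
        measured = results.get(name)
        if measured is None:
            failures[name] = "no result recorded"
            continue
        deviation = (measured - ref.expected) / ref.expected
        if abs(deviation) > ref.tolerance:
            failures[name] = f"{deviation:+.1%} from expected"
    return failures


# Hypothetical benchmark names and values, for illustration only.
references = {
    "memory_bandwidth_gbs": Reference(expected=200.0, tolerance=0.05),
    "app_wallclock_s": Reference(expected=840.0, tolerance=0.03),
}
pre_maintenance = {"memory_bandwidth_gbs": 198.4, "app_wallclock_s": 911.0}

failures = check_baseline(pre_maintenance, references)
if failures:
    print("Pre-maintenance baseline failed; consider rescheduling:")
    for name, reason in failures.items():
        print(f"  {name}: {reason}")
else:
    print("Baseline within tolerance; system can be released for maintenance.")
```

The same function can be run again after maintenance with the post-maintenance results. Note that this sketch flags deviations in either direction, on the assumption that an unexplained change is worth investigating even when it looks like an improvement.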
While change in an HPC system is inevitable, change in the core benchmarks and baselines is not. Finding the right mix of benchmarks to adequately represent your system’s performance takes time and effort, but once established they should be maintained. They can live for years and even across systems.
These core baselines create a basis for comparing change in your system across updates, and even permit performance comparisons between dissimilar systems. For the life of the system, consider the core baseline benchmarks to be sacrosanct and immutable. There’s immense value in being able to quantify the performance of your system across its lifetime. Changing a baseline benchmark means future runs can no longer be compared against the old baseline results; it becomes an apples-and-oranges comparison.
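One way to guard against an accidental apples-and-oranges comparison is to record a fingerprint of the benchmark definition alongside every result, and only compare runs whose fingerprints match. The sketch below is one possible approach, not a prescribed method; the definition fields (`input_deck`, `ranks`, `source_rev`) and the result values are hypothetical.

```python
import hashlib
import json


def fingerprint(definition):
    """Hash a benchmark definition so any change to it is detectable."""
    # sort_keys gives a canonical form, so key order does not affect the hash.
    canonical = json.dumps(definition, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def comparable(run_a, run_b):
    """Two runs are comparable only if they used the same definition."""
    return run_a["fingerprint"] == run_b["fingerprint"]


# Hypothetical definition fields and results, for illustration only.
core_def = {
    "name": "app_benchmark",
    "input_deck": "case_a",
    "ranks": 256,
    "source_rev": "v1.0",
}
changed_def = dict(core_def, ranks=512)  # same benchmark, altered parameters

run_before = {"fingerprint": fingerprint(core_def), "wallclock_s": 840.0}
run_after = {"fingerprint": fingerprint(core_def), "wallclock_s": 812.0}
run_changed = {"fingerprint": fingerprint(changed_def), "wallclock_s": 455.0}

print(comparable(run_before, run_after))    # same definition
print(comparable(run_before, run_changed))  # definition was altered
```

Storing the fingerprint with each result means a changed benchmark shows up as a break in the history rather than as a misleading performance jump.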
For system administrators, having immutable core baselines gives them a target to keep the system running optimally as they apply system updates and changes. For application management teams, the immutability of a local application benchmark provides some interesting insights:
- Certainty that the HPC system is executing as expected.
- A point-in-time lookback of how their code executed in the past using a code-base they couldn’t have changed.
- The quantitative expression, through an application benchmark result, of any changes in underlying system infrastructure. For example, a change in an OFA or OFED driver might alter RDMA communications and produce a performance increase (or worse, a decrease).
Consider carefully the core benchmarks you want to include in your system’s baseline profile. Once set, they should be maintained. This does not mean additional benchmarks cannot be added. The overall suite does need to grow to incorporate new hardware or coding methods, but you need to keep a core set that doesn’t change so that you can measure and compare clusters from a historical or maintenance-upgrade perspective.
As an example, MPI-IO is an I/O API built on MPI, and you may want to include that type of benchmark if your applications start using it. You may also want to include different GPU benchmarks or updated applications. However, that doesn’t mean you should get rid of the old benchmarks or older applications if you want to maintain the historical perspective on performance.
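The idea of a fixed core with room to grow can be sketched as a small data structure. This is an illustrative sketch only; the class name `BaselineSuite` and the benchmark names are hypothetical.

```python
class BaselineSuite:
    """A suite with a fixed core and an extensible set of additional benchmarks."""

    def __init__(self, core):
        self._core = frozenset(core)  # fixed for the life of the system
        self._extra = set()           # may grow or shrink over time

    def add(self, name):
        """Add a new benchmark alongside the core set."""
        self._extra.add(name)

    def remove(self, name):
        """Remove an additional benchmark; core benchmarks are protected."""
        if name in self._core:
            raise ValueError(f"{name} is a core benchmark and cannot be removed")
        self._extra.discard(name)

    def benchmarks(self):
        """Return every benchmark in the suite, core and additional, sorted."""
        return sorted(self._core | self._extra)


# Hypothetical benchmark names, for illustration only.
suite = BaselineSuite(["memory_bandwidth", "cpu_stress", "app_benchmark"])
suite.add("mpi_io")       # applications started using MPI-IO
suite.add("gpu_compute")  # new GPU nodes were installed

try:
    suite.remove("app_benchmark")
except ValueError as err:
    print(err)

print(suite.benchmarks())
```

New benchmarks extend the suite without touching the core, so the historical comparison stays intact.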
In my next post, I’ll get into some specific advice about what sort of baselining you need to be conducting to optimize the performance of your HPC system, including guidance on memory bandwidth tests, high-intensity CPU checks, interconnect networks, and more. In the meantime, reach out with any questions or for a deeper conversation about your own situation.