In this post, we’ll dive into porting the Global Workflow, a critical weather forecasting system, to the Google Cloud Platform (GCP). The Global Workflow has two major components: the Global Data Assimilation System (GDAS), powered by the Gridpoint Statistical Interpolation (GSI) package, and the Global Forecast System (GFS). Together they generate high-quality global weather forecasts. The motivation was twofold: to take advantage of the GPU capabilities on GCP for AI-based work, and to expand the computing resource portfolio for NOAA and the broader research community. The goal was to provide low-cost access to powerful computing resources that may not be readily available on on-premises HPC systems. As a member of the team that undertook this project, I am sharing the key steps and the challenges we faced during the porting process:
- Replicating a GFS Forecast: The first step was to take a GFS forecast that had been initialized on an on-premises system (Hera) and rerun it on the GCP cluster, completing the 384-hour forecast within 4 hours of wall-clock time (a rate of 96 forecast hours per wall-clock hour).
- Running a Full Global Workflow Cycle: The team aimed to run a C384 (~25 km mean resolution) 80-member global workflow cycle test with 4 GDAS cycles and 1 GFS cycle per day, completing the full cycle in less than 6 hours.
- Software Configuration: At the time, the GSI had not been upgraded to a compiler newer than Intel version 18, but only Intel 2021.3 was available on the GCP, which posed significant challenges for this critical but aging system. Optimization and debugging efforts were undertaken to refresh the GSI and its upstream dependencies. Despite these efforts, the EnKF (Ensemble Kalman Filter) re-centering utility (part of the GSI) would sometimes produce erroneous values that could not be reproduced from run to run, and the utility had to be rerun manually to obtain sensible solutions. After this GCP experiment, I resolved these issues together with another team at NOAA, following a Herculean effort to upgrade the GSI to Intel 2021 and to a standard library suite known as Spack-Stack.
- Cluster Configuration: Determining the optimal MPI settings for the GCP cluster posed a significant challenge, as both the GSI and the GFS make extensive use of varied communication schemes. In the end, TCP was chosen as the OpenFabrics Interfaces (OFI) provider because it gave the most stable communication, though other providers (such as MLX, which makes use of InfiniBand interconnects) would certainly save on cost. Even so, stale file handles were periodically encountered during the ensemble forecasts. The team suspected these were caused by the large number of concurrent jobs putting significant strain on the network when propagating data.
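As a rough illustration of the provider choice described in the last item, the sketch below builds the environment for an MPI launch with the OFI provider pinned to TCP. It assumes an Intel MPI or other libfabric-based stack, which this post does not confirm: `FI_PROVIDER` is libfabric's provider selector, and `I_MPI_FABRICS` and `I_MPI_OFI_PROVIDER` are Intel MPI settings. The helper name `ofi_environment` and the commented-out launch command are hypothetical.

```python
import os


def ofi_environment(provider, base_env=None):
    """Return a copy of the environment with the OFI provider pinned.

    Hypothetical helper: the variable names are real libfabric and Intel MPI
    settings, but whether the GCP cluster was configured this way is an
    assumption made for illustration.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["I_MPI_FABRICS"] = "shm:ofi"      # shared memory on a node, OFI between nodes
    env["I_MPI_OFI_PROVIDER"] = provider  # Intel MPI's provider selector
    env["FI_PROVIDER"] = provider         # libfabric's provider selector
    return env


if __name__ == "__main__":
    # "tcp" is the provider the team settled on; "mlx" is the alternative
    # mentioned for InfiniBand interconnects.
    env = ofi_environment("tcp", base_env={})
    for name in sorted(env):
        print(f"{name}={env[name]}")
    # A launch would then pass this environment along, for example:
    # subprocess.run(["mpiexec", "-n", "4", "./hypothetical_executable"],
    #                env=env, check=True)
```

Keeping the provider in one place like this makes it easy to switch from `tcp` to `mlx` for a comparison run without touching the job scripts themselves.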
Despite the challenges, the team was able to compare the results of the GFS forecasts and the global workflow cycles between the GCP and the Hera system. The analysis showed good agreement, with some minor biases in data-sparse regions, likely due to the differences in data sources between the two systems.
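This post does not detail which metrics were used for the comparison. As a hedged illustration only, the sketch below computes an area-weighted mean bias and RMSE between two fields on a regular latitude-longitude grid, weighting each row by cos(latitude) so that the polar rows do not dominate. The function name, the toy grids, and their values are made up for the example and are not results from the experiment.

```python
import math


def weighted_bias_and_rmse(test_field, reference_field, latitudes_deg):
    """Area-weighted mean bias and RMSE of test_field against reference_field.

    Both fields are lists of rows, one row per latitude. Each row is weighted
    by cos(latitude), a common proxy for cell area on a regular lat-lon grid.
    """
    weight_sum = 0.0
    diff_sum = 0.0
    sq_sum = 0.0
    for test_row, ref_row, lat in zip(test_field, reference_field, latitudes_deg):
        weight = math.cos(math.radians(lat))
        for test_value, ref_value in zip(test_row, ref_row):
            diff = test_value - ref_value
            weight_sum += weight
            diff_sum += weight * diff
            sq_sum += weight * diff * diff
    return diff_sum / weight_sum, math.sqrt(sq_sum / weight_sum)


if __name__ == "__main__":
    # Toy 3x2 temperature fields in kelvin (invented values).
    latitudes = [-60.0, 0.0, 60.0]
    on_prem = [[270.0, 271.0], [300.0, 301.0], [272.0, 273.0]]
    cloud = [[270.5, 271.5], [300.5, 301.5], [272.5, 273.5]]
    bias, rmse = weighted_bias_and_rmse(cloud, on_prem, latitudes)
    print(f"bias={bias:.3f} K, rmse={rmse:.3f} K")
```

A nonzero bias with a similar RMSE points to a systematic offset between the two runs, while a near-zero bias with a larger RMSE points to scattered differences.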
This exercise showed that the GCP could be a viable option for operational weather forecasting. Now that the issues with the GSI and the EnKF re-centering utility have been addressed, the team believes the main remaining challenge for running the Global Workflow there is adopting a more efficient OpenFabrics Interfaces (OFI) provider such as MLX. The ability to leverage GPU resources for AI-based improvements to the analysis is also an exciting prospect for the future.