
News & Blogs


Porting the Global Workflow to Google Cloud Platform: Challenges and Lessons Learned

  • Application Engineering, Industry Trends
  • January 17, 2025
  • Dave Huber

In this post, we’ll dive into porting the Global Workflow, a critical weather forecasting system, to the Google Cloud Platform (GCP). The Global Workflow consists of two major components – the Global Data Assimilation System (GDAS) powered by the Gridpoint Statistical Interpolation (GSI) package and the Global Forecast System (GFS) – and is responsible for generating high-quality, global weather forecasts. The motivation behind this effort was to take advantage of the GPU capabilities on GCP for AI-based work, as well as to expand the computing resource portfolio for NOAA and the broader research community. By leveraging the cloud, the goal was to provide low-cost access to powerful computing resources that may not be readily available via on-premises HPCs. As a member of the team that undertook this project, I am sharing the key steps and challenges we faced during the porting process:

  1. Replicating a GFS Forecast: The first step was to replicate, on the GCP cluster, a GFS forecast that had been initialized on an on-premises system (Hera), completing the 384-hour forecast within a 4-hour timeframe.
  2. Running a Full Global Workflow Cycle: The team aimed to run a C384 (~25 km mean resolution) 80-member global workflow cycle test with 4 GDAS cycles and 1 GFS cycle per day, completing the full cycle in less than 6 hours.
  3. Software Configuration: At the time, the GSI had not been ported to any compiler newer than Intel version 18, while the GCP cluster offered only Intel 2021.3, which posed significant challenges for this critical but aging system. Optimization and debugging efforts were undertaken to refresh the GSI and its upstream dependencies. Despite these efforts, the EnKF (Ensemble Kalman Filter) re-centering utility (part of the GSI) would sometimes produce erroneous values that could not be reproduced from run to run, and the utility had to be rerun manually to obtain sensible solutions. After this GCP experiment, I resolved these issues together with another team at NOAA, following a Herculean effort to upgrade the GSI to Intel 2021 and to a standard library suite known as Spack-Stack.
  4. Cluster Configuration: Determining the optimal MPI settings for the GCP cluster was a significant challenge, as both the GSI and the GFS make extensive use of varied communication schemes. In the end, TCP was chosen as the OpenFabrics Interfaces (OFI) provider because it gave the most stable communication, though other providers (such as MLX, which makes use of InfiniBand interconnects) would certainly save on cost. Even so, stale file handles were periodically encountered during the ensemble forecasts; the team suspected that the large number of concurrently running jobs put significant strain on the network when propagating data.
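
The provider choice described in item 4 is usually made through environment variables rather than code changes. The sketch below builds the environment for a launch using the libfabric `FI_PROVIDER` variable and Intel MPI's `I_MPI_FABRICS` setting. The helper name `fabric_env` is hypothetical, and the exact settings used on the GCP cluster are not stated in this post, so treat this as an illustration under those assumptions.

```python
import os

# Providers discussed in the post: "tcp" (chosen for stability) and "mlx"
# (for InfiniBand interconnects). Anything else is rejected in this sketch.
KNOWN_PROVIDERS = {"tcp", "mlx"}


def fabric_env(provider, base_env=None):
    """Return a copy of the environment with the OFI provider selected.

    Hypothetical helper: it only sets FI_PROVIDER (read by libfabric) and
    I_MPI_FABRICS (read by Intel MPI); it does not launch anything.
    """
    if provider not in KNOWN_PROVIDERS:
        raise ValueError(f"unsupported provider: {provider!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["I_MPI_FABRICS"] = "shm:ofi"  # shared memory within a node, OFI between nodes
    env["FI_PROVIDER"] = provider
    return env
```

The returned mapping could be passed as `env=` to `subprocess.run` along with whatever launcher the scheduler provides, for example `fabric_env("tcp")` for the stable configuration described above.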

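The timing target in item 1 translates into a simple throughput budget. As a worked sketch (the function name is hypothetical, and the 4-hour figure is read here as a wall-clock limit):

```python
def forecast_hours_per_wall_hour(forecast_hours, wall_hours):
    """Simulated forecast hours that must complete per wall-clock hour."""
    if wall_hours <= 0:
        raise ValueError("wall_hours must be positive")
    return forecast_hours / wall_hours


# Item 1: a 384-hour GFS forecast within a 4-hour window.
rate = forecast_hours_per_wall_hour(384, 4)
seconds_per_forecast_hour = 3600 / rate

print(rate)                       # 96.0 forecast hours per wall-clock hour
print(seconds_per_forecast_hour)  # 37.5 wall-clock seconds per forecast hour
```

In other words, each simulated forecast hour, including any output it writes, has to finish in under about 37.5 seconds on average to meet the target.
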
Despite the challenges, the team was able to compare the results of the GFS forecasts and the global workflow cycles between the GCP and the Hera system. The analysis showed good agreement, with some minor biases in data-sparse regions, likely due to the differences in data sources between the two systems.
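
Agreement between two systems is commonly summarized with a mean difference (bias) and a root-mean-square difference over matching grid points. The post does not say which metrics or tools the team used, so the following is only a minimal sketch with made-up sample values and hypothetical function and variable names.

```python
import math


def _check_pairs(test, reference):
    """Both inputs must be non-empty and the same length."""
    if len(test) != len(reference) or not test:
        raise ValueError("inputs must be non-empty and the same length")


def bias(test, reference):
    """Mean of (test - reference) over paired values."""
    _check_pairs(test, reference)
    return sum(t - r for t, r in zip(test, reference)) / len(test)


def rmsd(test, reference):
    """Root-mean-square difference over paired values."""
    _check_pairs(test, reference)
    squared = sum((t - r) ** 2 for t, r in zip(test, reference))
    return math.sqrt(squared / len(test))


# Illustrative values only (not data from the experiment).
hera_t2m = [280.0, 285.0, 290.0, 275.0]
gcp_t2m = [280.5, 285.0, 289.5, 276.0]

print(bias(gcp_t2m, hera_t2m))  # 0.25
print(rmsd(gcp_t2m, hera_t2m))
```

This sketch treats every point equally; a comparison over a global latitude-longitude grid would normally weight each point by its grid-cell area.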

This exercise showed that the GCP could be a viable option for operational weather forecasting. Now that the issues with the GSI and the EnKF re-centering utility have been addressed, the team believes the only remaining challenge for running the Global Workflow there is moving to a more optimal OpenFabrics Interfaces (OFI) provider such as MLX. The ability to leverage GPU resources for AI-based improvements to the analysis is also an exciting prospect for the future.
