Production Supercomputing
A Government organization responsible for generating weather forecast guidance products needed its high-performance computing (HPC) systems to be available 24/7/365. RedLine introduced major, timely architectural and systems design changes to ensure the reliability of these critical systems.
Challenge
Analysts at a Government organization whose mission is to deliver timely weather forecast guidance needed their main computing system available 24 hours a day, seven days a week, 365 days a year (24/7/365). Their backup system had reduced capacity and produced less accurate forecast guidance. Any loss of forecasting ability could have a major impact on public safety and personal property.
Solution
To build the required level of reliability into the system, RedLine worked with a prime contractor to design and implement an architectural change from a mainframe to a massively parallel distributed-memory system, with high-availability components built into each of two supercomputers. The primary and backup systems were essentially identical, and multiple levels of redundancy were built in.
These critical changes increased capacity, introduced high availability, and eliminated single points of failure within each system. Initial RedLine services included:
- Assisting with code porting and tuning of numerical models
- Developing system monitoring capabilities
- Establishing best practices and procedures for day-to-day system administration, production, and development
- Training users
Building on this success, the prime contractor engaged RedLine to lead the 24/7 operational support team, assist the customer’s development community with code optimization and debugging, and introduce new system capabilities while maintaining the same high level of reliability.
Results
The operational uptime for the customer’s main computing systems increased from under 90 percent to 99 percent. More timely and accurate forecast guidance improved public safety by giving analysts much-needed data consistently and reliably.
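To put the uptime figures in perspective, the short sketch below converts an uptime percentage into hours of downtime per year. It is illustrative arithmetic only: the function name `annual_downtime_hours` is hypothetical, and it assumes uptime is measured over a full 365-day year, which the case study does not state. Because the earlier figure was "under 90 percent", the value computed for 90 percent is a lower bound on the earlier downtime.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a non-leap year


def annual_downtime_hours(uptime_percent: float) -> float:
    """Return the hours of downtime per year implied by an uptime percentage."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    return HOURS_PER_YEAR * (100.0 - uptime_percent) / 100.0


# Figures taken from the case study: under 90 percent before, 99 percent after.
before = annual_downtime_hours(90.0)  # lower bound on the earlier downtime
after = annual_downtime_hours(99.0)

print(f"At 90% uptime: {before:.1f} hours of downtime per year")
print(f"At 99% uptime: {after:.1f} hours of downtime per year")
```

Under these assumptions, moving from 90 percent to 99 percent uptime cuts annual downtime from at least 876 hours to about 87.6 hours, roughly a tenfold reduction.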
The prime contractor’s success on the initial three-year contract resulted in a follow-on 10-year contract and, most recently, another 10-year contract award. RedLine maintains an outstanding relationship with both the customer and the prime contractor, and continues to lead the supercomputing operational support team. The lessons learned and processes developed in this complex environment form a solid basis for the services RedLine delivers to all customers.