Software & Operation
The HPC.NRW project coordinates the software environments and computer operations of the participating institutions. This results in two main areas:
- the coordination of the HPC software ecosystem on the HPC systems of the participating facilities
- the establishment of a competence network for operational issues in the HPC environment in NRW
The coordinating facility for these tasks is the Paderborn Center for Parallel Computing (PC2) at the University of Paderborn. For further information, please contact Robert Schade, email@example.com.
Software provision is one of the main tasks of HPC.NRW. This includes not only the appropriate simulation programs, but also components such as optimized numerical libraries, software development tools like sequential and parallel debuggers, tools for performance analysis, and general tools for the efficient use of HPC resources.
To this end, the needs of the data centers, their users, and the HPC support are regularly assessed. This is followed by an evaluation of possible alternatives.
HPC.NRW then acquires the required software and/or its alternatives and makes them available to the users. Particular attention is paid to the use of freely available packages. In the case of commercial software, at least a dual-vendor strategy is pursued in order to avoid vendor lock-in. In addition, HPC.NRW supports other initiatives for state-wide software acquisition, such as the state license for NAG compilers and libraries.
Statewide licenses for the following products have already been successfully negotiated by HPC.NRW. They are available via HPC.NRW and include support for all locations:
- Intel Parallel Studio XE/Intel oneAPI with Premier Support
- TotalView HPC Debugger
- PGI Compiler/NVIDIA HPC SDK
- Arm Software (DDT, MAP, Performance Reports)
- Professional support for Slurm
If a state license covers only a limited number of license tokens, these are managed by shared, redundantly operated license servers so that the maximum number of tokens is available at each facility. This allows the largest possible HPC jobs to be run with the programs or tools.
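The effect of pooling can be illustrated with a little arithmetic. The following is a minimal sketch, not a description of the actual license server setup: the site names, the token count of 256, and the function `largest_job_tokens` are hypothetical and chosen only for illustration.

```python
def largest_job_tokens(total_tokens, sites, shared):
    """Return, per site, the most tokens a single job could check out.

    shared=True models one common pool served to all sites;
    shared=False models a fixed, equal split of the same tokens.
    """
    if shared:
        # Every site can draw on the whole pool (when it is otherwise idle).
        return {site: total_tokens for site in sites}
    # Each site is limited to its own fixed share.
    per_site = total_tokens // len(sites)
    return {site: per_site for site in sites}


# Hypothetical example: 256 tokens, four sites.
sites = ["site_a", "site_b", "site_c", "site_d"]
pooled = largest_job_tokens(256, sites, shared=True)
split = largest_job_tokens(256, sites, shared=False)
print(pooled["site_a"], split["site_a"])  # prints "256 64"
```

With a fixed split, no site could run a job needing more than a quarter of the tokens. With a shared pool, the upper bound for a single job is the full token count, at the cost of contention when several sites are busy at once.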
In addition, individual sites or smaller groups thereof have already successfully carried out further acquisitions through HPC.NRW.
SPECIAL CHALLENGES: SIMULATION PROGRAMS
The acquisition of simulation programs poses an additional challenge. On the one hand, licensing models are often designed for workgroups rather than data centers, and the software vendor's client model does not cover all data center users (both local and external). On the other hand, simulation programs are usually tailored to individual research directions. HPC.NRW addresses these problems in several ways:
1. Negotiations with software providers who offer a workgroup model with the aim of converting or extending it to a data center model, or the introduction of free support and maintenance licenses.
2. Clustering of specific needs at certain sites by improving user mobility and documentation.
3. Offering assistance with the migration from commercial programs to free alternatives in close cooperation with HPC support and specialist advice.
In addition to the hardware, the software environment is one of the most important aspects of an HPC system. In this respect, harmonization of the software environment is an essential prerequisite for user mobility. HPC.NRW has improved the mobility of users between systems and, in particular, between different levels of the HPC supply pyramid (from workgroup clusters, through tier-3 and tier-2 systems, to tier-0/1 systems), so that sufficiently qualified users can efficiently utilize HPC systems of higher tiers.
The HPC.NRW competence network pursues an implicit path with regard to harmonization: the provision of best practice guides and practical assistance on HPC operation aims to achieve a gradual harmonization that still offers sufficient flexibility to accommodate local specifics such as special hardware or the needs of individual user groups.
The HPC security incident in the spring of 2020, which affected systems worldwide, including some in NRW, has shown that best practice guides are essential, especially in the area of HPC system security, and that they are also gratefully received in institutions outside the state of NRW.
The best practice guides are first developed within HPC.NRW. As soon as they have reached a certain level of maturity, they are published on the HPC Wiki, which is publicly available to a worldwide audience. This open approach allows worldwide collaboration on the ongoing development of the guides.
MAPPING THE HPC SOFTWARE ECOSYSTEM
Joining forces: a prerequisite for bundling software requirements and for harmonization is an evaluation of the HPC software ecosystem on offer across all sites. Several aspects are relevant here:
- What software is available or installed?
- What consulting expertise is available for the software or the subject area at the site?
- Which software do the researchers actually use?
The competence network HPC.NRW answers these questions by combining several methods:
1. Cluster-wide software installations: Automated logging of cluster-wide software installations and, especially, environment modules. Evaluating how widely a software package is distributed across sites gives important indications as to which programs should receive extended support. The evaluation of the modules also made it possible to collect representative information about the naming conventions used in the environment modules. These findings are incorporated into the best practice guides and facilitate harmonization.
2. User perspective: Users are asked in a user survey which software they employ.
3. Survey of software usage in computational jobs: For this purpose, a lightweight data acquisition system was implemented that detects the programs employed by users at runtime in a non-invasive (no wrappers in job scripts or similar required) and easy-to-use way (for example, no separate database servers or similar required).
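The first method above, evaluating module inventories across sites, can be sketched as follows. This is only an illustration under stated assumptions, not the tooling used by HPC.NRW: the inventory data, the site names, and the helpers `parse_module`, `software_distribution`, and `naming_variants` are hypothetical, and module entries are assumed to have the flat form `name/version` (hierarchical module trees would need a different split).

```python
from collections import defaultdict


def parse_module(entry):
    """Split 'name/version' into (name, version); version is None if absent."""
    name, _, version = entry.strip().partition("/")
    return name, (version or None)


def software_distribution(inventory):
    """Map each software name (compared case-insensitively) to the sites offering it."""
    dist = defaultdict(set)
    for site, modules in inventory.items():
        for entry in modules:
            name, _ = parse_module(entry)
            dist[name.lower()].add(site)
    return dict(dist)


def naming_variants(inventory):
    """Return names that occur with more than one spelling across sites."""
    spellings = defaultdict(set)
    for modules in inventory.values():
        for entry in modules:
            name, _ = parse_module(entry)
            spellings[name.lower()].add(name)
    return {key: sorted(names) for key, names in spellings.items() if len(names) > 1}


# Hypothetical inventories, e.g. collected from each site's module listing.
inventory = {
    "site_a": ["GROMACS/2021.3", "OpenMPI/4.1.1", "gcc/10.3.0"],
    "site_b": ["gromacs/2021.3", "openmpi/4.1.1", "GCC/10.3.0", "cp2k/8.2"],
}

distribution = software_distribution(inventory)
variants = naming_variants(inventory)
```

Here `distribution` shows which packages are offered at several sites and which at only one, while `variants` lists names that differ only in capitalization, the kind of inconsistency a naming convention in a best practice guide would aim to remove.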
A competence network to support the participating sites in operational issues was established through the following activities:
- An NRW-wide HPC help desk was established as 3rd-level support to assist data center staff if problems arise that cannot be resolved locally.
- Setup of communication platforms, such as a protected area in the HPC Wiki for joint discussions on HPC operational topics, e.g. data exchange between HPC centers on a national and international level.
- Development of best practice guides addressing computer operation, published on the HPC Wiki.
- Coordination of an intensive exchange between computing centers in NRW on many technical issues.
- Presentations by members of HPC.NRW at HPC conferences such as the meetings of the ZKI supercomputing working group (ZKI AK SC).
Head of Software & Operation
RWTH Aachen University