Registration has ended
Batch Job Scheduling
Level: basic to advanced. Adapted exercises offer opportunities to deepen knowledge at all levels. Because job scheduling relies on shell scripting, it is recommended to also attend the introductory course on Linux and shell scripting in the morning.
HPC Skill Tree: K4.2
The resources of HPC systems are managed by a job scheduler, so knowing how to use the scheduling system appropriately is critical for working on HPC systems. This course introduces the concepts of batch job scheduling and teaches techniques for submitting and managing multiple, interdependent jobs using advanced scheduler features. The concepts are illustrated using the Slurm scheduler, which is used on all university clusters in Hessen.
In this course, participants will learn about:
- Creating job scripts: Job scripts define a calculation's resource requirements, its runtime environment, and the software to run. Numerous parameters allow precise configuration of process distribution, memory and GPU allocation, output, and user notifications. As HPC computations are almost exclusively started via job scripts, this knowledge is essential for working on a cluster.
- Controlling and monitoring cluster jobs: With the tools provided by Slurm, users can submit, update, cancel, and monitor their calculations, and obtain performance metrics and other metadata from completed jobs. Especially in large computation campaigns, these tools are invaluable for keeping an overview of the project and gathering insights for future research.
- Modelling multi-step workflows with job arrays and dependencies: A project often requires running a large number of similar calculations, setting up a chain of calculations that depend on each other's results, or both. We explore Slurm's job array and job dependency features and how both can be used to automate HPC tasks and avoid repetitive work.
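To give a flavour of these three topics, the following is a minimal sketch rather than course material. It writes two job scripts into a directory named `slurm_demo`, one as a ten-task job array and one as a follow-up step, and submits them with a dependency if `sbatch` is available. The directory name, the programs `my_simulation` and `collect_results`, the input file naming, and all resource values are hypothetical placeholders; actual limits and partitions depend on the cluster.

```shell
#!/bin/bash
# Sketch only: program names, file names, and resource values are assumptions.
set -euo pipefail

workdir=slurm_demo            # hypothetical output directory for this sketch
mkdir -p "$workdir"

# Job array: ten similar tasks; each task picks its input via SLURM_ARRAY_TASK_ID.
cat > "$workdir/simulate.sh" <<'EOF'
#!/bin/bash
#SBATCH --job-name=simulate
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=2G
#SBATCH --time=00:30:00
#SBATCH --array=1-10
#SBATCH --output=simulate_%A_%a.out
#SBATCH --mail-type=END,FAIL

# my_simulation is a placeholder for the actual program.
srun ./my_simulation "input_${SLURM_ARRAY_TASK_ID}.dat"
EOF

# Follow-up step that should only run after all array tasks have succeeded.
cat > "$workdir/collect.sh" <<'EOF'
#!/bin/bash
#SBATCH --job-name=collect
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --output=collect_%j.out

# collect_results is a placeholder for the actual post-processing program.
./collect_results simulate_*.out
EOF

if command -v sbatch >/dev/null 2>&1; then
    # --parsable prints only the job ID (optionally followed by ";cluster").
    array_id=$(sbatch --parsable "$workdir/simulate.sh")
    array_id=${array_id%%;*}
    # afterok: start only once every task of the array job has completed successfully.
    sbatch --dependency=afterok:"$array_id" "$workdir/collect.sh"
    # Show this user's pending and running jobs.
    squeue -u "$USER"
else
    echo "sbatch not found; job scripts written to $workdir"
fi
```

After the jobs have finished, a command such as `sacct -j <jobid> --format=JobID,State,Elapsed,MaxRSS` reports their state and resource usage, and `scancel <jobid>` cancels a job that is still pending or running.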