srun is the SLURM command for launching tasks (job steps) on the compute nodes of a SLURM-managed HPC cluster. Besides running parallel programs, it is commonly used to start an interactive job or to connect to an existing one. An interactive job gives the user direct access to a compute node, either for debugging or for running programs that need user input.
When used to create an interactive job, srun launches a shell on a compute node so the user can run commands there directly. This is useful for testing code or running programs that need user input. srun sends a resource request (CPU cores, memory, GPUs, time limit) to the scheduler, waits until the request is granted, and then starts the shell on the partition given with --partition, or on the cluster's default partition if none is specified.
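A common pattern is to wrap the interactive request in a small shell function so that the partition, CPU count, memory, and time limit are always stated explicitly. The bash sketch below assumes a hypothetical function name, start_interactive, and a hypothetical DRY_RUN variable for printing the command instead of running it; partition names such as debug differ between clusters. The flags themselves (--partition, --ntasks, --cpus-per-task, --mem, --time, --pty) are standard srun options.

```shell
#!/usr/bin/env bash
# Hypothetical helper: request an interactive bash shell on a compute node.
# Usage: start_interactive PARTITION CPUS MEM TIME
#   PARTITION  partition name (site-specific, e.g. "debug")
#   CPUS       CPU cores for the single shell task
#   MEM        memory for the job, e.g. "8G"
#   TIME       wall-clock limit, e.g. "01:00:00"
# Set DRY_RUN=1 to print the srun command instead of running it.
start_interactive() {
    if [ "$#" -ne 4 ]; then
        echo "usage: start_interactive PARTITION CPUS MEM TIME" >&2
        return 2
    fi
    local cmd=(srun
        --partition="$1"
        --ntasks=1
        --cpus-per-task="$2"
        --mem="$3"
        --time="$4"
        --pty /bin/bash)
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "${cmd[*]}"
    else
        # srun blocks until the scheduler grants the allocation, then opens
        # the shell; exiting the shell ends the job and frees the resources.
        "${cmd[@]}"
    fi
}
```

For example, start_interactive debug 4 8G 01:00:00 asks for one task with 4 cores and 8 GB of memory for one hour. Requesting a single task with several cores keeps exactly one shell running, while the cores remain available to whatever you start from it.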
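When used to connect to an existing job, srun starts a new job step inside that job's allocation, so the user can run commands on the nodes the job already holds. This is useful for debugging, monitoring the job's progress (for example with top, or by inspecting its output files), or checking which processes it is running. Changing a job's parameters is done with scontrol, not srun.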
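To attach to a job you first need its job ID, which squeue reports. The bash sketch below, built around a hypothetical function named attach_to_job, uses an explicit job ID if one is given and otherwise picks the caller's first running job. It adds --overlap, which SLURM 20.11 and later need so that the new step may share CPUs already used by the job's running step; remove it on older releases, which do not recognize the flag.

```shell
#!/usr/bin/env bash
# Hypothetical helper: open an interactive shell inside a running job.
# Usage: attach_to_job [JOB_ID]
# Without JOB_ID, the caller's first RUNNING job (as listed by squeue) is used.
# Set DRY_RUN=1 to print the srun command instead of running it.
attach_to_job() {
    local job_id="${1:-}"
    if [ -z "$job_id" ]; then
        # -t RUNNING: skip pending jobs; -h: no header; -o "%i": job ID only
        job_id=$(squeue -u "$USER" -t RUNNING -h -o "%i" | head -n 1)
    fi
    if [ -z "$job_id" ]; then
        echo "attach_to_job: no running job found" >&2
        return 1
    fi
    # --overlap lets this step share CPUs with the job's existing step
    # (SLURM 20.11 and later).
    local cmd=(srun --jobid="$job_id" --overlap --pty /bin/bash)
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "${cmd[*]}"
    else
        "${cmd[@]}"
    fi
}
```

Exiting the attached shell ends only that job step; the original job keeps running until it finishes or is cancelled.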
srun provides options for customizing the resources allocated to the job, such as the number of nodes and tasks, CPU cores per task, memory, and GPUs. It also launches parallel applications: each MPI rank runs as a separate task, and multithreaded (for example, OpenMP) programs can be given several cores per task with --cpus-per-task.
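As an illustration of the hybrid MPI plus OpenMP case, the bash sketch below launches a program with a given number of nodes, MPI ranks per node, and OpenMP threads per rank. The function name launch_hybrid, the DRY_RUN variable, and the program ./hybrid_app used in the usage note are hypothetical. Whether an extra option such as --mpi=pmix is needed depends on how MPI was built on your cluster, so check the site documentation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: launch a hybrid MPI + OpenMP program with srun.
# Usage: launch_hybrid NODES RANKS_PER_NODE THREADS_PER_RANK PROGRAM [ARGS...]
# Each MPI rank is one SLURM task; each task gets THREADS_PER_RANK CPUs
# and runs that many OpenMP threads.
# Set DRY_RUN=1 to print the command instead of running it.
launch_hybrid() {
    if [ "$#" -lt 4 ]; then
        echo "usage: launch_hybrid NODES RANKS_PER_NODE THREADS_PER_RANK PROGRAM [ARGS...]" >&2
        return 2
    fi
    local nodes="$1" ranks="$2" threads="$3"
    shift 3
    local cmd=(srun
        --nodes="$nodes"
        --ntasks-per-node="$ranks"
        --cpus-per-task="$threads"
        "$@")
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "OMP_NUM_THREADS=$threads ${cmd[*]}"
    else
        # srun forwards the caller's environment to every task by default,
        # so each rank sees this thread count; the caller's shell is unchanged.
        OMP_NUM_THREADS="$threads" "${cmd[@]}"
    fi
}
```

For example, launch_hybrid 2 4 8 ./hybrid_app starts 8 ranks (4 on each of 2 nodes), each with 8 cores and 8 OpenMP threads. GPUs can be requested the same way by adding an option such as --gres=gpu:1 to the srun command, although GRES names are site-specific.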
If running srun fails with the following error:
srun: command not found
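the SLURM client tools are not installed or not on your PATH. On an HPC cluster they are normally preinstalled on the login nodes, so ask the administrators if the command is missing there. On your own machine you can install the client package for your distribution, but it is only useful once it is configured to reach a cluster's controller: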
Distribution | Command |
---|---|
Debian | apt-get install slurm-client |
Ubuntu | apt-get install slurm-client |
Kali Linux | apt-get install slurm-client |
Fedora | dnf install slurm |
OS X | No Homebrew package (brew install slurm installs an unrelated network monitor) |
Raspbian | apt-get install slurm-client |
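Whichever package you install, it is worth confirming that srun is on PATH and that it can reach a SLURM controller, since the client tools do nothing without a configured cluster. The bash sketch below performs both checks; the function name check_slurm_client is hypothetical, while srun --version and sinfo --summarize are standard SLURM commands.

```shell
#!/usr/bin/env bash
# Hypothetical helper: check that the SLURM client is installed and can
# reach a cluster controller. Returns 0 only if both checks pass.
check_slurm_client() {
    if ! command -v srun >/dev/null 2>&1; then
        echo "srun not found on PATH" >&2
        return 1
    fi
    srun --version    # prints the installed SLURM version
    # sinfo must contact the controller (slurmctld); it fails when no
    # cluster is configured or the controller is unreachable.
    if ! sinfo --summarize >/dev/null 2>&1; then
        echo "srun is installed but no SLURM controller is reachable" >&2
        return 1
    fi
    echo "SLURM client OK"
}
```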
srun Command Examples
1. Submit a basic interactive job:
# srun --pty /bin/bash
2. Submit an interactive job with specific resources (replace num_cores and memory_MB with the values you need):
# srun --ntasks-per-node=num_cores --mem-per-cpu=memory_MB --pty /bin/bash
3. Open a shell inside a running job (replace job_id with the ID reported by squeue):
# srun --jobid=job_id --pty /bin/bash
Summary
In summary, srun is the SLURM command for launching work on compute nodes. It can start interactive jobs and open shells inside running jobs, which makes it convenient for debugging, testing, and monitoring jobs on the cluster.