Running Jobs¶
Slurm¶
FarmShare uses Slurm for job (resource) management. Jobs are scheduled according to a priority which depends on a number of factors, including how long a job has been waiting, its size, and a fair-share value that tracks recent per-user utilization of cluster resources. Lower-priority jobs, and jobs requiring access to resources not currently available, may wait some time before starting to run. The scheduler may reserve resources so that pending jobs can start; while it will try to backfill these resources with smaller, shorter jobs (even those at lower priorities), this behavior can sometimes cause nodes to appear to be idle even when there are jobs that are ready to run.
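If you are curious why a pending job has not started, the standard Slurm `sprio` command (assuming it is available on FarmShare) breaks a pending job's priority into its component factors:

```bash
sprio -l    # long format: age, fair-share, and job-size priority factors per pending job
```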
Slurm commands¶
Slurm Man Pages
Full documentation and detailed usage information are provided in the man pages.
Slurm allows requesting resources and submitting jobs in a variety of ways. The main Slurm commands are listed in the table below:
| Command | Description | Behavior |
|---|---|---|
| `salloc` | Request resources and allocate them to a job | Starts a new shell, but does not execute anything |
| `srun` | Request resources and run a command on the allocated compute node | Executes a command on a compute node |
| `sbatch` | Request resources and run a script on the allocated compute node | Submits a batch script to Slurm |
| `squeue` | View job and job step information | Displays job information |
| `scancel` | Signal or cancel jobs, job arrays, or job steps | Cancels a running job |
| `sinfo` | View information about Slurm nodes and partitions | Displays partition information |
| `scontrol` | View detailed information on jobs, nodes, partitions, reservations, and configuration | Displays detailed Slurm information |
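As a quick orientation, here are a few common invocations of these commands (the job ID is illustrative):

```bash
sinfo                       # list partitions and node states
squeue -u $USER             # list your own jobs
scancel 300992              # cancel a job by its job ID
scontrol show job 300992    # show detailed information for a job
```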
Interactive Jobs¶
Interactive sessions that require resources in excess of the limits on the login nodes, exclusive access to resources, or access to a feature not available on the login nodes (e.g., a GPU) can be run on a compute node. Each user is allowed one interactive job, which may run for at most one day. The example below shows how to request an interactive session using `srun`:
```
ta5@rice-02:~$ srun --partition=interactive --qos=interactive --pty bash
ta5@iron-03:~$
```
Notice that the prompt changed from `rice-02` (login node) to `iron-03` (compute node). Check your job info with `squeue`:
```
ta5@iron-03:~$ squeue -a -u ta5
 JOBID   PARTITION  NAME  USER  ST  TIME_LIMIT  NODES  CPUS  MIN_MEMORY
309641 interactive  bash   ta5   R     1:00:00      1     1       4000M
```
Here we can see the default limits: one CPU, 4000 MB of memory, and a one-hour walltime.
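If the defaults are not enough, you can request more on the `srun` command line, up to the interactive QoS limits shown later in this document. The values below are illustrative, not recommendations:

```bash
# Request 2 CPUs, 8 GB of memory, and a 2-hour walltime (illustrative values,
# within the interactive QoS limits of cpu=16 and mem=64G)
srun --partition=interactive --qos=interactive \
     --cpus-per-task=2 --mem=8G --time=02:00:00 --pty bash
```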
Batch Jobs¶
The `sbatch` command is used to submit a batch job. A job is simply an instance of your application (for example, your R or Python script) that is submitted to and executed by the job scheduler (Slurm). When you submit a job with the `sbatch` command, it is called a batch job. Options are used to request specific resources (including runtime) and can be provided either on the command line or, using the `#SBATCH` directive syntax, in the script file itself.
Common options are outlined below; a short example combining them follows the list. Some options have a short and a long version. Refer to the man page for all options.
CPUs: `-c`, `--cpus-per-task`
How many CPUs the application you are calling in the sbatch script needs. Unless it can utilize multiple CPUs at once, you should request a single CPU. Check your code's documentation, or try running it in an interactive session with `htop` if you are unsure.
memory (RAM): `--mem`
How much memory your job will consume. Some things to consider: will it load a large file or matrix into memory? Does it consume a lot of memory on your laptop? The default memory is sufficient for many jobs.
time: `-t`, `--time`
How long will it take for your code to run to completion?
partition: `-p`, `--partition`
Which set of compute nodes on FarmShare will you run on: normal, interactive, or bigmem? The default partition on FarmShare is the normal partition.
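Put together, a script header requesting these resources might look like the following sketch. The values are illustrative, not recommendations:

```bash
#!/bin/bash
# Illustrative resource requests; adjust to what your job actually needs.
#SBATCH --job-name=myjob
#SBATCH --cpus-per-task=1
#SBATCH --mem=8G
#SBATCH --time=01:00:00
#SBATCH --partition=normal

# your commands go here
```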
Example¶
To submit batch jobs to the scheduler:
- Create an application script
- Create a Slurm job script that runs the application script
- Submit the job script to the job scheduler using `sbatch`
Sample Python script `sum.py` that calculates the sum of one to five:
```python
a = (1, 2, 3, 4, 5)
x = sum(a)
print(x)
```
Sample job submission script `tutorial.sh` requesting one CPU on one node in the normal partition to run `python sum.py`:
```bash
#!/bin/bash
#SBATCH --job-name=tutorial
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --partition=normal
python sum.py
```
Submit the job with `sbatch`:
```
$ sbatch tutorial.sh
Submitted batch job 300992
```
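Once the job finishes, its output is written to a file named `slurm-<jobid>.out` in the submission directory (Slurm's default output file name). For this example you would expect something like:

```
$ cat slurm-300992.out
15
```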
Partition/QoS Info¶
FarmShare provides the following partitions and QoS:
| Partition | Max Memory | Max CPUs |
|---|---|---|
| normal | 188 GB | 256 |
| bigmem | 768 GB | 344 |
| interactive | 188 GB | 16 |
The QoS limits can be displayed with `sacctmgr`:

```
$ sacctmgr show qos format=name%11,maxsubmitjobspu,maxjobspu,mintres%10,maxtrespu%25,maxwall
       Name MaxSubmitPU MaxJobsPU    MinTRES                 MaxTRESPU     MaxWall
----------- ----------- --------- ---------- ------------------------- -----------
     normal        1024       128                   cpu=256,gres/gpu=3
interactive           3         3            cpu=16,gres/gpu=1,mem=64G
        dev           1         1             cpu=8,gres/gpu=1,mem=32G    08:00:00
       long          32         4                               cpu=32  7-00:00:00
 caddyshack           1         1                        cpu=8,mem=32G
     bigmem          32         4   mem=192G                  mem=768G
        gpu          32         4 gres/gpu=1                gres/gpu=6
```
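Note that some QoS have a minimum request (MinTRES); for example, the bigmem QoS requires jobs to request at least 192 GB of memory. A submission might look like the following sketch (the script name is hypothetical, and whether `--qos` must be passed explicitly may depend on the partition's default QoS):

```bash
# Hypothetical bigmem submission; my_bigmem_job.sh is a placeholder script name.
# The bigmem QoS shown above enforces a minimum memory request of 192 GB.
sbatch --partition=bigmem --qos=bigmem --mem=192G my_bigmem_job.sh
```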