Job management

Jobs on Vision are managed by the Slurm Workload Manager, which schedules jobs according to the available computational resources and the requirements of each job.

How to submit a job

To submit a job in Slurm, you first have to create a job script, which defines the job, the resources it requires and the tasks that will run on the nodes. In particular, you should:

Access Vision: check How to access Vision for details.

Copy your code to Vision: use scp or another tool that transfers files over SSH.

Create a Slurm job script: check Slurm job scripts and Job examples for details.

Submit the job: see below.
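The steps above can be sketched as a small script. This is only an illustration: the login name, the host name vision.example.org, and the directory my-job are hypothetical placeholders, not values taken from this documentation. By default the sketch only prints the commands it would run; set DRY_RUN=0 to actually execute them.

```shell
#!/bin/bash
# Hypothetical values: replace them with your own login, host and paths.
VISION_USER="user"
VISION_HOST="vision.example.org"   # assumption: not the real host name
REMOTE_DIR="my-job"                # assumption: remote directory name

# Print each command instead of running it, unless DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

submit_workflow() {
    # Copy the code and the job script to Vision over SSH.
    run scp -r ./my-job "${VISION_USER}@${VISION_HOST}:${REMOTE_DIR}"
    # Submit the job script from the job directory on Vision.
    run ssh "${VISION_USER}@${VISION_HOST}" "cd ${REMOTE_DIR} && sbatch my-job-script.sh"
}

submit_workflow
```

Submitting from inside the job directory keeps relative paths in the job script (such as venv/bin/activate) valid.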

Slurm job scripts

Slurm job scripts allow users to define the jobs that will run on the cluster. They are bash scripts that specify: 1) general properties of the job, such as its name and output files; 2) the resources that will be allocated to the job; and 3) the code that will be run.

The following example represents a simple Slurm job script:

#!/bin/bash
#SBATCH --job-name=cnn                    # create a short name for your job
#SBATCH --output="slurm-cnn-venv-%j.out"  # %j will be replaced by the slurm jobID
#SBATCH --nodes=1                         # node count
#SBATCH --ntasks=1                        # total number of tasks across all nodes
#SBATCH --cpus-per-task=4                 # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --gres=gpu:2                      # number of gpus per node

source venv/bin/activate

python3 cnn.py

deactivate

In this script, we 1) define the job name and the output file (lines 2-3); 2) specify the required resources (4 CPUs and 2 GPUs) (lines 4-7); and 3) define the code that will run (lines 9-13). For more information about this and other examples, please check Job examples.
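Inside a running job, Slurm exports environment variables that describe the allocation, which can help confirm that the requested resources were actually granted. The sketch below reads SLURM_JOB_ID and SLURM_CPUS_PER_TASK (the latter is only set when --cpus-per-task is given) and CUDA_VISIBLE_DEVICES, which Slurm typically sets for GPU allocations. The fallback values and the OMP_NUM_THREADS convention are assumptions added for illustration, so that the script also runs outside Slurm.

```shell
#!/bin/bash
#SBATCH --job-name=cnn
#SBATCH --output="slurm-cnn-venv-%j.out"
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:2

# The defaults after ":-" only apply when the script runs outside Slurm.
job_id="${SLURM_JOB_ID:-none}"
cpus="${SLURM_CPUS_PER_TASK:-1}"
gpus="${CUDA_VISIBLE_DEVICES:-none}"

# Log the allocation at the top of the job output file.
echo "job=${job_id} cpus=${cpus} gpus=${gpus}"

# Common convention (an assumption, not a Vision requirement):
# match the thread count to the allocated CPU cores.
export OMP_NUM_THREADS="${cpus}"
```

The rest of the job (activating the virtual environment and running cnn.py) would follow these lines unchanged.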

Accounting

The resources used by a user are always associated with a project and are logged in order to control cluster usage. To check the resources used in a given time interval, use the sreport command:

$ sreport cluster AccountUtilizationByUser account=project-x start=0101 -T cpu,gres/gpu
--------------------------------------------------------------------------------
Cluster/Account/User Utilization 2022-01-01T00:00:00 - 2022-05-10T23:59:59 (11232000 secs)
Usage reported in TRES Minutes
--------------------------------------------------------------------------------
  Cluster         Account     Login     Proper Name      TRES Name      Used
--------- --------------- --------- --------------- -------------- ---------
      hpc       project-x                                      cpu   2863935
      hpc       project-x                                 gres/gpu     88738
      hpc       project-x     user1           user1            cpu        17
      hpc       project-x     user1           user1       gres/gpu        14
      hpc       project-x     user2           user2            cpu   2863917
      hpc       project-x     user2           user2       gres/gpu     88724

In this example, the user is checking the resources consumed in the context of project project-x since January 1. Usage values are presented in minutes.
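Because the report is expressed in TRES minutes, it can be convenient to convert the values to hours. The sketch below assumes the report is produced with sreport's -n (no header) and -P (pipe-delimited) options and that the columns keep the order shown above; the helper name usage_in_hours is hypothetical. The sample rows are copied from the report above so the sketch runs without Slurm.

```shell
#!/bin/bash
# Convert the "Used" column (TRES minutes) to hours, one line per user and TRES.
# Expected input: Cluster|Account|Login|Proper Name|TRES Name|Used
usage_in_hours() {
    # Skip the per-account total rows, which have an empty Login column.
    awk -F'|' '$3 != "" { printf "%s %s %.1f\n", $3, $5, $6 / 60 }'
}

# Sample rows taken from the report above.
sample='hpc|project-x|||cpu|2863935
hpc|project-x|user1|user1|cpu|17
hpc|project-x|user2|user2|gres/gpu|88724'

printf '%s\n' "$sample" | usage_in_hours

# On the cluster (not executed here):
#   sreport -n -P cluster AccountUtilizationByUser account=project-x \
#       start=0101 -T cpu,gres/gpu | usage_in_hours
```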

Common Slurm commands

This section lists common commands to submit, manage and monitor jobs. More detailed information is available at https://slurm.schedmd.com/quickstart.html.

Submit a job

To submit a job, you should use the sbatch command:

$ sbatch my-job-script.sh
Submitted batch job 439

In this example, the job was submitted with the id 439.
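When a script needs the job id for later commands, it can be extracted from the message printed by sbatch. The following is a minimal sketch; the helper name parse_job_id is hypothetical, and the message is the one from the example above so that the sketch runs without Slurm.

```shell
#!/bin/bash
# Extract the job id from sbatch's "Submitted batch job <id>" message.
parse_job_id() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Message from the example above.
job_id=$(echo "Submitted batch job 439" | parse_job_id)
echo "job id: ${job_id}"

# On the cluster (not executed here):
#   job_id=$(sbatch my-job-script.sh | parse_job_id)
# sbatch also has a --parsable option that prints the job id
# (followed by ";cluster" when a cluster name is present):
#   job_id=$(sbatch --parsable my-job-script.sh)
```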

Cancel a job

To cancel a job, the user should use the scancel command:

$ scancel 439

In this example, job 439 was cancelled.

List job queue

To list the job queue, the user should use the squeue command. This command lists all jobs submitted to the cluster, including their status and the node(s) where they are running:

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               444   compute theJobNa     user  R      11:01      1 vision2
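On a busy cluster it is often more useful to summarize the queue than to read it line by line. The sketch below counts jobs per state, assuming the input comes from squeue with -h (no header) and -o "%T" (one full state name per line). The helper name count_states and the sample states are hypothetical, used so the sketch runs without Slurm.

```shell
#!/bin/bash
# Count jobs per state; expects one state name per line on stdin.
count_states() {
    sort | uniq -c | awk '{ print $2 "=" $1 }'
}

# Hypothetical sample of job states.
printf 'RUNNING\nPENDING\nRUNNING\n' | count_states

# On the cluster (not executed here):
#   squeue -h -o "%T" | count_states               # all jobs
#   squeue -h -u "$USER" -o "%T" | count_states    # only your jobs
```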

List job information

To list detailed information about a job, the user should use the scontrol command. It lists the relevant information about the job, including the requested resources, the job script and the output files. The user needs to know the job id:

$ scontrol show jobid <jobId>
JobId=444 JobName=The_Job_Name
   UserId=user(1000) GroupId=emedeiros(1000) MCS_label=N/A
   Priority=4294901696 Nice=0 Account=asr-pt QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:18:08 TimeLimit=5-00:00:00 TimeMin=N/A
   SubmitTime=2022-05-11T10:32:29 EligibleTime=2022-05-11T10:32:29
   AccrueTime=2022-05-11T10:32:29
   StartTime=2022-05-11T10:32:30 EndTime=2022-05-16T10:32:30 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-05-11T10:32:30
   Partition=compute AllocNode:Sid=vision1:3110592
   ReqNodeList=vision2 ExcNodeList=(null)
   NodeList=vision2
   BatchHost=vision2
   NumNodes=1 NumCPUs=255 NumTasks=1 CPUs/Task=255 ReqB:S:C:T=0:0:*:*
   TRES=cpu=255,mem=980288M,node=1,billing=255,gres/gpu=8
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=255 MinMemoryNode=980288M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/path/to/my-job-script.sh
   WorkDir=/path/of/my-job
   StdErr=/path/to/my-job-script.out
   StdIn=/dev/null
   StdOut=/path/to/my-job-script.out
   Power=
   TresPerNode=gpu:8
   NtasksPerTRES:0
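The output above is a list of Key=Value pairs, so a single field can be pulled out with standard text tools. The following is a minimal sketch; the helper name job_field is hypothetical, and it assumes the value contains no spaces (which holds for fields such as JobState or TimeLimit, but not necessarily for paths). The sample lines are copied from the output above.

```shell
#!/bin/bash
# Print the value of one Key=Value field from scontrol output on stdin.
# $1 = field name, e.g. JobState
job_field() {
    # Put each Key=Value token on its own line, then match the key exactly.
    tr ' ' '\n' | awk -F= -v key="$1" \
        '$1 == key { print substr($0, length(key) + 2); exit }'
}

# Lines taken from the output above.
sample='JobId=444 JobName=The_Job_Name
   JobState=RUNNING Reason=None Dependency=(null)
   RunTime=00:18:08 TimeLimit=5-00:00:00 TimeMin=N/A'

printf '%s\n' "$sample" | job_field JobState

# On the cluster (not executed here):
#   scontrol show jobid 444 | job_field JobState
```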

Check node status

To check the status of each node of the cluster, users should use the sinfo command:

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up   infinite      1    mix vision1
compute*     up   infinite      1  alloc vision2
debug        up      15:00      1    mix vision1
debug        up      15:00      1  alloc vision2
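Since each node appears once per partition, the same node can show up several times in this listing. The sketch below prints the distinct nodes in a given state, assuming the default sinfo column layout shown above; the helper name nodes_in_state is hypothetical. The sample is the output above, so the sketch runs without Slurm.

```shell
#!/bin/bash
# Print the distinct node names that are in a given state.
# $1 = state as shown in the STATE column, e.g. mix, alloc, idle
nodes_in_state() {
    # Skip the header line, match the STATE column, print NODELIST.
    awk -v st="$1" 'NR > 1 && $5 == st { print $6 }' | sort -u
}

# Output copied from the example above.
sample='PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up   infinite      1    mix vision1
compute*     up   infinite      1  alloc vision2
debug        up      15:00      1    mix vision1
debug        up      15:00      1  alloc vision2'

printf '%s\n' "$sample" | nodes_in_state mix

# On the cluster (not executed here):
#   sinfo | nodes_in_state idle
```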