Job management
Jobs in Vision are managed by the Slurm Workload Manager, a job scheduler which schedules jobs according to the available computational resources and the requirements of each job.
How to submit a job
To submit a job in Slurm, you first have to create a job script, which defines the job, the resources it requires and the tasks that will run on the nodes. In particular, you should:
Access Vision: check How to access Vision for details.
Copy your code to Vision: use scp or another tool that transfers files over SSH.
Create a Slurm job script: check Slurm job scripts and Job examples for details.
Submit the job: see below.
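The steps above can be sketched as a short shell script. This is only an illustration: the login host vision.example.org, the user name myuser and the directory my-project are hypothetical placeholders, not real Vision settings. By default the script only prints the commands it would run.

```shell
#!/bin/bash
# Sketch of the submission workflow. Every name below is a placeholder
# (an assumption for illustration), not an actual Vision setting.
VISION_HOST="${VISION_HOST:-vision.example.org}"  # hypothetical login host
VISION_USER="${VISION_USER:-myuser}"              # hypothetical user name
PROJECT_DIR="${PROJECT_DIR:-my-project}"          # local directory with code and job script
DRY_RUN="${DRY_RUN:-1}"                           # 1 = only print, 0 = really execute

# Print a command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Copy the code (including the job script) to the home directory on Vision.
run scp -r "$PROJECT_DIR" "$VISION_USER@$VISION_HOST:"

# Submit the job script from inside the copied directory.
run ssh "$VISION_USER@$VISION_HOST" "cd $PROJECT_DIR && sbatch my-job-script.sh"
```

Running it with DRY_RUN=0 performs the copy and the submission, provided the placeholders have been replaced with real values.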
Slurm job scripts
Slurm job scripts allow users to define the jobs that will run on the cluster. These are bash scripts that specify: 1) general properties of the job, such as its name and output files; 2) the resources that will be allocated to the job; and 3) the code that will be run.
The following example represents a simple Slurm job script:
#!/bin/bash
#SBATCH --job-name=cnn                   # create a short name for your job
#SBATCH --output="slurm-cnn-venv-%j.out" # %j will be replaced by the Slurm job ID
#SBATCH --nodes=1                        # node count
#SBATCH --ntasks=1                       # total number of tasks across all nodes
#SBATCH --cpus-per-task=4                # CPU cores per task (>1 if multi-threaded tasks)
#SBATCH --gres=gpu:2                     # number of GPUs per node

source venv/bin/activate

python3 cnn.py

deactivate
In this script, we start by 1) defining the job name and the output file (the --job-name and --output directives); 2) specifying the required resources, 4 CPUs and 2 GPUs on a single node (the --nodes, --ntasks, --cpus-per-task and --gres directives); and 3) defining the code that will run (from the source command to deactivate). For more information about this and other examples, please check Job examples.
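As a variation on the script above, the following sketch logs what Slurm actually allocated before the workload starts, which helps when checking that the directives had the intended effect. The job name cnn-info and the output file name are illustrative. SLURM_JOB_ID, SLURM_JOB_NODELIST and SLURM_CPUS_PER_TASK are environment variables that Slurm sets inside a job; CUDA_VISIBLE_DEVICES is normally set when GPUs are allocated, but that depends on the cluster configuration. Setting OMP_NUM_THREADS assumes the workload uses OpenMP-style threading.

```shell
#!/bin/bash
#SBATCH --job-name=cnn-info              # illustrative job name
#SBATCH --output="slurm-cnn-info-%j.out" # %j will be replaced by the Slurm job ID
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:2

# Print the allocation as seen from inside the job.
# Outside of Slurm these variables are unset, so a fallback text is shown.
describe_allocation() {
    echo "job id: ${SLURM_JOB_ID:-not set}"
    echo "nodes: ${SLURM_JOB_NODELIST:-not set}"
    echo "cpus per task: ${SLURM_CPUS_PER_TASK:-not set}"
    echo "visible gpus: ${CUDA_VISIBLE_DEVICES:-not set}"
}
describe_allocation

# Match the thread count to the allocated CPUs (assumption: OpenMP-style workload).
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"

# The workload itself would follow here, as in the script above:
# source venv/bin/activate; python3 cnn.py; deactivate
```

The printed lines end up in the job's output file, next to the output of the workload.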
Accounting
The resources used by a user are always associated with a project and are logged in order to control cluster usage. To check the resources used in a time interval, users should use the sreport command:
$ sreport cluster AccountUtilizationByUser account=project-x start=0101 -T cpu,gres/gpu
--------------------------------------------------------------------------------
Cluster/Account/User Utilization 2022-01-01T00:00:00 - 2022-05-10T23:59:59 (11232000 secs)
Usage reported in TRES Minutes
--------------------------------------------------------------------------------
Cluster   Account         Login     Proper Name     TRES Name           Used
--------- --------------- --------- --------------- -------------- ---------
hpc       project-x                                 cpu              2863935
hpc       project-x                                 gres/gpu           88738
hpc       project-x       user1     user1           cpu                   17
hpc       project-x       user1     user1           gres/gpu              14
hpc       project-x       user2     user2           cpu              2863917
hpc       project-x       user2     user2           gres/gpu           88724
In this example, the user is checking the resources consumed in the context of project project-x since January 1. Usage values are presented in minutes.
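Because usage is reported in minutes, it is often convenient to convert it to hours. The sketch below does this with awk on pipe-separated rows. It assumes that sreport is called with -n (no header) and -P (parsable output) and that the fields keep the order shown above (Cluster, Account, Login, Proper Name, TRES Name, Used); usage_in_hours is a hypothetical helper name, and the sample rows are taken from the example output.

```shell
#!/bin/bash
# Convert per-user usage from TRES minutes to hours.
# Assumed input layout: Cluster|Account|Login|Proper Name|TRES Name|Used
usage_in_hours() {
    # Skip the account-level rows, which have an empty Login field.
    awk -F'|' '$3 != "" { printf "%s %s %.2f\n", $3, $5, $6 / 60 }'
}

# Sample rows based on the example output, in the assumed parsable layout.
sample='hpc|project-x|||cpu|2863935
hpc|project-x|user1|user1|cpu|17
hpc|project-x|user2|user2|gres/gpu|88724'

printf '%s\n' "$sample" | usage_in_hours

# On the cluster (assumption: sreport is on the PATH):
# sreport -n -P cluster AccountUtilizationByUser account=project-x start=0101 -T cpu,gres/gpu | usage_in_hours
```

Each output line has the form "login TRES hours", which is easy to sort or sum with further shell tools.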
Common Slurm commands
In this section you can find some common, useful commands to submit, manage and monitor jobs. You can find more detailed information at https://slurm.schedmd.com/quickstart.html.
Submit a job
To submit a job, you should use the sbatch command:
$ sbatch my-job-script.sh
Submitted batch job 439
In this example, the job was submitted with the id 439.
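When a script needs the job id for later commands, it can be extracted from the confirmation message. The following sketch parses the message shown above; parse_job_id is a hypothetical helper name. sbatch also offers a --parsable option that prints only the job id (followed by the cluster name on multi-cluster setups), which may be simpler where it fits.

```shell
#!/bin/bash
# Extract the numeric job id from sbatch's confirmation message.
parse_job_id() {
    # The id is the last whitespace-separated field of the message.
    printf '%s\n' "$1" | awk '/^Submitted batch job/ { print $NF }'
}

job_id=$(parse_job_id "Submitted batch job 439")
echo "job id: $job_id"

# On the cluster, the message would come from sbatch itself:
# job_id=$(parse_job_id "$(sbatch my-job-script.sh)")
```

The captured id can then be passed to the other commands in this section, such as scancel or scontrol.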
Cancel a job
To cancel a job, the user should use the scancel command:
$ scancel 439
In this example, job 439 was cancelled.
List job queue
To list the job queue, the user should use the squeue command. This command lists all jobs submitted to the cluster, including their status and the node(s) where they are running:
$ squeue
JOBID PARTITION NAME     USER ST TIME  NODES NODELIST(REASON)
444   compute   theJobNa user R  11:01 1     vision2
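For a quick overview of a busy queue, the output can be summarised per job state. The sketch below counts jobs by the ST column (R for running, PD for pending). It assumes the default column layout shown above and job names without spaces; count_by_state is a hypothetical helper name, and jobs 445 and 446 in the sample are invented for illustration.

```shell
#!/bin/bash
# Count jobs per state from squeue's default output.
# Assumption: ST is the fifth column and job names contain no spaces.
count_by_state() {
    awk 'NR > 1 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# Sample output; only job 444 comes from the example above.
sample='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
444 compute theJobNa user R 11:01 1 vision2
445 compute other user PD 0:00 1 (Resources)
446 compute third user R 2:10 1 vision1'

printf '%s\n' "$sample" | count_by_state

# On the cluster, restricted to your own jobs:
# squeue -u "$USER" | count_by_state
```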
List job information
To list detailed information about a job, the user should use the scontrol command. This lists the relevant information about the job, including the requested resources, the job script and the output files. The user needs to know the job id:
$ scontrol show jobid <jobId>
JobId=444 JobName=The_Job_Name
UserId=user(1000) GroupId=emedeiros(1000) MCS_label=N/A
Priority=4294901696 Nice=0 Account=asr-pt QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:18:08 TimeLimit=5-00:00:00 TimeMin=N/A
SubmitTime=2022-05-11T10:32:29 EligibleTime=2022-05-11T10:32:29
AccrueTime=2022-05-11T10:32:29
StartTime=2022-05-11T10:32:30 EndTime=2022-05-16T10:32:30 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-05-11T10:32:30
Partition=compute AllocNode:Sid=vision1:3110592
ReqNodeList=vision2 ExcNodeList=(null)
NodeList=vision2
BatchHost=vision2
NumNodes=1 NumCPUs=255 NumTasks=1 CPUs/Task=255 ReqB:S:C:T=0:0:*:*
TRES=cpu=255,mem=980288M,node=1,billing=255,gres/gpu=8
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=255 MinMemoryNode=980288M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/path/to/my-job-script.sh
WorkDir=/path/of/my-job
StdErr=/path/to/my-job-script.out
StdIn=/dev/null
StdOut=/path/to/my-job-script.out
Power=
TresPerNode=gpu:8
NtasksPerTRES:0
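The output above is a sequence of Key=Value pairs, so a single value can be pulled out with standard shell tools. The following sketch is one way to do it; job_field is a hypothetical helper name, and the approach assumes that the value of interest contains no whitespace (a path with spaces, for example, would be cut short).

```shell
#!/bin/bash
# Print the value of one key from "scontrol show jobid" style output.
# $1 = key name; the scontrol output is read from standard input.
job_field() {
    # Put each Key=Value pair on its own line, then print the first match
    # with the "Key=" prefix removed.
    tr -s '[:space:]' '\n' |
        awk -F'=' -v key="$1" '$1 == key { sub(/^[^=]*=/, ""); print; exit }'
}

# A few lines taken from the example output above.
sample='JobId=444 JobName=The_Job_Name
   JobState=RUNNING Reason=None Dependency=(null)
   RunTime=00:18:08 TimeLimit=5-00:00:00 TimeMin=N/A
   TRES=cpu=255,mem=980288M,node=1,billing=255,gres/gpu=8'

printf '%s\n' "$sample" | job_field JobState
printf '%s\n' "$sample" | job_field RunTime

# On the cluster:
# scontrol show jobid 444 | job_field JobState
```

Only the first "=" separates key and value, so values that themselves contain "=" (such as TRES) are returned whole.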
Check node status
To check the status of each node of the cluster, users should use the sinfo command:
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute*  up    infinite  1     mix   vision1
compute*  up    infinite  1     alloc vision2
debug     up    15:00     1     mix   vision1
debug     up    15:00     1     alloc vision2
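To find nodes that can still accept work, the sinfo output can be filtered by partition and state. In Slurm, mix means that only part of a node's CPUs are allocated, while alloc means the node is fully allocated. The sketch below assumes the default column layout of the example above; nodes_in_state is a hypothetical helper name.

```shell
#!/bin/bash
# List the nodes of a partition that are in a given state.
# $1 = partition name (without the trailing "*" that marks the default)
# $2 = state, for example idle, mix or alloc
# Assumed columns: PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
nodes_in_state() {
    awk -v part="$1" -v state="$2" '
        NR > 1 {
            p = $1
            sub(/\*$/, "", p)          # drop the default-partition marker
            if (p == part && $5 == state) print $6
        }'
}

# Sample output taken from the example above.
sample='PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up infinite 1 mix vision1
compute* up infinite 1 alloc vision2
debug up 15:00 1 mix vision1
debug up 15:00 1 alloc vision2'

printf '%s\n' "$sample" | nodes_in_state compute mix

# On the cluster:
# sinfo | nodes_in_state compute idle
```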