Convolutional Neural Network (CNN)
This example shows how to submit a Slurm job that uses Conda for dependency management. It uses TensorFlow and is based on the official TensorFlow tutorial: https://www.tensorflow.org/tutorials/images/cnn.
Conda is available on all nodes of Vision (head and compute nodes) in /opt/conda/. You can use this Conda installation or, if you prefer, install another version in your home folder: https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html.
In this example we use Conda installed in /opt/conda/.
To submit a Python application that uses Conda for dependency management and project isolation as a Slurm job, you need to perform the following tasks:
Create the Conda environment with the project dependencies
Define the Slurm job script
Submit the Slurm job
1. Creating the Conda environment
To submit this script to Slurm with Conda managing the dependencies, start by activating Conda:
$ source /opt/conda/etc/profile.d/conda.sh
and then create the Conda environment. You can do this in one of two ways:
create it manually
use an environment file
Creating the Conda environment manually
To create the Conda environment manually, start by creating it:
(base) $ conda create -n tf-gpu tensorflow-gpu
activate the Conda environment:
(base) $ conda activate tf-gpu
and install the project dependencies:
(tf-gpu) $ pip install tensorflow==2.7.0
(tf-gpu) $ pip install matplotlib
After installing all dependencies, deactivate the Conda environment:
(tf-gpu) $ conda deactivate
Creating the Conda environment using an environment file
To create the Conda environment from the environment file you should run the following command:
(base) $ conda env create -f environment.yml
This creates a Conda environment with the name and dependencies defined in the file environment.yml.
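The contents of environment.yml are not shown in this example. As a rough sketch only, a file mirroring the manual steps above (environment name tf-gpu, TensorFlow 2.7.0 and matplotlib installed with pip) could be written as below. The Python version and the defaults channel are assumptions, not taken from this guide, and the conda package tensorflow-gpu from the manual route is left out here.

```shell
#!/bin/bash
# Sketch: write a minimal environment.yml into the current directory.
# Assumptions (not from the original guide): python=3.9 and the "defaults" channel.
# Note: this overwrites any existing environment.yml in this directory.
cat > environment.yml <<'EOF'
name: tf-gpu
channels:
  - defaults
dependencies:
  - python=3.9
  - pip
  - pip:
      - tensorflow==2.7.0
      - matplotlib
EOF

# Show the result so it can be checked before running "conda env create".
cat environment.yml
```

Once the file looks right, the conda env create -f environment.yml command shown above builds the environment from it.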
2. Configure the Slurm job script
To submit the job, first create a Slurm batch script that specifies the resources to allocate to the job and the tasks to run.
The following is a Slurm job script to run this project:
#!/bin/bash
#SBATCH --job-name=cnn # create a short name for your job
#SBATCH --output="slurm-cnn-conda-%j.out" # %j will be replaced by the slurm jobID
#SBATCH --nodes=1 # node count
#SBATCH --ntasks=1 # total number of tasks across all nodes
#SBATCH --cpus-per-task=4 # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --gres=gpu:2 # number of gpus per node
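# Initialize Conda in the batch shell (this activates the base environment)
source /opt/conda/bin/activate
# Switch to the project environment
conda activate tf-gpu
# Run the training script
python3 cnn.py
# Leave the project environment once the script has finished
conda deactivate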
The script has two parts: 1) the resources needed to run the job, together with some general job information; and 2) the tasks that will be run.
The first part defines the job name, the output file and the requested resources (1 node, 1 task, 4 CPU cores and 2 GPUs). The second part defines the tasks of the job. When using Conda, these are:
Activate the Conda environment;
Execute the code;
Deactivate the Conda environment.
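When debugging a job it can help to record what Slurm actually granted before the Python code starts. The lines below are a minimal sketch that could be placed in the second part of the job script. SLURM_JOB_ID, SLURM_JOB_NODELIST and SLURM_CPUS_PER_TASK are standard Slurm environment variables. CUDA_VISIBLE_DEVICES is typically set by Slurm when GPUs are requested with --gres, but that depends on how the cluster is configured, so treat it as an assumption here.

```shell
#!/bin/bash
# Sketch: summarize the allocation at the top of the job's output file.
# Each value falls back to "unset" so the snippet also runs outside a Slurm job.
job_id="${SLURM_JOB_ID:-unset}"
nodes="${SLURM_JOB_NODELIST:-unset}"
cpus="${SLURM_CPUS_PER_TASK:-unset}"
gpus="${CUDA_VISIBLE_DEVICES:-unset}"   # assumption: exported by the cluster's GPU setup

summary="job=${job_id} nodes=${nodes} cpus_per_task=${cpus} gpus=${gpus}"
echo "$summary"
```

If gpus is reported as unset inside a job that requested --gres=gpu:2, the GPU request or the cluster configuration is the first thing to check.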
3. Submit the job
To submit the job, run the following command (script_conda.sh is the file containing the job script shown above):
$ sbatch script_conda.sh
Submitted batch job 144
You can check the job status using the following command:
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
143 batch cnn user R 0:33 1 vision2
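Submission and status checking can also be combined in a small wrapper. This is a sketch only: log_file_for_job is a hypothetical helper introduced here for illustration, and it assumes the --output pattern from the job script above. The --parsable option makes sbatch print only the job ID (possibly followed by ";" and a cluster name), and squeue --jobs limits the listing to that job.

```shell
#!/bin/bash
# Hypothetical helper: build the log file name that matches
# --output="slurm-cnn-conda-%j.out" from the job script.
log_file_for_job() {
    printf 'slurm-cnn-conda-%s.out\n' "$1"
}

if command -v sbatch >/dev/null 2>&1; then
    # --parsable prints "jobid" or "jobid;cluster"; keep only the job ID.
    job_id="$(sbatch --parsable script_conda.sh)"
    job_id="${job_id%%;*}"
    # Show the status of this job only.
    squeue --jobs "$job_id"
    echo "Output will be written to $(log_file_for_job "$job_id")"
else
    echo "sbatch not found; run this on a node where Slurm is installed"
fi
```

The log file is created in the directory the job was submitted from, so tail -f on that file shows the training output as it is written.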