Job Control

As the cluster is very large and jobs can get lost or killed, it is important to check carefully that all jobs have completed successfully. The slurm output file alone does not contain much information about the job.

Learning Objectives

◇ Knowing how to code reproducible slurm scripts.
◇ Knowing how to add checkpoints to scripts.

Counting the output files and sorting them by size is also useful, but even more important is looking at the standard output or log file (e.g. slurm*.out).

  1. Examine the log file carefully, as each tool may report errors differently (e.g. grep for error, kill, Error, Killed, exit, Exited, ...).
  2. Echo statements in the submit script serve as checkpoints for your job control; you can grep for these terms afterwards.
#!/bin/bash
#SBATCH --job-name=bcf        #Name of the job
#SBATCH --ntasks=1            #Requesting 1 task (always 1 here)
#SBATCH --cpus-per-task=1     #Requesting 1 CPU
#SBATCH --mem-per-cpu=1G      #Requesting 1 GB memory per core
#SBATCH --time=4:00:00        #Requesting 4 hours running time
#SBATCH --output=bcf.log      #Log file


##########################################################################################
echo "Nik Zemp, GDC, 02/01/24"
echo "$(date) start ${SLURM_JOB_ID}"
##########################################################################################

#Load the needed modules
module load bcftools/1.16

#define in and outputs
out=SNPs
mkdir -p ${out}   #create the output directory if it does not exist
Ref=Ref/Ref.fasta

echo "The bcftools command"
bcftools mpileup -f ${Ref} --skip-indels -b bam.lst -a 'FORMAT/AD,FORMAT/DP' \
    | bcftools call -mv -Ob -o ${out}/raw.bcf

##########################################################################################
##job control: grep "JobID:" *.log##
echo "JobID: ${SLURM_JOB_ID} successfully finished at $(date)"
myjobs -j ${SLURM_JOB_ID}
##########################################################################################

Now you can grep for "JobID:".

The slurm webGUI can also be used to search for failed jobs.

reportseff is a Python tool that summarizes whether your jobs completed and how efficiently they used the requested resources. If you are not part of the GDC share, you will need to install the tool first.
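If the module is unavailable, reportseff can be installed from PyPI into your user environment (assuming pip is available on the cluster):

```shell
# Install reportseff for the current user only (no admin rights needed).
pip install --user reportseff
```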

The following lines give you a summary of all jobs from the last 6 hours.

module load reportseff/2.7.6
reportseff --user $USER --since h=6

Use your mapping script and make it as reproducible as possible by adding comments and echo statements.

What strings would you look for to check that the job was completed successfully?