Monitoring and Optimisation

Before you submit a batch of jobs, you have to optimise the resource request. This can be difficult, but tools like myjobs can help you determine the right resources. If resource usage is significantly higher than 100% of what was requested, the job may be killed because not enough resources are free on the node. If the usage is significantly lower (less than 50%), you should adjust the job submission; otherwise CPUs sit idle or memory is wasted.

The Slurm Job WebGUI is a very useful tool to get an overview of recently finished jobs. In this section we will have a look at a specific submission script that we want to optimise.

If you waste too many resources, the Cluster Support will contact you. If this happens more than twice, we will put you in ‘sustained mode’ until you comply with our rules.

Tools to monitor
Running jobs: myjobs
Finished jobs: Slurm Job WebGUI or reportseff --user $USER --since d=1 for your jobs from the last 24 hours.
Summary over a week: get_inefficient_jobs
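For quick reference, the typical calls are shown below. The module environment is the one used later in this section; the myjobs option is an assumption, so check myjobs -h for the exact syntax.

    # running job: real-time figures (option assumed, see myjobs -h)
    myjobs -j <JOB-ID>

    # finished jobs: summary of your jobs from the last 24 hours
    source /cluster/project/gdc/shared/stack/GDCstack.sh
    module load reportseff/2.7.6
    reportseff --user $USER --since d=1

    # weekly overview of inefficient jobs (in-house script)
    get_inefficient_jobs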

myjobs

Real-time figures (myjobs) of a running job where the requested resources are optimally used. For example, the job with ID 24765510 in the normal.24h queue requested 12 CPUs (Requested cores) and 98.8% of them are used (CPU utilisation), suggesting that the requested CPUs are used efficiently. The same holds for the requested memory (2 GB per core, or 24 GB in total), of which 89.2% is currently used (Resident memory utilisation). Furthermore, the CPU time is roughly 12 times the wall-clock time (Total CPU time ≈ Requested CPUs × Wall-clock), which means that all 12 CPUs are currently in use.
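If you want to cross-check these figures yourself, the same information can be pulled from the Slurm accounting database with sacct (job ID from the example above). This is only a minimal sketch, not a replacement for myjobs.

    # TotalCPU should be close to AllocCPUS x Elapsed for a well-used job,
    # and MaxRSS should be close to (but below) the requested memory
    sacct -j 24765510 --format=JobID,Elapsed,TotalCPU,AllocCPUS,ReqMem,MaxRSS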

For some tools resource usage can fluctuate over time, while for others it is extremely difficult to predict (e.g. BLAST, Trinity). Please avoid complex pipelines with completely different resource requirements within a single run (see below). Always keep in mind that the requested resources, not the actually used ones, are what counts, and that the submission priority drops for all GDC users if CPUs or memory are wasted.
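If a workflow really consists of stages with very different requirements (for example a memory-hungry assembly followed by light post-processing), one way to avoid a single oversized request is to submit each stage as its own job and chain them with Slurm dependencies. The script names and resource figures below are placeholders.

    # each stage gets its own resource request; stage 2 only starts if stage 1 succeeds
    jid=$(sbatch --parsable assembly.sh)              # e.g. 1 CPU, 64 GB
    sbatch --dependency=afterok:${jid} postproc.sh    # e.g. 4 CPUs, 4 GB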

myjobs is your best friend, unless you use a lot of cache memory (Nik Zemp).

For tools that use a lot of cache memory (e.g. BWA, BBMap, fastp, GATK), the effectively used memory can be much lower than what Slurm reports. If you would like to submit a batch of such jobs, it always makes sense to reduce the requested memory by 20-50% and check whether it is still sufficient.
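You do not have to edit the script for such a test: options passed on the sbatch command line override the #SBATCH directives, so a reduced memory request can be tried directly. The script name and values below are illustrative.

    # the script asks for 2G per core; try roughly 30% less from the command line
    sbatch --mem-per-cpu=1500 map_reads.sh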

Optimisation takes time, as you need to monitor the job regularly using myjobs. Please be very careful when optimising jobs overnight or on weekends: you need to be able to kill jobs in a timely manner if they misbehave. The following example illustrates how you can find optimal CPU, RAM and run-time requirements.
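Killing a misbehaving job is a single scancel call; for a job array you can cancel one task or the whole array (job ID from the example above).

    scancel 24765510_15    # cancel a single array task
    scancel 24765510       # cancel the whole job array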

We would like to do SNP calling and have split the reference genome into 120 chunks. Now we need to find out how many CPUs and how much memory are needed for each of the 120 jobs.
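Inside the submission script, each array task needs to know which of the 120 chunks it should process. A common pattern is to derive the input from $SLURM_ARRAY_TASK_ID; the chunk file naming and the freebayes call below are only a sketch, not the actual pipeline.

    # sketch of the job body: map the array index to one chunk of the reference
    CHUNK=chunk_${SLURM_ARRAY_TASK_ID}.bed    # hypothetical chunk naming
    freebayes -f reference.fa -t ${CHUNK} aln.bam > snps_chunk_${SLURM_ARRAY_TASK_ID}.vcf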

  1. Run a test with 5 representative chunks (do not just take the first 5) with 4 CPUs, 4 × 0.5 GB RAM and 24 hours run time.

    #SBATCH --job-name=fb                #Name of the job   
    #SBATCH --array=1,10,15,20,50%15   
    #SBATCH --ntasks=1                   #Requesting 1 task (always 1 here)
    #SBATCH --cpus-per-task=4            #Requesting 4 CPUs
    #SBATCH --mem-per-cpu=500            #Requesting 0.5 GB memory per core, 2 GB in total
    #SBATCH --time=24:00:00              #Requesting 24 hours run time
    
  2. The jobs get killed after 2 minutes because they exceed the memory limit. Let's increase the requested memory to 4 × 2 GB.

    #SBATCH --job-name=fb               #Name of the job   
    #SBATCH --array=1,10,15,20,50%15   
    #SBATCH --ntasks=1                  #Requesting 1 task (always 1 here)
    #SBATCH --cpus-per-task=4           #Requesting 4 CPUs
    #SBATCH --mem-per-cpu=2G            #Requesting 2 GB memory per core, 8 GB in total
    #SBATCH --time=24:00:00             #Requesting 24 hours run time     
    
  3. With myjobs you get real-time resource figures; use it regularly during the run. If the resource usage is completely off, just kill the job and restart it with more (or fewer) resources.

  4. Slurm itself does not print a summary of the resources used, but you can run reportseff <JOB-ID> to get one for finished jobs.

    source /cluster/project/gdc/shared/stack/GDCstack.sh
    module load reportseff/2.7.6
    reportseff  24765510
    
    JobID        State      Elapsed   TimeEff  CPUEff  MemEff
    24765510_1   COMPLETED  00:49:10  0.09%    30%     43%
    24765510_10  COMPLETED  00:49:10  0.09%    31%     41%
    24765510_15  COMPLETED  00:49:10  0.10%    35%     42%
    24765510_20  COMPLETED  00:58:10  0.10%    29%     32%
    24765510_50  COMPLETED  00:59:00  0.10%    32%     45%

    In this example the run time is around 1 hour, the memory usage is around 40%, and the CPU efficiency is around 30%, i.e. only about 1 of the 4 requested CPUs was used.

  5. Based on the test run we can now choose the settings for all 120 jobs (1 CPU, 1 × 3 GB RAM and 4 hours run time).

    #SBATCH --job-name=freebayes        #Name of the job   
    #SBATCH --array=1-120%15   
    #SBATCH --ntasks=1                  #Requesting 1 task (always 1 here)
    #SBATCH --cpus-per-task=1           #Requesting 1 CPU
    #SBATCH --mem-per-cpu=3G            #Requesting 3 GB memory per core
    #SBATCH --time=4:00:00              #Requesting 4 hours run time
    
  6. After all jobs have finished you can check which jobs failed and rerun them with 1 × 5 GB memory (one way to find the failed tasks is shown in the sketch after this list).

    #SBATCH --job-name=freebayes            #Name of the job   
    #SBATCH --array=11,99%10
    #SBATCH --ntasks=1                      #Requesting 1 task (always 1 here)
    #SBATCH --cpus-per-task=1               #Requesting 1 CPU
    #SBATCH --mem-per-cpu=5G                #Requesting 5 GB memory per core
    #SBATCH --time=4:00:00                  #Requesting 4 hours run time
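One way to find the failed tasks is to filter the array's accounting records by state and then resubmit only those indices. The grep pattern covers the usual failure states; the script name and the index list are illustrative, and the command-line options again override the #SBATCH directives.

    # list array tasks that did not finish successfully
    sacct -j <JOB-ID> --format=JobID,State%20,Elapsed,MaxRSS | grep -E 'FAILED|OUT_OF_MEMORY|TIMEOUT'

    # resubmit only those indices with more memory
    sbatch --array=11,99 --mem-per-cpu=5G freebayes.sh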