...

Slides from June 22 2017 UG: LSF_UG_1.potx

Queues

The LILAC cluster uses LSF (Load Sharing Facility) 10.1 FP8 from IBM to schedule jobs. The default LSF queue, 'cpuqueue', includes a subset of the LILAC compute nodes and should be used for CPU jobs only. The 'gpuqueue' queue should be used for GPU jobs only.
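
For example (the executable names below are placeholders), a CPU-only job and a GPU job would target the two queues as follows:

Code Block
# CPU-only job on the default queue
bsub -q cpuqueue -n 1 ./my_cpu_job
# GPU job on the GPU queue, requesting one GPU
bsub -q gpuqueue -n 1 -gpu "num=1" ./my_gpu_job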

...

lw-gpu: lw01-02 (8x GeForce RTX 2080 Ti GPUs)


Job resource control enforcement in LSF with cgroups

LSF 10.1 uses Linux control groups (cgroups) to limit the CPU cores, GPUs, and memory that a job can use. The goal is to isolate jobs from each other and prevent any one job from consuming all the resources on a machine. All LSF job processes are controlled by the Linux cgroup system. Jobs can only access the GPUs that have been assigned to them. If a job's processes on a host use more memory than the job requested, the job will be terminated by the Linux cgroup memory subsystem.
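
As a minimal illustration (the cgroup path below is an assumption; the exact layout depends on the Linux distribution and the LSF configuration), you can inspect the memory limit that the cgroup imposes on a running job from its execution host:

Code Block
# Hypothetical example: locate the memory cgroup LSF created for a given job ID
# and print its limit in bytes. Adjust the path to match your cluster's layout.
JOBID=123456
cat /sys/fs/cgroup/memory/lsf/*/job.${JOBID}*/memory.limit_in_bytes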

LSF cluster level resources configuration (Apr 3 2018)

GPUs are consumable resources per host, not per slot. A job can request N CPUs and M GPUs per host, where N may be greater than, equal to, or less than M.
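
For illustration (./my_job is a placeholder executable), each of the three cases can be requested like this:

Code Block
# N > M: 4 CPU slots and 2 GPUs on one host
bsub -q gpuqueue -n 4 -gpu "num=2" -R "span[hosts=1]" ./my_job
# N = M: 2 CPU slots and 2 GPUs on one host
bsub -q gpuqueue -n 2 -gpu "num=2" -R "span[hosts=1]" ./my_job
# N < M: 1 CPU slot and 2 GPUs on one host
bsub -q gpuqueue -n 1 -gpu "num=2" -R "span[hosts=1]" ./my_job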

...

Memory is a consumable resource, specified in GB per slot (-n).
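
For example (a sketch using placeholder values), a job that requests 4 slots with 8 GB per slot is allocated 32 GB in total:

Code Block
# 4 slots x 8 GB per slot = 32 GB total for the job
bsub -n 4 -R "rusage[mem=8]" ./my_job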

Job Submission 

LSF supports a variety of job submission techniques. By accurately requesting the resources you need, your jobs can start as quickly as possible on available nodes that can run them.

...

Code Block
bsub -n 1 -R "fscratch" …


More information on job submission and control

For more information on the commands to submit and manage jobs, please see the following page: Lilac LSF Commands

Simple Submission Script

There are default values for all batch parameters, but it is a good idea to always specify the number of threads, GPUs (if needed), memory per thread, and expected wall time for batch jobs. To minimize time spent waiting in the queue, specify the smallest wall time that will safely allow your jobs to complete.

...


Note that the memory requirement (-R rusage[mem=4]) is in GB (gigabytes) and is PER CORE (-n) rather than per job. A total of 576 GB of memory will be allocated for this example job.

Submission Example

Submit a batch script with the bsub command:

Code Block
bsub < myjob.lsf

Interactive Jobs

Interactive batch jobs provide interactive access to compute resources, for example for debugging. You can run a batch-interactive job with "bsub -Is".
Here is an example command that creates an interactive shell on a compute node:

Code Block
bsub -n 2 -W 2:00 -q gpuqueue -gpu "num=1" -R "span[hosts=1]" -Is /bin/bash

GPU Jobs

LILAC GPUs support several compute modes. All GPUs on LILAC are configured in the EXCLUSIVE_PROCESS compute mode by default.
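
To verify the compute mode of the GPUs assigned to a job, a standard nvidia-smi query (not LSF-specific) can be run from inside the job:

Code Block
# List index, name, and compute mode of the GPUs visible to this job
nvidia-smi --query-gpu=index,name,compute_mode --format=csv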

...


For more information on GPU resources, terminology, and fine-grained GPU control, please see the Lilac GPU Primer.

Requesting Different CPU and GPU Configurations

Warning: Please use -R "span[ptile=number_of_slots_per_host]" to get the requested number of slots and the requested number of GPUs on the same host; otherwise LSF may distribute the job across multiple hosts.

...

Code Block
bsub -q gpuqueue -n N -gpu "num=2" -R "span[ptile=2]"
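
As a concrete instance of the pattern above (with N=4 and a placeholder executable), the following keeps 4 slots and 2 GPUs together on a single host by setting ptile to the slot count, as the warning recommends:

Code Block
# 4 CPU slots and 2 GPUs, all on the same host
bsub -q gpuqueue -n 4 -gpu "num=2" -R "span[ptile=4]" ./my_job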


Parallel Jobs

LSF uses the blaunch framework (also known as hydra) to launch tasks and allocate GPUs on execution nodes. The major MPI implementations integrate with blaunch.

Code Block
bsub -q gpuqueue -I -n 4 -gpu "num=1" -R "span[ptile=2]" blaunch 'hostname; echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES; my_executable'


Job options and cookbook

For a set of common bsub flags and a cookbook of additional examples, please see: LSF bsub flags

...