The new Lilac cluster is now available to the general MSKCC research community. The system is currently composed of 1,704 CPUs and 100 NVIDIA GPUs. It runs the CentOS 7 Linux operating system with the LSF scheduler from IBM, on top of a large GPFS storage cluster for high-performance data access.

Slides from Sep 2017 UG: UG_09202017_final.pptx

Slides from June 22 2017 UG: LSF_UG_1.potx

Queues

The LILAC cluster uses LSF (Load Sharing Facility) 10.1 SP3 from IBM to schedule jobs. The default LSF queue, ‘cpuqueue’, includes all LILAC compute nodes and should be used for CPU jobs only. The ‘gpuqueue’ queue should be used for GPU jobs only.

The cluster currently has five partitions, totaling 85 nodes, 3,088 cores, 100 GPUs and 30 TB of RAM, plus 2.5 PB of usable computational storage (/data) and 626 TB of usable warm storage (/warm).

cpuqueue (default)
  Hosts: ls01-18, lg01-06, lp01-35, lt01-22
  Number of slots: 3,408
  Max CPU-threads per host: 68
  Default (thread/mem/walltime): 1/2GB/1hour
  Max walltime: 7 days
  Access: All users

gpuqueue
  Hosts: ls01-18, lg01-06, lp01-35, lt01-22, ld01-07, lv01
  Number of slots: 2,256
  Default (thread/mem/walltime): 1/2GB/1hour
  Max walltime: 7 days
  Access: All users

long
  Hosts: lg01-06, ld01-05
  Max CPU-threads per host: 72 (lg hosts), 80 (ld hosts)
  Default (thread/mem/walltime): 1/2GB/1hour
  Max walltime: 30 days
  Access: ACL Fuchs Lab

Note: some subsets of compute servers are purchased by partner PIs or individual departments. These systems are placed into the cluster under special queue configurations that give priority to the contributing group. All users can run jobs with walltime < 6 hours on the lg-gpu, lp-gpu, lt-gpu and ld-gpu host groups. All users benefit from such systems, since general jobs are allowed to run on them while they are idle or under low utilization. For details, see the rules for group-owned nodes within the LSF queuing system.
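The current limits and access controls for each queue can also be inspected directly with the standard LSF query commands, for example (queue names as listed in the table above):

# List all queues with their status, limits and job counts
bqueues

# Show the full configuration of one queue, including run limits and which users may submit to it
bqueues -l gpuqueue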

Job resource control enforcement in LSF with cgroups

LSF 10.1 makes use of Linux control groups (cgroups) to limit the CPU cores, number of GPUs and memory that a job can use. The goal is to isolate jobs from each other and prevent them from consuming all the resources on a machine. All LSF job processes are controlled by the Linux cgroup system. Jobs can only access the GPUs which have been assigned to them. If the job processes on a host use more memory than the job requested, the job will be terminated by the Linux cgroup memory sub-system.
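As an illustration, a job submitted as below (a sketch; my_executable is a placeholder) is confined by cgroups to the 4 requested slots, 1 GPU and 4 GB per slot, i.e. 16 GB in total; if its processes exceed that memory reservation, the cgroup memory sub-system terminates the job:

# 4 slots x 4 GB/slot = 16 GB total; the job is killed if it uses more
bsub -q gpuqueue -n 4 -gpu "num=1" -R "rusage[mem=4] span[hosts=1]" -W 2:00 -o %J.stdout my_executable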

LSF cluster level resources configuration (Apr 3 2018)

GPUs are consumable resources per host, not per slot. A job can request N CPUs and M GPUs per host, where N may be greater than, equal to, or less than M.

Please check the examples in Requesting Different CPU and GPU Configurations.

Memory is a consumable resource, in GB per slot (-n).
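For example (a sketch; my_executable is a placeholder), the following request reserves 8 slots and 2 GPUs on a single host, with 6 GB per slot, i.e. 48 GB in total:

# GPUs are counted per host, memory per slot: 8 slots x 6 GB = 48 GB, plus 2 GPUs on one host
bsub -q gpuqueue -n 8 -gpu "num=2" -R "rusage[mem=6] span[hosts=1]" -W 4:00 -o %J.stdout my_executable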

Job Submission 

LSF supports a variety of job submission techniques. By accurately requesting the resources you need, your jobs can be scheduled as quickly as possible on nodes that are able to run them.

A job consists of two parts: resource requests and job steps. A resource request specifies the number of CPU threads, GPUs, the wall time (expected duration), the amount of memory per CPU thread, and so on. Job steps specify the tasks to be done, the software to be run, etc.

Typically a user submits a job submission script which contains resource requests and job steps to LSF. A submission script is usually a shell (bash) script. The job will advance in the queue until it reaches the top. At this point LSF will allocate the requested cores, GPUs, and memory, and execute it.

Alternately users can submit jobs directly on the command line:

bsub -n 1 -o %J.stdout -eo %J.stderr -W 4:00 my_executable


Using the -o flag by itself (without -e/-eo) automatically combines stdout and stderr into a single file. The %J variable in the job submission is replaced with your job ID.
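For example, this variant (a sketch) writes both streams into one file:

# With -o alone, stdout and stderr are combined into %J.stdout
bsub -n 1 -o %J.stdout -W 4:00 my_executable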

More information on job submission and control

For more information on the commands to submit and manage jobs, please see the following page: Lilac LSF Job Submission

Simple Submission Script

There are default values for all batch parameters, but it is a good idea to always specify the number of threads, GPUs (if needed), memory per thread, and expected wall time for batch jobs. To minimize time spent waiting in the queue, specify the smallest wall time that will safely allow your jobs to complete.

#!/bin/bash
#BSUB -J myjobMPI
#BSUB -n 144
#BSUB -R span[ptile=72]
#BSUB -R rusage[mem=4]
#BSUB -W 04:00
#BSUB -o %J.stdout
#BSUB -eo %J.stderr

cd $LS_SUBCWD
mpirun -np 144 /my/bin/executable < my_data.in


Note that the memory requirement (-R rusage[mem=4]) is in GB (gigabytes) and is PER CORE (-n) rather than per job. A total of 576 GB of memory will be allocated for this example job.

Submission Example

Submit a batch script with the bsub command:

bsub < myjob.lsf
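After submission, the job can be monitored and controlled with the standard LSF commands (the job ID 123456 below is just an example):

bjobs            # list your pending and running jobs
bjobs -l 123456  # detailed information about one job
bpeek 123456     # view the stdout/stderr of a job while it is still running
bkill 123456     # cancel a pending or running job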

Interactive  Jobs

Interactive batch jobs provide interactive access to compute resources, for example for debugging. You can run an interactive batch job using "bsub -Is".
Here is an example command to create an interactive shell on a compute node:

bsub -n 2 -W 2:00 -q gpuqueue -gpu "num=1" -R "span[hosts=1]" -Is /bin/bash
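A CPU-only interactive shell can be requested in the same way, for example (a sketch using the default cpuqueue):

# 4 slots on a single host for 2 hours, with an interactive shell
bsub -n 4 -W 2:00 -R "span[hosts=1]" -Is /bin/bash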

GPU Jobs

LILAC GPUs offer several modes. All GPUs on LILAC are configured in EXCLUSIVE_PROCESS compute mode by default.

EXCLUSIVE_PROCESS means a GPU is assigned to only one process at a time, and individual process threads may submit concurrent work to the GPU. So only one context is allowed per GPU, which is usable from multiple threads simultaneously.

The other GPU compute mode is DEFAULT (or SHARED).

In DEFAULT mode, multiple processes can use the GPU simultaneously. Individual threads in each process may submit work to the GPU simultaneously -- multiple contexts are allowed on a GPU.
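To verify which compute mode the GPU assigned to your job is in, nvidia-smi can be queried from within the job, for example (a sketch):

# Report the name and compute mode of the GPU allocated to this job
bsub -q gpuqueue -n 1 -gpu "num=1" -I "nvidia-smi --query-gpu=name,compute_mode --format=csv"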

Nvidia Multi-Process Service (MPS) enables LILAC to support multiple contexts per GPU.

The MPS runtime architecture is designed to transparently enable cooperative multi-process CUDA applications to utilize Hyper-Q capabilities, which allow multiple CUDA kernels to be processed concurrently on a single GPU.

Without MPS, processes sharing a GPU would need to be swapped on and off a GPU, which could only hold one context at a time. The MPS server shares one set of scheduling resources between all its clients, eliminating the overhead of swapping the GPU between contexts or processes.

https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf

The new LSF syntax for submitting GPU jobs is described here:

https://www.ibm.com/support/knowledgecenter/SSWRJV_10.1.0/lsf_command_ref/bsub.gpu.1.html#reference_kqd_3kh_xz

The default GPU settings are: "num=1:mode=exclusive_process:mps=no:j_exclusive=yes"

To submit a GPU job with the default GPU settings:

bsub -q gpuqueue -gpu - 


To submit MPS jobs:

bsub -q gpuqueue -n 1 -gpu "num=1:mps=yes" 

To submit a job array of two jobs, requesting one GPU in shared mode per job:

bsub -q gpuqueue -J "shared[1-2]" -n 1 -gpu "num=1:mode=shared:mps=no:j_exclusive=yes"


The LILAC cluster will include different GPU models, as well as non-GPU nodes. Currently the LILAC nodes have GTX 1080, GTX TITAN X and TITAN X (Pascal) GPUs.

To check the available GPU model per node:

lsload -s | grep gpu_model

To submit LSF jobs and request a particular GPU type:

bsub -q gpuqueue -n 1  -gpu "num=1" -R "select[gpu_model0=='GeForceGTX1080']"

To check current free GPUs on LILAC: 

bhosts -s ngpus_physical


The LILAC cluster includes different types of nodes. Currently we have six host groups:

ls-gpu: ls01-ls18

lt-gpu: lt01-lt08

lg-gpu: lg01-lg06

lp-gpu: lp01-35

ld-gpu: ld01-07

lv-gpu: lv01

To submit an LSF job to a host group or host groups:

bsub -m ls-gpu
bsub -m "ls-gpu lt-gpu lp-gpu lg-gpu"


For more information on GPU resources, terminology and fine grain GPU control, please see the Lilac GPU Primer.

Requesting Different CPU and GPU Configurations

Warning: please use -R "span[ptile=number_of_slots_per_host]" to get the requested number of slots and the requested number of GPUs on the same host; otherwise LSF may distribute the job across multiple hosts.

N CPUs and M GPUs on the same node (ls, lp, lt and lg nodes have 72 core-threads and 4 GPUs; ld nodes have 80 core-threads and 8 DGX GPUs):

bsub -q gpuqueue -n N -gpu "num=M" -R "span[ptile=N]"

One whole 72-thread node and 4 GPUs:

bsub -q gpuqueue -n 72 -gpu "num=4" -R "span[ptile=72] rusage[mem=6]"

One whole 80-thread node and 8 DGX GPUs:

bsub -q gpuqueue -n 80 -gpu "num=8" -R V100 -R "span[ptile=80] rusage[mem=6]"


N CPUs and N GPUs, using 1 CPU and 1 GPU per node:

bsub -q gpuqueue -n N -gpu "num=1" -R "span[ptile=1]"

N CPUs and N GPUs, using 2 CPUs and 2 GPUs per node:

bsub -q gpuqueue -n N -gpu "num=2" -R "span[ptile=2]"
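The same kind of request can also be written as a submission script. A sketch for 12 CPU threads and 2 GPUs on one node (/my/bin/executable is a placeholder):

#!/bin/bash
#BSUB -J myGPUjob
#BSUB -q gpuqueue
#BSUB -n 12
#BSUB -gpu "num=2"
#BSUB -R "span[ptile=12] rusage[mem=4]"
#BSUB -W 04:00
#BSUB -o %J.stdout
#BSUB -eo %J.stderr

cd $LS_SUBCWD
/my/bin/executable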


Parallel Jobs

LSF uses the blaunch framework (aka hydra) to allocate GPUs on execution nodes. The major MPI implementations integrate with blaunch.

bsub -q gpuqueue -I -n 4 -gpu "num=1" -R "span[ptile=2]" blaunch 'hostname; echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES; my_executable'
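An equivalent submission script for such a parallel GPU job might look like this (a sketch; /my/bin/executable is a placeholder):

#!/bin/bash
#BSUB -J myParallelGPU
#BSUB -q gpuqueue
#BSUB -n 4
#BSUB -gpu "num=1"
#BSUB -R "span[ptile=2]"
#BSUB -W 04:00
#BSUB -o %J.stdout
#BSUB -eo %J.stderr

cd $LS_SUBCWD
# blaunch starts one task per allocated slot on the allocated hosts
blaunch 'hostname; echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES; /my/bin/executable'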


Job options and cookbook

For the set of common bsub flags and more examples in cookbook form, please see: LSF Resource Cookbook