LSF scheduler information specific to the luna cluster
By default all jobs are scheduled on the sol queue. Jobs are scheduled based on their resource requests, estimated runtime, and user priorities. You can also specify the test queue, at your own risk.
LSF host groups:
- largeHG: t01-t02
- internetHG: s01-s03
- testHG: u34-u35
LSF host partitions
Long vs. Short Jobs
- If a job specifies an estimated run time, -We, or a hard run time, -W, of 59 minutes or less, it is considered ‘short’.
- To run a short job you must specify -We HOURS:MINUTES or -W HOURS:MINUTES.
- Short jobs can run on all hosts. They have a hard run limit of 2 times the requested run time.
- All jobs that do not request a run time of 59 minutes or less are considered ‘long’.
- Long jobs can run on 30% of commonHG hosts, and on 50% of largeHG hosts if they request enough memory to be eligible.
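The short/long rule above can be sketched as a small shell helper. This is only an illustration: the function name `classify_runtime` is hypothetical, and it assumes the runtime is written as HOURS:MINUTES or as plain minutes.

```shell
# Classify a runtime request as "short" or "long" under the 59-minute rule.
# Accepts HOURS:MINUTES or plain MINUTES; an empty argument means no runtime was requested.
classify_runtime() {
    req="$1"
    if [ -z "$req" ]; then
        echo long            # no runtime requested: the job is long
        return
    fi
    case "$req" in
        *:*) hours="${req%%:*}"; minutes="${req##*:}" ;;
        *)   hours=0; minutes="$req" ;;
    esac
    # expr reads operands as decimal, so values such as "08" are handled correctly.
    total=$(expr "${hours:-0}" \* 60 + "${minutes:-0}")
    if [ "$total" -le 59 ]; then
        echo short
    else
        echo long
    fi
}

classify_runtime 0:30
classify_runtime 2:00
```

A wrapper script could call this before submission to warn that a job will be treated as long.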
Large memory hosts
largeHG hosts are for jobs that need a lot of memory.
- Short jobs, and long jobs that request more than 376 GB of memory, can run on largeHG hosts.
- There is no
- Request the total amount of memory your job will use, not the fraction per job slot or thread.
internetHG hosts are for jobs that need internet access.
- Short jobs, as well as long jobs that request internet, can run on internetHG hosts.
- To request an internet host use -R "select[internet]".
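The access rules for the large-memory and internet hosts can be written as two predicates. This is a sketch under assumptions: the function names are hypothetical, memory is given in whole GB, and the internet rule is taken to refer to the internetHG group.

```shell
# Succeed (exit status 0) if a job may run on largeHG hosts:
# short jobs always may; long jobs only when they request more than 376 GB.
can_use_largeHG() {
    class="$1"; mem_gb="$2"
    [ "$class" = "short" ] || [ "$mem_gb" -gt 376 ]
}

# Succeed if a job may run on internetHG hosts (assumed group for the internet rule):
# short jobs always may; long jobs only when they request internet.
can_use_internetHG() {
    class="$1"; internet="$2"
    [ "$class" = "short" ] || [ "$internet" = "yes" ]
}

if can_use_largeHG long 400; then
    echo "long job requesting 400 GB: eligible for largeHG"
fi
```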
Service Level Agreement Guarantees
- If you belong to an SLA group, you can specify it in your bsub command, as in bsub -sla Pipeline, to be guaranteed a certain amount of the cluster (so long as that portion isn’t in use by anyone else in the same SLA group).
- All SLAs have a loan policy, which lets anyone’s short jobs use idle hosts.
- Guaranteed resources are assigned by full hosts, not CPUs/slots. If a job with -sla Pipeline is using 1 CPU/slot on a host, the entire host is reserved for Pipeline until Pipeline reaches its SLA.
- Current SLA breakdown:
  - Pipeline gets 40% of commonHG, 50% of internetHG, and 50% of largeHG.
  - Haystack gets 10% of commonHG.
  - Short (short jobs auto attach to this) gets 20% of commonHG if there are no priority jobs.
- The remaining 30% of commonHG and 50% of largeHG is unallocated and available for long jobs.
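As a quick arithmetic check of the breakdown, the unallocated shares follow from subtracting the SLA guarantees from 100%. The variable names are illustrative, and the largeHG figure assumes Pipeline's third 50% share refers to largeHG.

```shell
# SLA shares in percent, taken from the breakdown above.
pipeline_common=40
haystack_common=10
short_common=20
pipeline_large=50   # assumption: Pipeline's remaining 50% share is of largeHG

common_unallocated=$(( 100 - pipeline_common - haystack_common - short_common ))
large_unallocated=$(( 100 - pipeline_large ))

echo "commonHG unallocated: ${common_unallocated}%"
echo "largeHG unallocated: ${large_unallocated}%"
```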
Current defaults for jobs:
- Soft memory limits, -R "rusage[mem=GB]", and hard memory limits, -M GB, should be requested.
- If none are requested, the default for soft is 8 GB and for hard is 16 GB.
- If hard is requested but soft is not, soft = hard.
- -R "span[hosts=1]": Jobs that request multiple processors span a single host.
- -R "rusage[iounits=1]": The maximum iounits per host is 10. IOUNITS are an arbitrary measure of the amount of reading/writing that the job incurs.
- -We specifies the expected or “soft” runtime. -W specifies the hard runtime. Jobs are considered long if neither -We nor -W is specified. Anything less than 60 minutes is considered a ‘short’ job, which can run on all nodes.
- If soft runtime (-We hour:minute) is set and hard (-W hour:minute) is not, hard runtime = 2x soft runtime.
- If hard runtime is set, soft runtime does not need to be.
- Long jobs are restricted to 30% of commonHG, or 50% of largeHG if you request more than 376gb of memory.
- stdout normally goes to the -o file. To redirect it you must add quotes around the command to execute inside the bsub command. For example:
bsub -We 1 -J jobName -o output_file.txt "ls -al 1> redirect_file.txt"
- bsub -w is the wait option, as in
bsub -w "post_done($PREV_JOBNAME)"
- Auto-emailing is turned off, but can be enabled in the
- Use post_done to hold jobs, instead of done, which may start too quickly. If holding on multiple jobs with very similar names, -w "post_done($PREV_JOBNAME*)" should work, unless you have one. This will only let the job run if the $PREV_JOBNAME job completed with exit status 0 and completed its post-execution processing.
bsub sleep 30
This submits a basic sleep job (sleeps for 30 seconds).
bsub -J jobname -We 0:30 -R "select[internet]" myjob
Submits a job with job name “jobname” with an estimated runtime of 30 minutes, selecting for hosts with internet.
bsub -m commonHG -R "rusage[mem=20]" myjob
Submits jobs only to hosts in host group commonHG, with 20 GB of memory requested.
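To illustrate the -w wait option described above, here is a sketch that chains two jobs by name. The job names, script names, and the helper `post_done_expr` are hypothetical; the commands are printed rather than executed so the snippet runs without LSF.

```shell
# Build the dependency expression used with bsub -w from a job name.
# Pass a name ending in "*" to match several similarly named jobs.
post_done_expr() {
    printf 'post_done(%s)' "$1"
}

PREV_JOBNAME=align   # hypothetical name of the first job

# Print the two submissions; remove the echo wrappers to submit for real.
echo "bsub -J $PREV_JOBNAME -We 0:30 \"./step1.sh\""
echo "bsub -J report -We 0:10 -w \"$(post_done_expr "$PREV_JOBNAME")\" \"./step2.sh\""
```

Quoting the whole expression keeps the shell from interpreting the parentheses and any trailing `*`.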
How to send an email at the end of a job:
First, run `export LSB_JOB_REPORT_MAIL=Y` in the terminal from which you are going to submit the job.
Then submit with bsub -u <email@example.com> -N.
The -N option means the job output is emailed at the end of the job (people usually write it to a file using -o). This is what the e-mail will look like:
    Job was submitted from host by user in cluster .
    Job was executed on host(s) , in queue , as user in cluster .
    was used as the home directory.
    was used as the working directory.
    Started at Tue May 24 11:14:37 2016
    Results reported on Tue May 24 11:14:50 2016

    Your job looked like:
    ------------------------------------------------------------
    # LSBATCH: User input
    sleep 13
    ------------------------------------------------------------

    Successfully completed.

    Resource usage summary:
    CPU time : 0.07 sec.
    Total Requested Memory : -
    Delta Memory : -
    Run time : 13 sec.
    Turnaround time : 14 sec.

    The output (if any) follows:
NOTE before using this: If you have LSB_JOB_REPORT_MAIL=Y exported and do not use -u or -N (and you don’t have -o or -oo), a message is delivered to /var/mail/username, and only on the host that you ran the job on. To change it back, just `export LSB_JOB_REPORT_MAIL=N` after you are done. If users don’t, the /var/mail directories on the hosts could fill up with junk.
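One way to avoid forgetting the reset is to wrap the three steps in a function. This is a minimal sketch, not a site-provided tool: the name `bsub_mail` and the `BSUB` override variable are hypothetical, and setting `BSUB=echo` gives a dry run on a machine without LSF.

```shell
# Command used to submit; override with BSUB=echo for a dry run.
BSUB="${BSUB:-bsub}"

# Submit a job with an email report, then switch LSB_JOB_REPORT_MAIL back to N.
# Usage: bsub_mail EMAIL [bsub options...] COMMAND
bsub_mail() {
    email="$1"; shift
    export LSB_JOB_REPORT_MAIL=Y
    "$BSUB" -u "$email" -N "$@"
    status=$?
    export LSB_JOB_REPORT_MAIL=N   # reset so later jobs do not write to /var/mail
    return "$status"
}
```

For example, `bsub_mail user@example.com -o out.txt sleep 13` would submit the sleep job, write its output to out.txt, and request the email report.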
Memory Request Rules:
- Both soft memory limits, -R "rusage[mem=GB]", and hard memory limits, -M GB, should be requested.
- If none are requested, the default for soft is 8 GB and for hard is 16 GB.
- If hard is requested but soft is not, soft = hard.
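The memory rules above amount to a small decision table, sketched here as a shell function. The name `resolve_mem_limits` and the "-" placeholder for an unrequested limit are illustrative conventions; the case where only the soft limit is requested is not covered by the rules, so the sketch reports it as unspecified rather than guessing.

```shell
# Print the effective "SOFT HARD" memory limits in GB.
# Pass "-" for a limit that was not requested.
resolve_mem_limits() {
    soft="$1"; hard="$2"
    if [ "$soft" = "-" ] && [ "$hard" = "-" ]; then
        soft=8; hard=16          # neither requested: documented defaults
    elif [ "$soft" = "-" ]; then
        soft="$hard"             # hard only: soft = hard
    elif [ "$hard" = "-" ]; then
        hard="unspecified"       # soft only: not covered by the rules above
    fi
    echo "$soft $hard"
}

resolve_mem_limits - -
resolve_mem_limits - 32
```

Requesting both explicitly, for example bsub -R "rusage[mem=20]" -M 40 myjob, avoids relying on the defaults.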
Runtime Request Rules for short jobs:
- If soft runtime (-We hour:minute) is set and hard (-W hour:minute) is not, hard runtime = 2x soft runtime.
- If hard runtime is set, soft runtime does not need to be.
- There is no hard runtime for long jobs. A job is considered long if there is no runtime specified.
- If the soft memory limit is below the small-host threshold (376 GB), the job is long (60 minutes or more), and internet is not requested, the job will only be submitted to commonHG.
- Queue test is exempt from these rules.
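For short jobs that set only -We, the implied hard limit is twice the soft runtime. The helper below works that out; the name `implied_hard_runtime` is hypothetical, and it assumes the soft runtime is given as HOURS:MINUTES or plain minutes.

```shell
# Given a soft runtime (-We), print the implied hard limit (2x) as H:MM.
implied_hard_runtime() {
    req="$1"
    case "$req" in
        *:*) hours="${req%%:*}"; minutes="${req##*:}" ;;
        *)   hours=0; minutes="$req" ;;
    esac
    # expr reads operands as decimal, so "08" is not mistaken for octal.
    total=$(expr "${hours:-0}" \* 60 + "${minutes:-0}")
    doubled=$(( total * 2 ))
    printf '%d:%02d\n' $(( doubled / 60 )) $(( doubled % 60 ))
}

implied_hard_runtime 0:30
```

A job submitted with -We 0:30 and no -W would therefore be stopped after one hour at the latest.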