Cluster Jobs Submission Guide

PBS (Torque) Cluster Scripts

Submitting Jobs

To submit jobs to the cluster, you can write a script (named script.sh for example) and submit it with the command:
qsub script.sh

The top portion of the script contains PBS directives (lines beginning with #PBS) that describe the job's parameters, and the bottom portion lists the commands to run.

Commands may also be piped into the Standard Input of qsub.
echo "echo Hello World!" | qsub

Most Basic Example Cluster Script

echo Hello World!

You will probably notice that this example is practically empty and nothing special. That is intentional: every essential PBS parameter has a default value, so existing scripts that can run with the defaults need no modification. The following defaults apply (a sample submission is sketched after this list):
  • The job's name will be the same as the filename of the script.
  • The job will use the default queue.
  • The job will request 1 core with 1993MB of memory and 72 hrs of walltime.
  • The job will only send an email if the job aborts (an error). The email will be sent to the email address specified in the user's ~/.forward file.
  • The job will create separate output files for standard output and standard error.
  • The job's output filenames will match the job name, appended with .o(jobid) and .e(jobid).
  • The job will use the user's standard login shell.
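
For instance, if the one-line script above were saved as hello.sh (a hypothetical filename), submitting it and collecting its output under the default settings would look roughly like this:

 qsub hello.sh
 # After the job completes, the output files are copied back to the
 # submission directory as hello.sh.o<jobid> and hello.sh.e<jobid>.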


Basic Example Cluster Script

#PBS -N JobName
#PBS -l nodes=1:ppn=1,mem=2000m,walltime=24:00:00
#PBS -M user1@ittc.ku.edu,user2@ittc.ku.edu
#PBS -m abe

echo Hello World!

This cluster job has name JobName, and it requests 1 core, 2000MB of memory, and 24 hrs of time. Emails will be sent to user1@ittc.ku.edu and user2@ittc.ku.edu (example addresses) when the job begins, ends, or aborts.

A Useful Example Cluster Script

#PBS -N JobName
#PBS -l nodes=1:ppn=1,mem=2000m,walltime=24:00:00
#PBS -M user@ittc.ku.edu
#PBS -m abe
#PBS -S /bin/bash
#PBS -d /users/user/working_directory
#PBS -e /users/user/logs/JobName.err
#PBS -o /users/user/logs/JobName.out

echo Hello World!

In this example, the shell "bash" is specified along with locations for logging the job's output and errors. The commands in the script are run in the directory specified by the "-d" flag. The locations specified may be absolute paths or paths relative to the current directory from which the job is submitted.

It is important to note that every #PBS option has a corresponding command line option to "qsub". A full list of options can be found in the documentation for the "qsub" command at http://docs.adaptivecomputing.com/torque/5-1-1/help.htm#topics/torque/commands/qsub.htm
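
For example, the "Basic Example" script above could be submitted with the same settings supplied on the command line instead of inside the script (a sketch using the example values from above):

 qsub -N JobName -l nodes=1:ppn=1,mem=2000m,walltime=24:00:00 -M user1@ittc.ku.edu -m abe script.sh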

Some different types of job options are listed by topic below.

Cluster Script Topics

Queues

Queues are specified using the "-q" option to qsub or configured in your PBS script. For example, a job submitted to the "long" queue would have:

#PBS -q long

The ITTC cluster has the following queues:

  • default
The default queue is used when you do not specify a queue. It has a maximum walltime of 1 week; jobs in this queue that do not specify a walltime default to 3 days.
  • long
The long queue has a minimum walltime of 1 week and no maximum walltime. It is up to users to set an acceptable maximum duration for their jobs to complete.
  • bigm
The bigm queue is used to access the cluster "large memory" nodes. These nodes are a special resource to be used only when applications require large amounts of memory on a single node.
  • gpu
The gpu queue is used to access cluster nodes with NVIDIA GPUs. CUDA or OpenCL application support is required to make use of GPU computing resources.
  • phi
The phi queue is used to access cluster nodes having Intel Xeon Phi co-processors. To use the Xeon Phi co-processor, applications must be compiled with the Intel compiler using Xeon Phi offloading support or Xeon Phi native code support.
  • interactive
The interactive queue is used for graphical interfaces and remote desktop sessions on the cluster and for testing, debugging, and profiling cluster jobs. The "-I" flag (capital "i") to qsub must also be specified to start an interactive job (see the example after this list). Jobs in the interactive queue have a walltime limit of 1 day.
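
For example, a short interactive session might be requested from a login node as follows (the resource values here are only illustrative):

 qsub -I -q interactive -l nodes=1:ppn=2,mem=4000m,walltime=02:00:00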

Email Options

Some of the most important options are the ones controlling email notifications.
First, managing batches of jobs may require knowing when each job completes and how long it takes. The email sent when a job ends reports how much memory was used and how long the job ran; you can use this information to request exactly the right amount of memory and walltime, which improves scheduler behavior.
Second, very large batches of jobs may generate a lot of email that you do not want. You may want to disable emails completely when submitting a large batch of jobs.
Lastly, the email generated when a job aborts explains why the job ended. Cluster jobs are killed when they exceed the amount of memory or walltime requested.

The "-M" flag lets you set a comma-separated list of email addresses to be notified about the cluster jobs.

#PBS -M email_address,email_address2,email_address3


The "-m" flag accepts an argument of combinations of "a", "b", "e", or "n". These letter stand for Abort, Begin, End, and Never respectively. These options tell the cluster scheduler when to send an email to the recipients named in the "-M" option.

#PBS -m aben

The default option is to send emails only when the job is aborted. The most common option is to email when jobs abort or end.

#PBS -m ae

When submitting large batches of jobs (several thousand), you should have some other method of logging errors and should disable email notifications, including those for aborted jobs. Emails can be disabled on the command line as in "qsub -m n" or in the PBS script as

#PBS -m n

File Output

A cluster job produces text output to file descriptors STDOUT and STDERR. The text is held locally on the node until the job completes and the files are copied back to the location the job was submitted from. Several options are available to control the text output files and their locations.
Because the text is not copied back until the job completes, monitoring the output of a running cluster job requires output re-direction; likewise, discarding unwanted output requires output re-direction.

The job's standard output file location can be set by the "-o" flag. The supplied path can be an absolute path or a path relative to the current working directory at the time the job is submitted.

#PBS -o logs/output.log

The job's standard error file location is set with "-e". Once again the path may be a relative or absolute path.

#PBS -e logs/error.log

It is also useful during debugging to collect the errors and outputs in order. This can be done by joining the error and output files together with "-j oe".

#PBS -j oe
#PBS -o logs/combined.log


Output re-direction can be used to send either or both of STDOUT and STDERR to a file. That file may be /dev/null which quietly discards the text. You have to apply output re-direction to every command in your cluster script that generates output. Examples:

echo asdf >/tmp/test             
#This command writes "asdf" to a file named /tmp/test.  Errors are still written to STDERR.

ls /I/do/not/exist 2>/tmp/test 
#This writes the error "ls: cannot access /I/do/not/exist: No such file or directory" to /tmp/test.
#Regular command output is still written normally.

which nvcc >/tmp/test 2>&1
#STDOUT is re-directed to /tmp/test, then STDERR is re-directed to the same place, so both output and errors end up in /tmp/test.

which nvcc >/dev/null 2>&1
#This discards both errors and output.

Environment Variables

Torque provides access to the following useful environment variables:

PBS_NODEFILE     the name of the file containing the list of nodes assigned to the job. This file is used for all MPI cluster jobs.
PBS_O_WORKDIR    the absolute path of the current working directory of the qsub command.
PBS_ARRAYID      the unique identifier assigned to each member of a job array.
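
For example, a minimal sketch of how these variables might be used inside a job script:

 cd ${PBS_O_WORKDIR}                  # run in the directory the job was submitted from
 NPROCS=$(wc -l < ${PBS_NODEFILE})    # count the cores assigned to the job
 echo "Job running on ${NPROCS} cores:"
 cat ${PBS_NODEFILE}                  # node list; typically passed to the MPI launcher for MPI jobs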

Job Arrays

To submit a large number of similar cluster jobs, there are two basic approaches. A shell script can be used to repeatedly call qsub, passing in a customized PBS script each time (either by creating a temporary PBS script file or by piping the PBS commands into qsub).

The preferred approach, which is simpler and potentially more powerful, is to submit a Job Array using one PBS script and a single call to qsub. Job arrays hand off the management of large numbers of similar jobs to the Resource Manager and Scheduler and provide a mechanism that lets cluster users reference an entire set of jobs as though it were a single cluster job.


Submitting Job Arrays

Job arrays are submitted by including the -t option in a call to qsub, or by including the #PBS -t directive in your PBS script. The -t option takes a comma-delimited list of array ID numbers and/or ranges of array IDs, each range written as two numbers separated by a dash.

Each job in the job array will be launched with the same PBS script and in an identical environment--except for the value of its Array ID. The value of the Array ID for each job in a Job Array is stored in the PBS_ARRAYID environment variable.

For example, if a job array is submitted with 10 elements, numbered from 1 - 10, the PBS command would be the following:

 #PBS -t 1-10

An optional parameter, the slot limit, specifies the maximum number of job array elements that can run at one time. It is set by appending a "%" followed by the slot limit value to the end of the #PBS -t line. A twelve-element job array with non-sequential array IDs and a slot limit of 3 could be specified as follows:

 #PBS -t 1-3,5-7,9-11,13-15%3

Each job included in the job array has its own unique array element value stored in the PBS_ARRAYID environment variable. This value can be accessed by the job script just like any other shell environment variable. If the job runs a bash shell script, the job's array ID could be printed to STDOUT using the following command:

 echo "Current job array element's Array ID: ${PBS_ARRAYID}"


Customizing Data for Job Array Elements

A more useful task for the array ID--and the real power of job arrays--would be to use the job's Array ID as a direct or indirect index into the data being processed by the job array.


  • One approach is to use the PBS_ARRAYID value to provide a custom set of input parameters for each job in the job array. To do this, a text file would be created containing multiple lines, each consisting of a series of space-delimited values. Each line in the data file would contain the input parameters needed by one element of the job array. The PBS script would then be modified to include a command that reads in the correct line of the data file, based on the PBS_ARRAYID value of that particular job. While there are many ways to read the appropriate line from the data file, the following serves as a sample implementation, assuming that the data file is called data.dat and is located in the same directory as the script that is run for each element of the job array:
 PARAMETERS=$(awk -v line=${PBS_ARRAYID} '{if (NR == line) { print $0; };}' ${PBS_O_WORKDIR}/data.dat)

Assuming that the executable program/script for the jobs in this array is called test.bash, the PBS script would launch the program with a line like the following:

 ${PBS_O_WORKDIR}/test.bash ${PARAMETERS} > ${PBS_O_WORKDIR}/${PBS_JOBID}.log

A sample PBS script for a Job Array based on these assumptions can be found on the page: Sample Job Array PBS Script
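
Putting the pieces together, a job array script based on this data-file approach might look roughly like the following sketch (the resource values are illustrative, and test.bash and data.dat are the hypothetical names used above):

 #PBS -N ArrayJob
 #PBS -l nodes=1:ppn=1,mem=2000m,walltime=24:00:00
 #PBS -m n
 #PBS -t 1-10

 PARAMETERS=$(awk -v line=${PBS_ARRAYID} '{if (NR == line) { print $0; };}' ${PBS_O_WORKDIR}/data.dat)
 ${PBS_O_WORKDIR}/test.bash ${PARAMETERS} > ${PBS_O_WORKDIR}/${PBS_JOBID}.log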


  • An alternate approach is possible if the unique input parameters needed by each job in the array can be calculated arithmetically. For example, if each instance of the test.bash script needed to loop over a range of values, the PBS script could calculate the max and min values needed for each job directly, based on the value in the PBS_ARRAYID environment variable. If each job's range needed to include 1000 values, this could be done by including commands like the following in the PBS script:
 MAX=$(echo "${PBS_ARRAYID}*1000" | bc)
 MIN=$(echo "(${PBS_ARRAYID}-1)*1000" | bc)

The data file referred to above, data.dat, would not be needed in this approach, and the PBS script call to test.bash would be something like the following:

 ${PBS_O_WORKDIR}/test.bash ${MIN} ${MAX} > ${PBS_O_WORKDIR}/${PBS_JOBID}.log

A sample PBS script for a Job Array based on this approach can be found on the page: Sample Job Array PBS Script without data file


Monitoring Job Arrays

To display information about a job array, the following command can be run from a login node (login1 or login2):

 qstat -n1u <username>
    or
 qstat -nu <username>

To display information about all elements, the running elements, or the queued elements in a job array, the following commands can be used:

 qstat -tn1u  <username>     (all elements)
 qstat -rtn1u <username>     (running elements only)
 qstat -itn1u <username>     (queued elements only)

To display information about a specific job array element, the following command can be used:

 qstat -n1t <Job ID>[<Array Element>]

For example, to display information about array element 3 of the job array with PBS_JOBID 6237485, the command would be "qstat -n1t 6237485[3]".


Managing Job Arrays

To end the processing of an entire job array, the following command can be used (from a login node):

 qdel <Job ID>[]

To end a specific, unwanted element of a job array, the following command can be used:

 qdel <Job ID>[<Array Element>]

For example, to end execution of array element 3 of the job array with PBS_JOBID 6237485, the command would be "qdel 6237485[3]".



Rerunnable Jobs

By default, cluster jobs are considered rerunnable. This means that, if necessary, jobs can be returned to the queue after they have started running on a node (or nodes). These jobs are then re-evaluated and scheduled by the cluster scheduler and launched by the resource manager on the same or different node(s), as though they were running for the first time. This does not happen frequently, but it can occur on rare occasions as a result of unplanned outages or other emergent situations.

In some cases, this default behavior is not desirable; a restarted job starts over from the beginning, and, for example, partial results generated by the incomplete first run might be useful to the researcher.

To prevent a job from being restarted, the following line should be included in the job's PBS script:

#PBS -r n

The syntax needed to set the Rerunnable attribute for a job submitted using qsub from the command line (without a PBS script) is as follows:

 qsub -r n -q <QUEUE> -l nodes=<NODES>:ppn=<CORES>,mem=<MEMORY>,walltime=<WALLTIME> -N <JOB NAME> -M <EMAIL> -m <EMAIL EVENTS>

For example, the following could be used to submit a non-rerunnable 8-core job to the long queue: qsub -r n -q long -l nodes=1:ppn=8 -N TestJob -M username@ku.edu -m abe

Cluster Resource Requests

Torque resource request lists are handled with the "-l" flag.
The most important resources are nodes and cores. Number of nodes, number of cores per node, and node properties are joined with ":" as in the following example, requesting 4 nodes with 8 cores each and the ib (InfiniBand) property:

#PBS -l nodes=4:ppn=8:ib

Other resources such as total memory and walltime are specified on separate lines or separated by commas. This example requests 2 nodes with 12 cores each, 30000 MB of total memory, and 24 hours of walltime.

#PBS -l nodes=2:ppn=12,mem=30000m
#PBS -l walltime=24:00:00

Memory may be allocated in bytes (no suffix), kilobytes ("k"), megabytes ("m"), or gigabytes ("g"). For example, a job using 5 GB of memory may be requested with:

#PBS -l mem=5g

Use caution with memory requests specified in gigabytes. It is preferred to specify memory requests in thousands of megabytes instead.
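
For example, the 5 GB request above could instead be written as:

#PBS -l mem=5000m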

Walltime is the amount of time required for the job to run. It may be specified in Days, Hours, Minutes, and Seconds as DD:HH:MM:SS, or in Hours, Minutes, and Seconds as HH:MM:SS.
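
For example, the two lines below both request 2 days and 12 hours of walltime, first in the DD:HH:MM:SS form and then in the HH:MM:SS form (only one form would be used in an actual script):

#PBS -l walltime=2:12:00:00
#PBS -l walltime=60:00:00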

Hardware Specific Properties

Each type of hardware in the ITTC cluster has an associated property that can be used to submit jobs specifically to nodes of that type.

Some nodes in the cluster connect to each other and to ITTC cluster filesystems using an InfiniBand network. You may request nodes that have InfiniBand ("ib") or those that specifically do not have InfiniBand ("noib"). If neither is specified, jobs may be assigned to either type of node.

#PBS -l nodes=2:ppn=4:ib
#PBS -l nodes=2:ppn=4:noib


The following types of cluster jobs might need to specify, or could potentially benefit from specifying, resource attributes when they are submitted.

  • Cluster jobs that need to run on the same hardware each time:
    • Cluster jobs that perform some form of benchmarking
    • Cluster jobs that generate timing results that need to be correlated across multiple runs
  • Cluster jobs which have specific hardware requirements
  • Cluster jobs that need to run on specific accelerator or co-processor cards
  • Cluster jobs that need low latency network interconnects
  • Cluster jobs that expect or prefer to run on InfiniBand
  • Cluster jobs that run on multiple nodes and could benefit from the close proximity of the nodes


The following resource attributes are defined with corresponding hardware configurations:

  • The cluster hardware resource page has an exhaustive list of the hardware resource attributes available on the cluster.
Attribute CPU Vendor Description
 :intel Intel CPUs including:
 :amd AMD CPUs including:


Attribute Server Form Factor Description
 :blade Dell or IBM blade servers
 :sled Dell C6000 line sled servers
 :gpu Server hosts a GPU capable card
 :bigm Server has a "large memory" capability (16G/CPU core or 32G/CPU core)
 :bigc Server has a "large" core count (64 cores)
 :hefty Dell Poweredge 1450, Intel Xeon 2.8GHz 8-core with 16 GB of memory


Attribute Specific Node / Hardware Type Description
 :del_int_16_64 Dell m620 blades, Intel Xeon 16-core, M of memory
 :del_int_16_256 Dell m620 blades, Intel Xeon 16-core, M of memory
 :del_int_16_512 Dell m620 blades, Intel Xeon 16-core, M of memory
 :del_amd_64_256 Dell m615 blades, AMD Bulldozer 64-core, M of memory
 :ibm_int_12_24 IBM blades, Intel Xeon 2.67GHz 12-core with 24018 MB of memory
 :ibm_int_16_256 IBM blades, Intel Xeon 2.67GHz 12-core with 24018 MB of memory
 :ibm_int_16_512 IBM blades, Intel Xeon 2.67GHz 12-core with 24018 MB of memory
 :del_int_8_16 Dell m600 blades, Intel Xeon 2.83GHz 8-core with 15950 MB of memory
 :del_amd_8_16 Dell m605 blades, AMD Opteron 2.4GHz 8-core with 16010 MB of memory
 :asu_int_12_32 Asus server hosting gpu devices
 :sup_int_12_32 SuperMicro server hosting gpu devices


Attribute Networking Hardware Description
 :ib Nodes with InfiniBand connections
 :noib Nodes without InfiniBand connections
 :eth_10G Nodes with 10G Ethernet connections
 :eth_1G Nodes with 1G Ethernet connections


Attribute CPU Architecture / Unique Features
 :avx
 :sse4_2
 :sse4_1
 :sse4a
 :sse3
 :p5100
 :m2070
 :vgl


Attribute Accelerator / Co-Processor Hardware Description
 :m2070
 :k20
 :kepler
 :mic


Attribute Node Location Description
 rxcy A rack and chassis location ("x"=rack number, "y"=chassis number) common to a homogeneous group of nodes *
 rxcynz A rack, chassis, and node location ("x"=rack number, "y"=chassis number, "z"=node number) specific to a unique node *
  • See the cluster hardware resource page for an exhaustive list of rxcy and rxcynz location specifications.


Examples:

When submitting a job that needs to run exclusively on a quad-core machine, you may request all of the cores and memory with the following resource request:

#PBS -l nodes=1:ppn=4:quad,mem=5852m
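
Similarly, a multi-node job that would benefit from node proximity could restrict itself to a single rack and chassis. Here r3c1 is only a placeholder; see the cluster hardware resource page for the actual rxcy location attributes:

#PBS -l nodes=2:ppn=8:r3c1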