The initial requirement is to provide a defined interface to the local job submission procedures, so that the Toric test cases can be run automatically after the build process. (These jobs generally take too long to run from the command prompt.) In more detail, we require a procedure which:
- Submits a serial or parallel job to the local batch system
- Allows the user to control submission parameters (number of processors, notification, ...)
- Does not require the user to edit any files
- Allows more than one job to be submitted concurrently, using the same input data.
Such a procedure would be generally useful in other cases, if it could replace the usual job-specific scripts which are edited for each run.
The prototype system described here was developed to submit jobs to the JET LoadLeveler system. This requires two scripts:
- A job description, which is submitted to the job scheduler, and
- A (Unix shell) script, which is a wrapper for the binary executable. This script also contains the MPI commands needed to start parallel processes.
The Lrun procedure prepares these two scripts and inserts parameters as needed. The parameters are set by the script or makefile which submits the job, and may refer to (shell) environment variables. The parameters may also request user input, e.g. to define the number of processors to be requested for a parallel job.
A prototype Lrun procedure has been written in Perl, for test purposes. It is called with two arguments:
- A filehandle for inline data containing lines for the shell script, and
- A hash of symbol/value substitutions to be made in the output scripts
An example script, which runs the toric_main code distributed with Transp, is described here.
Evaluation of this and earlier prototypes showed that, while the Job description file submitted to the batch system was almost standard and could be prepared using symbol/value substitution alone, the shell wrapper scripts for different codes tended to require different features. Consequently, the skeleton file for the Job description is stored in the Lrun module, while the wrapper file skeleton is provided as an argument.
More important than the actual implementation is the definition of an agreed
set of variables, particularly for the Job description file, so that programs
using Lrun are portable between sites. The skeleton Job description for a JET
batch job is
    __DATA__
    # ll script template
    # @ executable = &LLSCR&
    # @ input = &STDIN&
    # @ output = &STDOUT&
    # @ error = &STDERR&
    # @ initialdir = &RUNDIR&
    # @ notify_user = &USER&
    # @ notification = complete
    --P jobtype = openmpi
    --P max_processors = &NPMAX&
    --P min_processors = &NPMIN&
    # @ queue
    _END_
Variables are delimited by the "&" character in the skeleton scripts. Lines starting with "--P" are converted to script lines for parallel jobs, and skipped for serial jobs.
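The substitution rule above can be illustrated with a minimal Python sketch. This is not the Lrun code (which is written in Perl); the function name `render_job_description`, the value `run.sh`, and in particular the rule that a "--P" line becomes an ordinary "# @" keyword line are assumptions made for illustration only.

```python
import re

# Hypothetical names for illustration; not taken from the Lrun module.
MARKER = re.compile(r"&([A-Z]+)&")
PARALLEL_PREFIX = "--P"


def render_job_description(skeleton, values, parallel):
    """Fill &NAME& markers and handle "--P" lines in a skeleton."""
    out = []
    for line in skeleton.splitlines():
        if line.startswith(PARALLEL_PREFIX):
            if not parallel:
                continue  # serial job: skip parallel-only lines
            # Assumption: a "--P" line becomes an ordinary "# @" keyword line.
            line = "# @" + line[len(PARALLEL_PREFIX):]
        # A KeyError here means the skeleton uses a symbol with no value.
        out.append(MARKER.sub(lambda m: str(values[m.group(1)]), line))
    return "\n".join(out) + "\n"


skeleton = """# @ executable = &LLSCR&
--P min_processors = &NPMIN&
# @ queue"""

print(render_job_description(skeleton, {"LLSCR": "run.sh", "NPMIN": 4}, parallel=True))
```

For a serial job the "--P" line is dropped before substitution, so the caller does not need to supply NPMIN or NPMAX at all.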
Table 1 lists the variables used in Lrun, and additional variables which may be used in the wrapper scripts:
Variable | Description | Notes |
ARGS | Arguments for the executable | |
EXE | Executable binary | |
EXEDIR | Directory containing EXE | |
INIT | Initialisation file | Defaults to /dev/null |
LLSCR | Wrapper script to be run | This script is generated by Lrun |
MPI | MPI flag (Y or N) | |
NPMIN | Minimum number of processors required | MPI only |
NPMAX | Maximum number of processors required | MPI only |
PID | (Fairly) unique process ID | This is the ID of the Perl process which calls Lrun |
RUNDIR | Directory where the batch job runs | LLSCR file is written to this directory |
STDIN | Input file | |
STDOUT | Output file | |
STDERR | Error output file | |
USER | Account to notify on job completion | |
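The two built-in values in the table (INIT and PID) could be filled in along the following lines. This is a Python sketch only: the function name `with_defaults` is hypothetical, the process ID here is that of the Python process rather than the Perl one, and letting caller-supplied values override the defaults is an assumption.

```python
import os


def with_defaults(params):
    """Return a copy of params with the table's built-in values filled in."""
    merged = {
        "INIT": "/dev/null",      # table: INIT defaults to /dev/null
        "PID": str(os.getpid()),  # table: ID of the calling process
    }
    merged.update(params)  # assumption: caller-supplied values take precedence
    return merged


params = with_defaults({"EXE": "toric_main", "MPI": "N"})
print(params["INIT"], params["PID"])
```

Because PID differs between calling processes, it can be used to keep the generated files of concurrent runs apart when they share the same input data.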
Submitting parallel jobs requires additional lines in the job description file (to control processor allocation). The Toric submission script Runone.pl initiates a parallel run with the following hash definitions:
    if( $mpi ) {
        $hash{MPI}   = "Y";
        $hash{NPMIN} = "?NPMIN";
        $hash{NPMAX} = "?NPMAX";
    } else {
        $hash{MPI} = "N";
    }
The initial "?" in the NPMIN and NPMAX value fields causes Lrun to prompt the user to enter values for these parameters.
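The prompting convention can be sketched in Python as follows. The function name `resolve_prompts`, the prompt wording, and the injectable `ask` callable are assumptions for illustration; the actual Lrun behaviour may differ in detail.

```python
def resolve_prompts(params, ask=input):
    """Replace values of the form "?NAME" with an answer obtained from ask."""
    resolved = {}
    for key, value in params.items():
        if isinstance(value, str) and value.startswith("?"):
            # Assumption: the text after "?" is used as the prompt label.
            value = ask(f"Enter value for {value[1:]}: ").strip()
        resolved[key] = value
    return resolved


# Canned answers stand in for the user so the example runs unattended.
answers = iter(["2", "8"])
params = {"MPI": "Y", "NPMIN": "?NPMIN", "NPMAX": "?NPMAX"}
resolved = resolve_prompts(params, ask=lambda prompt: next(answers))
print(resolved)
```

Passing the prompt function as an argument keeps the interactive default (`input`) while allowing a makefile-driven or test run to supply the answers.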
Lrun provides these facilities, in addition to the basic script editing operation:
- Error checking: Lrun checks that any environment variables referenced are defined, and will create RUNDIR if it does not exist (to support creation of a new directory for each run).
A script which uses these features (to submit the Toric test cases) is described here.
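The two error checks described above (undefined environment variables, and a missing RUNDIR) might look like this in a Python sketch. The `$NAME` reference syntax, the function names `expand_env` and `prepare_rundir`, and the variable `BASE` are assumptions; the source does not state how parameters refer to environment variables.

```python
import os
import re
import tempfile

ENV_REF = re.compile(r"\$(\w+)")  # assumed reference syntax: $NAME


def expand_env(value, env=os.environ):
    """Expand $NAME references, failing on any undefined variable."""
    def lookup(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not defined")
        return env[name]
    return ENV_REF.sub(lookup, value)


def prepare_rundir(path):
    """Create the run directory if it does not already exist."""
    os.makedirs(path, exist_ok=True)
    return path


# Usage with a throwaway directory and an explicit environment mapping.
base = tempfile.mkdtemp()
rundir = prepare_rundir(expand_env("$BASE/run_1", env={"BASE": base}))
print(rundir)
```

Failing early on an undefined variable avoids submitting a job whose paths contain an unexpanded reference.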