Runtr - details

Runtr runs the main transp executable, <run>TR.EXE. It requires the namelist ( <run>TR.DAT ) and physics file ( <run>PH.DAT ) as input, and generates <run>XF.PLN and <run>YF.PLN, which contain the 1D and 2D distributions of physics quantities. Jobs are submitted to the jac load leveller queue, which sends each job to the fastest available machine not already running batch work.

To minimise network traffic, and to avoid a single point of failure for all running transp jobs, all the files needed for a run are copied to the local disk of the target machine. At the end of the run, output files are copied back to the TWD. These local disks ( under /tmp ) are only mounted on the machine to which they are attached, so it is necessary to log on to the server machine ( using rlogin or rsh .. ) to monitor the progress of the run.

Job submission

The command JETtransp step runtr initiates the following sequence of actions -

  1. A load leveller job file ( $TWD/_ll_ ) is created, to run the script runsys/JET_ll in the TWD. Standard output and error output go to JET_runtr.out and JET_runtr.err ( a sketch of a typical job file is shown after this list ).
  2. The file $TWD/_ll_ is submitted to the load leveller.
  3. The load leveller runs runsys/JET_ll on the target machine.
  4. JET_ll reads the arguments required for the run from the file $TWD/_ll_args, and runs a second perl script, runsys/JET_runtr.
  5. JET_runtr then
    1. Creates the temporary Transp Server Directory ( $TSD ) on the transp server machine
    2. Exports the files required for the run to $TSD
    3. Executes transp - actually calling TrExe with the argument runtr. Output from transp itself is written to the usual logfile - <run>tr.log
    4. Updates the run status file
    5. Imports all required output files back to $TWD
    6. Removes the lock file, which blocks most JETtransp operations while a run is in progress.
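
The job file itself is a small load leveller command file. The sketch below shows roughly what $TWD/_ll_ might contain; the directive names are standard load leveller keywords, but the values and the exact set of directives are illustrative assumptions - the real file is generated automatically by JETtransp step runtr.

      # illustrative load leveller job file ( cf $TWD/_ll_ )
      # <TWD> is a placeholder for the full path of the transp working directory
      # @ job_name   = <run>_runtr
      # @ initialdir = <TWD>
      # @ executable = runsys/JET_ll
      # @ output     = JET_runtr.out
      # @ error      = JET_runtr.err
      # @ job_type   = serial
      # @ queue

Submission ( step 2 ) then uses the normal load leveller mechanism, e.g. llsubmit $TWD/_ll_ , and progress can be followed with the commands described under Job Status below.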

Job Status

There are several ways to find out what is happening to a job after it has been submitted -

  1. llq
    The llq command ( or better, llq | egrep <username> ) displays a list of load leveller jobs, with their owners and the machines on which they are running.
  2. xloadl &
    creates an X-window display showing the active jobs and the load on each machine in the cluster.
  3. JETtransp stat
    will report the last time interval processed and the last restart file written
  4. rsh jac-<nn> ls -l $TSD
    will list the transp server directory, which contains all the output files for a run. The node on which the job is running ( or ran ) is given in the first line of JET_runtr.out.
  5. rsh jac-<nn> ps -f -u <username>
    will show all the transp processes running on the server machine
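
As a worked example combining the commands above - the node number ( jac-07 ) and the username ( fred ) are hypothetical, so substitute your own -

      # is the job still known to the load leveller, and where is it running ?
      llq | egrep fred
      head -1 $TWD/JET_runtr.out

      # look at the transp server directory and processes on that node
      rsh jac-07 ls -l $TSD
      rsh jac-07 ps -f -u fred

      # how far has the run progressed ?
      JETtransp stat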

Problems

What to check if the job stops before the whole of the requested time interval has been processed -

<run>tr.log

Check the last 1000 lines of the log file -
               tail -1000 $TWD/<run>tr.log | more

  • Is the equilibrium solver having difficulties ?
  • Are there problems with the neutral beam orbits ?
  • Does the run end with a program abort message ?
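
If the log is long, a keyword search can save time. The pattern below is only a suggestion - the exact wording of transp error and abort messages varies -

      tail -1000 $TWD/<run>tr.log | egrep -i 'abort|error' | more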

JET_runtr.err

Error messages ( syserr ) from the run end up in this file - fortran floating point errors, I/O errors, etc.

JET_runtr.out

Errors from the scripts ( file handling, disk quota problems ) will appear here.
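
Both files normally appear in the TWD, so a quick look at the last few lines of each usually shows whether the failure was in the physics or in the file handling -

      tail -50 $TWD/JET_runtr.err
      tail -50 $TWD/JET_runtr.out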

Recovery

File system problems

All output files are written to the TSD under /tmp. They remain there for a few days before being removed by the housekeeping processes, and so can be recovered if the automatic return fails ( e.g. due to disk quota problems ). To copy a file back to the TWD -

rsh jac-<nn> cp -v $TSD/<filename> $TWD

This command may also be used to recover other datafiles written by the RF and neutral beam packages, which are not returned by default.

Physics problems

  1. Identify the process causing the problem ( Eq solver, Neutral beam, RF.. ) from the log messages or fortran traceback. Contact the transp RO for assistance with this.

  2. Inspect the input data associated with this process, particularly for the time period immediately preceding the crash. Is further preparation needed ( smoothing, spike removal ) ? Should the timestep be reduced ?

  3. If the data appears normal, or cannot be further improved, there may be a bug in the code, or namelist parameters may require fine tuning. For expert help with identifying the problem, repeat the run using the Fusion Grid at pppl. For JET runs which use mdsplus to read the data, this is very easy - see the pppl help pages.

Reporting problems

Report problems to the transp RO in the first instance ( by Email ). Please include -

  • The transp run identifier

  • The step at which the problem occurred ( runtr, trdat .. )

  • Any error messages which were displayed. Cut and paste these into the Email if possible.