mpi_transp/dbtran - Test Runs

All the runs were submitted between 20 December 2009 and 4 January 2010; JET cluster usage was comparatively light during this period.

Transp ID | LL Job # | Nproc | NPTCL / NPTCLF | Tinit > Ftime (s) | Nstep*NPTCL (M) | Comments
----------|----------|-------|----------------|-------------------|-----------------|---------
50621K49  | 495132   |  7    | 5K / 5K        | 4.75 > 7          |                 | Called GRF3SG (interactive graphics)
          | 495143   |  6    |                |                   |                 | Called Toric, hung up in mastvm_ptri
          | 495178   |  8    |                |                   |                 | Ran to completion (!)
          | 495211   | 16    | 25K / 25K      | 4.75 > 15.25      |   8.15          | 9.082 sec, 163 NUBEAM steps - Comms Failed
          | 495899   |  6    | 2.5K / 2.5K    | 4.75 >            |                 | Ftn Error - nubeam_allreduce_chunks_r8, call to uupper(trim(...)
          | 495945   |       |                |                   |                 | Ftn Error - depall, iokill undefined
          | 496082   |       |                |                   |                 | Ftn Error - saw_flatten, zndv_in undefined
          | 496368   |       |                |                   |                 | Ftn Error - nubeam_step, ierr undefined
          | 496379   |       |                |                   |                 | Ftn Error - deallocate pointer in xplasma2
          | 496870   |       |                |                   |                 | Run Time Error
77694J05  | 494916   |  8    | 500K / 200K    | 7.5 > 17          |  90.3           | 11.22 sec, 129 NUBEAM steps - fails immediately after writing restart files, but for no obvious reason (cf. 495094)
          | 496411   | 12    | 500K / 200K    | 7.5 > 17          |  95.2           | 11.395 sec, 136 NUBEAM steps - Readv Failed
77907J13  | 495094   | 36    | 1000K / 750K   | 5.3 > 9           |  52.5           | 6.2 sec, 30 NUBEAM steps - Failed - stale NFS handle
          | 495144   | 40    |                |                   |  91.0           | 6.75 sec, 52 NUBEAM steps - Failed - mca comms failure
          | 495204   | 24    |                |                   | 239.8           | 8.875 sec, 137 NUBEAM steps - Failed - connection to lifeline lost
          | 497225   | 24    |                | 8.875 > 9         |   7.0           | 9 sec, (141 - 137) NUBEAM steps - Normal completion

Notes: Nstep*NPTCL is quoted in millions of particles processed, i.e. NUBEAM steps completed * (NPTCL + NPTCLF). The "x sec" figures in the Comments column give the time reached when the run stopped.

MTBF

In all, 584M particles (NPTCL + NPTCLF) were processed in 651 NUBEAM steps. With 4 MPI failures and 2 (presumed) NFS failures, the MPI failure rate is 1 in ~163 steps, or 1 in ~146M particles. The NFS failure rate would be half this, but NFS failures are probably better correlated with wallclock time * number of processors, since there is no obvious mechanism for CPU activity on a node to cause NFS to misbehave.
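
As a cross-check, the short Python sketch below recomputes these totals and the MPI failure rate. It is purely illustrative: the per-run step and particle counts are transcribed from the table above, and only runs with a quoted step count are included.

    # Per-run data from the table: (LL job, NUBEAM steps completed,
    # NPTCL + NPTCLF particles per step, failure class).
    runs = [
        (495211, 163,    50_000, "mpi"),  # Comms Failed
        (494916, 129,   700_000, "nfs"),  # died after writing restart files (presumed NFS)
        (496411, 136,   700_000, "mpi"),  # Readv Failed
        (495094,  30, 1_750_000, "nfs"),  # stale NFS handle
        (495144,  52, 1_750_000, "mpi"),  # mca comms failure
        (495204, 137, 1_750_000, "mpi"),  # connection to lifeline lost
        (497225,   4, 1_750_000, "ok"),   # normal completion (141 - 137 steps)
    ]

    steps = sum(s for _, s, _, _ in runs)               # 651
    particles = sum(s * n for _, s, n, _ in runs)       # ~584e6
    mpi_fails = sum(k == "mpi" for _, _, _, k in runs)  # 4

    print(f"1 MPI failure per {steps / mpi_fails:.0f} steps, "
          f"or per {particles / mpi_fails / 1e6:.0f}M particles")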

The product Nproc * Nstep lies between ~1600 and ~3300 for the four jobs terminated by MPI failures, which might indicate that running on more processors increases the chance of a failure, although the statistics are too limited to draw any firm conclusions.
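
For reference, a minimal recomputation of those products from the Nproc and step counts in the table (illustrative only):

    # Nproc * Nstep for the four jobs terminated by MPI failures.
    mpi_failed = {495211: (16, 163), 496411: (12, 136), 495144: (40, 52), 495204: (24, 137)}
    for job, (nproc, nstep) in mpi_failed.items():
        print(job, nproc * nstep)  # 2608, 1632, 2080, 3288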