All the runs were submitted between 20 Dec 2009 and 4 Jan 2010. Usage of the JET cluster was comparatively light during this period.
| Transp ID | LL Job # | Nproc | NPTCL/F | Tinit>Ftime | Comments | Nstep × (NPTCL+NPTCLF), M particles |
|---|---|---|---|---|---|---|
| 50621K49 | 495132 | 7 | 5K / 5K | 4.75 > 7 | Called GRF3SG (Interactive Graphics) | |
| | 495143 | 6 | | | Called Toric, hung up in mastvm_ptri | |
| | 495178 | 8 | | | Ran to completion (!) | |
| | 495211 | 16 | 25K / 25K | 4.75 > 15.25 | 9.082 sec, 163 NUBEAM steps - Comms Failed | 8.15 |
| | 495899 | 6 | 2.5K / 2.5K | 4.75 > | Ftn Error - nubeam_allreduce_chunks_r8 call to uupper ( trim (), | |
| | 495945 | | | | Ftn Error - depall, iokill undefined | |
| | 496082 | | | | Ftn Error - saw_flatten, zndv_in undefined | |
| | 496368 | | | | Ftn Error - nubeam_step, ierr undefined | |
| | 496379 | | | | Ftn Error - Deallocate Pointer in xplasma2 | |
| | 496870 | | | | Run Time Error | |
| 77694J05 | 494916 | 8 | 500K / 200K | 7.5 > 17. | 11.22 sec, 129 NUBEAM steps. Fails immediately after writing restart files, but for no obvious reason - Cf 495094 | 90.3 |
| | 496411 | 12 | 500K / 200K | 7.5 > 17. | 11.395 sec, 136 NUBEAM steps - Readv Failed | 95.2 |
| 77907J13 | 495094 | 36 | 1000K / 750K | 5.3 > 9 | 6.2 sec, 30 NUBEAM steps - Failed - stale NFS handle | 52.5 |
| | 495144 | 40 | | | 6.75 sec, 52 NUBEAM steps - Failed - mca comms failure | 91.0 |
| | 495204 | 24 | | | 8.875 sec, 137 NUBEAM steps - Failed - connection to lifeline lost | 239.8 |
| | 497225 | 24 | | 8.875 > 9 | 9 sec, (141 - 137) NUBEAM steps - Normal completion | 7.0 |
In all, 584M particles (NPTCL+NPTCLF) were processed in 651 NUBEAM steps. With 4 MPI failures and 2 (presumed) NFS failures, the MPI failure rate is 1 in 163 steps, or 1 in 146M particles. The NFS failure rate would be half this, but NFS failures are probably better correlated with wallclock time * number of processors, since there is no obvious mechanism for CPU activity on a node to cause NFS to misbehave.
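The totals and rates quoted above can be re-derived from the table with a short script. This is only a sketch: the tuple layout and the names `RUNS` and `summarise` are illustrative, the particle counts for jobs 495144, 495204 and 497225 are assumed to carry over from the 495094 row (1000K / 750K), and the "mpi" / "nfs" labels are one reading of the Comments column.

```python
# (LL job, Nproc, particles per step in thousands, NUBEAM steps, outcome)
# Particles per step = NPTCL + NPTCLF. For jobs 495144, 495204 and 497225
# the value 1750K is an assumption (carried over from the 495094 row).
# The outcome labels are a reading of the Comments column, not logged data.
RUNS = [
    (495211, 16,   50, 163, "mpi"),  # Comms Failed
    (494916,  8,  700, 129, "nfs"),  # presumed NFS, cf. 495094
    (496411, 12,  700, 136, "mpi"),  # Readv Failed
    (495094, 36, 1750,  30, "nfs"),  # stale NFS handle
    (495144, 40, 1750,  52, "mpi"),  # mca comms failure
    (495204, 24, 1750, 137, "mpi"),  # connection to lifeline lost
    (497225, 24, 1750,   4, "ok"),   # restart of 495204, steps 137 to 141
]


def summarise(runs):
    """Return total steps, total particles (millions) and failure rates."""
    total_steps = sum(steps for _, _, _, steps, _ in runs)
    total_particles_m = sum(k * steps for _, _, k, steps, _ in runs) / 1000.0
    mpi_failures = sum(1 for run in runs if run[4] == "mpi")
    nfs_failures = sum(1 for run in runs if run[4] == "nfs")
    return {
        "total_steps": total_steps,
        "total_particles_m": total_particles_m,
        "mpi_failures": mpi_failures,
        "nfs_failures": nfs_failures,
        "steps_per_mpi_failure": total_steps / mpi_failures,
        "particles_m_per_mpi_failure": total_particles_m / mpi_failures,
    }


summary = summarise(RUNS)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Under these assumptions the script gives 651 steps and about 584M particles, that is roughly one MPI failure per 163 steps or per 146M particles.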
The product NP*Nstep lies between 1600 and 3300 for the 4 jobs terminated by MPI failures, which might indicate that using more processors increases the chance of error, although the statistics are too limited to draw any firm conclusions.
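As a quick check of the quoted range, the product can be computed for the four jobs whose comments indicate an MPI failure. The dictionary name `MPI_FAILED` is illustrative, and the choice of these four jobs is an interpretation of the Comments column.

```python
# LL job -> (Nproc, NUBEAM steps completed before the MPI failure),
# transcribed from the table; the selection of jobs is an assumption.
MPI_FAILED = {
    495211: (16, 163),
    496411: (12, 136),
    495144: (40, 52),
    495204: (24, 137),
}

# NP * Nstep for each failed job.
products = {job: nproc * nstep for job, (nproc, nstep) in MPI_FAILED.items()}

for job, product in sorted(products.items(), key=lambda item: item[1]):
    print(job, product)

print("range:", min(products.values()), "to", max(products.values()))
```

The four products are 1632, 2080, 2608 and 3288, which is the 1600 to 3300 range quoted in the text.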