mpi_transp/dbtran - Test Runs

All the runs were submitted between 20 December 2009 and 4 January 2010; JET cluster usage was comparatively light during this period.

Transp ID | LL Job # | Nproc | NPTCL / NPTCLF | Tinit > Ftime (s) | Nstep*NPTCL (M) | Comments
----------|----------|-------|----------------|-------------------|-----------------|---------
50621K49  | 495132   |  7    | 5K / 5K        | 4.75 > 7          |                 | Called GRF3SG (interactive graphics)
          | 495143   |  6    |                |                   |                 | Called Toric, hung up in mastvm_ptri
          | 495178   |  8    |                |                   |                 | Ran to completion (!)
          | 495211   | 16    | 25K / 25K      | 4.75 > 15.25      |   8.15          | 9.082 sec, 163 NUBEAM steps - Comms Failed
          | 495899   |  6    | 2.5K / 2.5K    | 4.75 >            |                 | Ftn Error - nubeam_allreduce_chunks_r8, call to uupper(trim(...)
          | 495945   |       |                |                   |                 | Ftn Error - depall, iokill undefined
          | 496082   |       |                |                   |                 | Ftn Error - saw_flatten, zndv_in undefined
          | 496368   |       |                |                   |                 | Ftn Error - nubeam_step, ierr undefined
          | 496379   |       |                |                   |                 | Ftn Error - deallocate pointer in xplasma2
          | 496870   |       |                |                   |                 | Run Time Error
77694J05  | 494916   |  8    | 500K / 200K    | 7.5 > 17          |  90.3           | 11.22 sec, 129 NUBEAM steps - fails immediately after writing restart files, but for no obvious reason (cf. 495094)
          | 496411   | 12    | 500K / 200K    | 7.5 > 17          |  95.2           | 11.395 sec, 136 NUBEAM steps - Readv Failed
77907J13  | 495094   | 36    | 1000K / 750K   | 5.3 > 9           |  52.5           | 6.2 sec, 30 NUBEAM steps - Failed - stale NFS handle
          | 495144   | 40    |                |                   |  91.0           | 6.75 sec, 52 NUBEAM steps - Failed - mca comms failure
          | 495204   | 24    |                |                   | 239.8           | 8.875 sec, 137 NUBEAM steps - Failed - connection to lifeline lost
          | 497225   | 24    |                | 8.875 > 9         |   7.0           | 9 sec, (141 - 137) NUBEAM steps - Normal completion

Notes: Nstep*NPTCL is quoted in millions of particles processed, i.e. NUBEAM steps completed * (NPTCL + NPTCLF). The "x sec" figures in the Comments column give the time reached when the run stopped.

MTBF

In all, 584M particles (NPTCL + NPTCLF) were processed in 651 NUBEAM steps. With 4 MPI failures and 2 (presumed) NFS failures, the MPI failure rate is 1 in ~163 steps, or 1 in ~146M particles. The NFS failure rate would be half this, but NFS failures are probably better correlated with wallclock time * number of processors, since there is no obvious mechanism for CPU activity on a node to cause NFS to misbehave.
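
As a cross-check, the short Python sketch below recomputes these totals and the MPI failure rate. It is purely illustrative: the per-run step and particle counts are transcribed from the table above, and only runs with a quoted step count are included.

    # Per-run data from the table: (LL job, NUBEAM steps completed,
    # NPTCL + NPTCLF particles per step, failure class).
    runs = [
        (495211, 163,    50_000, "mpi"),  # Comms Failed
        (494916, 129,   700_000, "nfs"),  # died after writing restart files (presumed NFS)
        (496411, 136,   700_000, "mpi"),  # Readv Failed
        (495094,  30, 1_750_000, "nfs"),  # stale NFS handle
        (495144,  52, 1_750_000, "mpi"),  # mca comms failure
        (495204, 137, 1_750_000, "mpi"),  # connection to lifeline lost
        (497225,   4, 1_750_000, "ok"),   # normal completion (141 - 137 steps)
    ]

    steps = sum(s for _, s, _, _ in runs)               # 651
    particles = sum(s * n for _, s, n, _ in runs)       # ~584e6
    mpi_fails = sum(k == "mpi" for _, _, _, k in runs)  # 4

    print(f"1 MPI failure per {steps / mpi_fails:.0f} steps, "
          f"or per {particles / mpi_fails / 1e6:.0f}M particles")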

The product Nproc * Nstep lies between ~1600 and ~3300 for the four jobs terminated by MPI failures, which might indicate that running on more processors increases the chance of a failure, although the statistics are too limited to draw any firm conclusions.
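
For reference, a minimal recomputation of those products from the Nproc and step counts in the table (illustrative only):

    # Nproc * Nstep for the four jobs terminated by MPI failures.
    mpi_failed = {495211: (16, 163), 496411: (12, 136), 495144: (40, 52), 495204: (24, 137)}
    for job, (nproc, nstep) in mpi_failed.items():
        print(job, nproc * nstep)  # 2608, 1632, 2080, 3288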