Several Transp/MPI test runs ( at JET ) failed towards the end of 2009, without enough recorded information to diagnose the cause. Debug versions of mpi_dbtran & later mpi_transp were run extensively over the Christmas break, & some possible causes have been identified.
Debug versions of mpi_dbtran & mpi_transp were compiled & linked with these scripts -
>cat $SC/dbtran.csh
#! /bin/csh -f
#
dbxclear
dbxadd dbport
dbxaddlib nbdebug
dbxaddlib nubeam
dbxaddlib trmpi
dbxaddlib portlib
dbxaddlib trcore
dbxaddlib heater
dbxaddlib trfpi
dbxaddlib outcor
#
cd $DBGDIR
#
uplink mpi_dbtran debug
#
pwd ; ls mpi_dbtran.dbx
mv -v mpi_dbtran.dbx $XE
#
# end dbtran.csh

>cat $SC/transp_dbx
#! /bin/csh -f
#
dbxclear
dbxadd port
dbxaddlib nubeam
dbxaddlib trmpi
dbxaddlib portlib
dbxaddlib trcore
dbxaddlib heater
dbxaddlib trfpi
dbxaddlib outcor
#
cd $DBGDIR
#
uplink mpi_transp debug
#
pwd ; ls mpi_transp.dbx
mv -v mpi_transp.dbx $XE
#
# end dbtran.csh
OpenMPI version 1.4 was configured using the Lahey compiler & with --with-sge ( Sun Grid Engine support ). Environment variables were set to gather more information after a crash -
setenv mpi_param_check 1
setenv mpi_keep_peer_hostnames 1
setenv mpi_abort_print_stack 1
All jobs ran on the JET batch machines ( quad core Sun X2200 servers ).
The mpi_dbtran code was compiled with Lahey/Fujitsu Fortran 95 Express Release L6.20d. Problems were encountered ( in debug mode ) with calls of the form
call cstring( trim(fstring), cstring, 'F2C' )
where the compiler objects to the default ( inout ) declaration of the 1st argument, which here is a function result. cstring adds a terminating null to a Fortran string variable; the 1st argument is in fact input only in this instance, but it is used as an output argument when the 3rd argument is 'C2F', & so cannot be declared intent(in). An additional routine, cstr_F2C, was written to perform the Fortran-to-C conversion only, avoiding the problem. [ Some calls to uupper & ulower had similar problems with trim used in the argument list. ]
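A minimal sketch of the intent behind cstr_F2C ( the argument names & exact interface here are illustrative, not necessarily those in portlib ) -

   subroutine cstr_F2C( fstring, cbuf )
      ! Fortran-to-C conversion only: the input string is never modified,
      ! so it can safely be declared intent(in) & called with trim(...).
      character(len=*), intent(in)  :: fstring
      character(len=*), intent(out) :: cbuf
      integer :: ilen
      ilen = len_trim(fstring)
      cbuf = ' '
      cbuf(1:ilen) = fstring(1:ilen)
      if( ilen < len(cbuf) ) cbuf(ilen+1:ilen+1) = char(0)   ! append the terminating null
   end subroutine cstr_F2C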
Job 495132 failed when one MPI server process called an interactive graphical routine, r8_grf3f1 ( from nbatom ). This code was included under the cpp switch _DEBUG, which I changed to _GDEBUG as a quick workaround; it might be better to skip the call when running in batch mode, if this can be detected at run time. The reference to the graphics library sglib.a ( in mpi_dbtran_exe.link ) was replaced by sgdummy.a, but despite this substitution there were still problems with mpi_dbtran calling graphics routines while running in batch mode, so from run 495211 onwards the debug version of mpi_transp was used instead.
Running ( debug compiled ) mpi_transp revealed several trivial errors where uninitialised variables were used ( see below ), and one array deallocation error, which may be more serious.
The environment variable mpi_param_check was set in the csh wrapper script before starting the executable. This should preserve more MPI debugging information -
mpi_param_check: If set to true (any positive value), and when Open MPI is compiled with parameter checking enabled (the default), the parameters to each MPI function can be passed through a series of correctness checks. Problems such as passing illegal values (e.g., NULL or MPI_DATATYPE_NULL or other "bad" values) will be discovered at run time and an MPI exception will be invoked (the default of which is to print a short message and abort the entire MPI job). If set to 0, these checks are disabled, slightly increasing performance.
16 runs were submitted on 3 shots, and are listed in this table. Six runs failed after processing significant amounts of data - one of these ( 495094 ) failed after an error from NFS, and a second ( 494916 ) failed at a similar stage ( just after writing the restart files ) but with no error reported. The remaining four runs appear to have suffered a communications failure between two or three nodes, typically ending with 'readv failed' errors of various kinds.
This run failed when attempting to run the script codesys/csh/tr_replace_dir using a system call from subroutine outcor/wrstrt ( the shell reported 'tr_replace_dir: command not found', status 127 ). The previous similar operation ( running csh/ascii_restart ) had failed with a 'Stale NFS file handle' error, indicating that the Transp system disk was no longer available on the node running process 0.
There are dozens of calls to jsystem from Transp, any of which could potentially access NFS volumes. An immediate improvement would be to call logmod from portlib/jsystem before & after the system call, & flush the logfile buffer, so that any failures of this nature can be detected; a sketch of the kind of logging wrapper intended is shown below. The underlying problem might be fixed by copying any files & programs called via jsystem to the local disks ( /tmp at JET ) of the processors running a job, to avoid or reduce the reliance on NFS mounted disks.
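A sketch of the idea only - the real portlib/jsystem & logmod interfaces may differ, & the routine names logflush & the logmod call form used here are assumptions -

   subroutine jsystem_logged( cmd, istat )
      ! Log the command before & the exit status after the system call,
      ! flushing the logfile each time, so the record survives a node
      ! or NFS failure part-way through.
      character(len=*), intent(in)  :: cmd
      integer,          intent(out) :: istat
      character(len=16) :: zstat
      call logmod( ' jsystem: '//trim(cmd) )        ! assumed logging interface
      call logflush()                               ! assumed buffer-flush routine
      call jsystem( trim(cmd), istat )              ! portlib routine ( signature assumed )
      write(zstat,'(i12)') istat
      call logmod( ' jsystem status: '//trim(adjustl(zstat)) )
      call logflush()
   end subroutine jsystem_logged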
*** FULL NEUTRAL SOURCE CALCULATION PERFORMED *** NSTEP= 597
TA= 6.20000E+00 CPU TIME= 5.00000E-02 SECONDS. DT= 6.25000E-04
/common/transp_shared/Code/transp/JET_49/codesys/csh/ascii_restart: Stale NFS file handle.
** warning: csh/ascii_restart error!
%ps_rezone_chk: no rezone operator available (no MHD equilibrium).
%ps_rezone_chk: no rezone operator available (no MHD equilibrium).
%ps_rezone_chk: no rezone operator available (no MHD equilibrium).
?trmpi_requestor: MPI_BCAST:-NUBEAM_SAVE
B>000.126209E+10 NUBEAM_SAVE
% nbi_states: fld_states write OK to filename: 77907J13_fi/77907J13_nbi_fld_state.cdf
sh: tr_replace_dir: command not found
?wrstf: tr_replace_dir 77907J13_rs_save 77907J13_rs_inPr
returned status = 127
?wrstf -- rename to save failed!
Run 494916 also failed at the same stage ( just after creating a new restart directory ), but without printing an error message, & so has probably failed for the same reason.
This run apparently suffered a communications failure between a group of nodes, on the 52nd nubeam step ( 6.75 sec ). Processes 0, 1, 2 & 3 ( on batch-023 ) ended with the error
mpi_0/77907J13tr.log:[batch-023][[923,1],0] [../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: No route to host (113)
as did processes 9, 10 & 11 on batch-015. Process 8 on batch-015 ended with the similar error
mpi_8/77907J13tr.log:[batch-015][[923,1],8] [../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)
This OpenMPI FAQ ( item 6 ) suggests that multiple TCP connections between nodes may be a possible cause - further investigation is needed.
Mpi_19
07Jan2010 20:26:51.76 !
[batch-010:15655] [[24652,1],19] routed:binomial: Connection to lifeline
[[24652,0],4] lost
Mpi-11 ( portlib..log )
07Jan2010 20:26:51.76 !
[batch-004:28917] [[24652,1],11] routed:binomial: Connection to lifeline
[[24652,0],2] lost
Mpi-3
07Jan2010 20:26:51.76 !
[batch-003:19587] [[24652,1],7] routed:binomial: Connection to lifeline
[[24652,0],1] lost
Another communications failure ??
mpi-1, mpi-2 ( batch-001)
[batch-001][[28799,1],1] [../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[batch-001][[28799,1],1] [../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
mpi-4 ( batch-005 ) & mpi-8 ( batch-015 )
[batch-005][[28799,1],4] [../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[batch-005][[28799,1],4] [../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[batch-005:01024] [[28799,1],4] routed:binomial: Connection to lifeline [[28799,0],1] lost
The mpi_0/..tr.log file switches to Tx graphics mode before the end of the run, so it is unclear what was happening immediately before termination.
& again -
./mpi_1/77694J05tr.log:[batch-013][[15570,1],1]
[../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
./mpi_1/77694J05tr.log:[batch-013][[15570,1],1]
[../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
./mpi_2/77694J05tr.log:[batch-013][[15570,1],2]
[../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
./mpi_2/77694J05tr.log:[batch-013][[15570,1],2]
[../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
./mpi_4/77694J05tr.log:[batch-015][[15570,1],4]
[../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
./mpi_4/77694J05tr.log:[batch-015][[15570,1],4]
[../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
./mpi_8/77694J05tr.log:[batch-016][[15570,1],8]
[../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
./mpi_8/77694J05tr.log:[batch-016][[15570,1],8]
[../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
Unless otherwise noted, the undefined variables were set to a presumed default of zero. See comments in the updated source code for details.
There is an assignment to dummy argument 1, which is associated with a constant, an expression that is not a variable, do variable or variable with intent(in) attribute actual argument.
Error occurs at or near line 1 of uupper_
Called from or near line 1551 of nubeam_allreduce_chunks_r8_
Called from or near line 234 of depall_buf_reduce_
The variable (iokill) has an undefined value.
Error occurs at or near line 488 of depall_
Called from or near line 206 of nubeam_step_child_
The variable (zndv_in) has an undefined value.
Error occurs at or near line 21 of saw_flatten_
Called from or near line 423 of tr_fptotr2_
Called from or near line 349 of tr_fptotr_
Called from or near line 586 of fimain_
Here the array znpre in trfpi/tr_fptotr2 has more elements than there are zones, so some elements are undefined. However, the problem was not resolved by presetting the array to zero; it was necessary to restrict the elements referenced in saw_flatten to those corresponding to the defined zones, which may indicate that a further problem still exists. A self-contained illustration of the restriction is given below.
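A minimal illustration of passing only the defined zone elements - the names & the flattening operation here are invented for the example, not the actual trfpi / saw_flatten interfaces -

   program zones_demo
      implicit none
      integer, parameter :: nzmax  = 100      ! declared array size
      integer, parameter :: nzones = 20       ! zones actually defined
      real(kind=8)       :: znpre(nzmax)
      znpre(1:nzones) = 1.0d0                 ! only these elements are defined
      ! passing znpre(1:nzones) rather than znpre keeps the called routine
      ! from touching the undefined trailing elements
      call flatten_demo( znpre(1:nzones), nzones )
   contains
      subroutine flatten_demo( zn, n )
         integer, intent(in)         :: n
         real(kind=8), intent(inout) :: zn(n)
         zn(1:n) = sum(zn(1:n)) / real(n,kind=8)   ! flatten to the mean value
      end subroutine flatten_demo
   end program zones_demo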
The variable (ierr) has an undefined value.
Error occurs at or near line 62 of nubeam_step_
Called from or near line 5185 of nbi_pstypes.nubeam_ctrl_step_
Called from or near line 232 of nubeam_component_step_
A DEALLOCATE statement cannot be executed for a pointer (1-th argument:d%enames) whose target was not created by an ALLOCATE statement.
jwe0019i-u The program was terminated abnormally with signal number SIGSEGV.
This error was traced to subroutine xplist_free in xplasma2/xplasma_obj. The existing code
if(associated(d%enames)) deallocate(d%enames)
was replaced with
if( associated(d%enames) ) then
   ! if( allocated(d%enames) ) &
   deallocate( d%enames, stat=istat )
   if( istat /= 0 ) &
      call errorlog( cSr//' -E- enames :', istat )
   d%enames => null()   ! leaks ?? what leaks..
endif
!
This is not very satisfactory - the if(allocated()) test would not compile, & has been commented out, yet the deallocate statement was accepted.
Metcalf notes that the deallocate statement can be used only for a target allocated through a pointer, and explicitly excludes a pointer associated with an allocatable array; unfortunately he does not go on to say what should be done in this case. However, the status test should prevent the program from terminating immediately, and the error log entry will help trace any resulting memory leak. The distinction is illustrated below.
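A minimal illustration of the rule ( not Transp code; the names are invented for the example ) -

   program dealloc_rule
      implicit none
      real, allocatable, target :: a(:)
      real, pointer             :: p(:)

      allocate( p(10) )       ! target created through the pointer
      deallocate( p )         ! legal

      allocate( a(10) )
      p => a                  ! pointer associated with an allocatable array
      ! deallocate( p )       ! NOT legal - the target was not allocated via p
      p => null()             ! disassociate the pointer first
      deallocate( a )         ! the allocatable array itself must be deallocated
   end program dealloc_rule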
SGDUMMY -I- INIT called
*** glibc detected *** /common/transp_shared/Code/transp/JET_49/codesys/exe/mpi_transp.dbx: double free or corruption (!prev): 0x25a87030 ***
======= Backtrace: =========
/lib/libc.so.6[0x2fc3a4]
/lib/libc.so.6(cfree+0x96)[0x2fe356]
/usr/local/lib/libfj9i6.so.1(jwe_xfree_align+0x21)[0xb37d6ed1]
/usr/local/lib/libfj9i6.so.1(jwe_xbfh+0x1e)[0xb382b322]
/usr/local/lib/libfj9i6.so.1(jwe_etbf+0x294)[0xb38317c4]
/usr/local/lib/libfj9i6.so.1(jwe_etbs+0x143)[0xb38314f7]
/usr/local/lib/libfj9i6.so.1[0xb3831034]
/usr/local/lib/libfj9i6.so.1(jwe_etbm+0xf7)[0xb382fb4b]
/usr/local/lib/libfj9i6.so.1(errtra_+0x4d)[0xb3823c95]
/common/transp_shared/Code/transp/JET_49/codesys/exe/mpi_transp.dbx(initt_+0x17)[0x9e58c37]
/common/transp_shared/Code/transp/JET_49/codesys/exe/mpi_transp.dbx(fp_rfplot_+0x4c)[0x8815e6c]
/common/transp_shared/Code/transp/JET_49/codesys/exe/mpi_transp.dbx(rfprof_+0x804)[0x87bd874]
/common/transp_shared/Code/transp/JET_49/codesys/exe/mpi_transp.dbx(rfheater_+0x2685)[0x87b9b55]
/common/transp_shared/Code/transp/JET_49/codesys/exe/mpi_transp.dbx(fpprf_+0x574)[0x877e7b4]
/common/transp_shared/Code/transp/JET_49/codesys/exe/mpi_transp.dbx(fimain_+0xc20)[0x8705320]
This seems to occur intermittently when fppmod/fp_rfplot calls ( a dummy version of ) INITT, which in turn calls the lf95 routine ERRTRA to produce a stack trace. It is not clear why any graphics routines are being called, so I commented out the fp_rfplot call as a workaround.
To a first approximation, an MPI job might run for 2500/nproc Neutral Beam steps, i.e. around 200 steps for a 12 processor job. Since the jobs can be restarted, this amounts to a fairly major inconvenience rather than a showstopper. Further investigations are continuing with OpenMPI 1.3.3 ( as rebuilt with Lahey, & also using the default system libraries at runtime ). If this version shows similar problems, then a posting on the OpenMPI support pages might be an idea.
There are 3 areas where the code could be made more robust -
- jsystem calls
- allocate / deallocate statements
- File open statements & file IO
jsystem calls should check that any script used is accessible from the calling program. Allocate & IO statements should use the status variable ( which in itself will allow the code to continue ), and write an unbuffered error message ( using logmod ) before terminating the run via bad_exit, which will call MPI_ABORT. [ This is necessary to stop OpenMPI jobs, which otherwise hang, consuming CPU time & processor slots. ] A sketch of the checked-allocation pattern is given below.
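A sketch of the checked-allocation pattern only - the logmod & bad_exit call forms used here are assumptions, not copied from portlib -

   subroutine alloc_work_checked( nzones )
      ! Allocate a work array, report any failure through the logfile,
      ! & stop the whole MPI job cleanly instead of crashing or hanging.
      integer, intent(in)       :: nzones
      integer                   :: istat
      character(len=16)         :: zstat
      real(kind=8), allocatable :: work(:)
      allocate( work(nzones), stat=istat )
      if( istat /= 0 ) then
         write(zstat,'(i12)') istat
         call logmod( ' alloc_work_checked: allocate failed, stat = '//trim(adjustl(zstat)) )  ! assumed logging interface
         call bad_exit    ! assumed to call MPI_ABORT so the whole OpenMPI job stops
      endif
      ! ... use work(:) here ...
      deallocate( work, stat=istat )
   end subroutine alloc_work_checked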
Debug code. It would be worthwhile to compile & run the debug version of the code either more frequently or with a stricter compiler, to avoid an accumulation of trivial errors and possibly identify any more serious problems.