Logfiles

Normal execution

>mpirun -np 4 -host jac-123,jac-122,jac-121,jac-120 mpiPi
/bin/tcsh running /home/jconboy/.tcshrc level 1
/bin/tcsh running /home/jconboy/.tcshrc level 1
/bin/tcsh running /home/jconboy/.tcshrc level 1
1 > MPI_BCAST
=========================================================
Enter the number of intervals: 
or Quit : 
Master / Slaves 
0 > FINALIZE / FINALIZE 
-1 > FINALIZE / BCAST 
-2 > ABORT / BCAST 
3 > MPI_BCAST
2 > MPI_BCAST
1000000
Enter 0 < nerr < 1000000 to provoke segv 
or 0 > nerr > -1000000 to call mpi_abort ( slave )
mpi_finalize ( master ) 
0
3 > MPI_BCAST
1 > MPI_BCAST
2 > MPI_BCAST
=========================================================
pi is approximately: 3.1415926535899030
Error is : 0.0000000000001101
=========================================================
Enter the number of intervals: 
or Quit : 
Master / Slaves 
0 > FINALIZE / FINALIZE 
-1 > FINALIZE / BCAST 
0 > MPI_BCAST
-2 > ABORT / BCAST 
0
1 > MPI_FINALIZE
2 > MPI_FINALIZE
=========================================================
3 > MPI_FINALIZE
0 > MPI_BCAST
0 > MPI_FINALIZE
0 MPI_FINALIZE > 0
1 MPI_FINALIZE > 0
2 MPI_FINALIZE > 0
3 MPI_FINALIZE > 0
>

Calculation with 1000000 intervals, followed by 'normal' termination ( all four processes call MPI_FINALIZE )
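The "all processes in MPI_FINALIZE" check can be made mechanically from the trace lines. The following is a minimal sketch, not part of the original test program: it assumes that every trace line has the form "rank > MPI_CALL" as in the log above, and the name last_call_per_rank is hypothetical.

```python
import re

# Trace lines look like "3 > MPI_BCAST": a rank, " > ", then the MPI call entered.
# Menu lines such as "-1 > FINALIZE / BCAST" lack the MPI_ prefix and are skipped.
TRACE = re.compile(r"^(\d+) > (MPI_[A-Z]+)$")

def last_call_per_rank(log_text):
    """Return {rank: name of the last MPI call that rank reported entering}."""
    last = {}
    for line in log_text.splitlines():
        m = TRACE.match(line.strip())
        if m:
            last[int(m.group(1))] = m.group(2)
    return last

# Excerpt from the end of the normal-execution log.
sample = """\
0 > FINALIZE / FINALIZE
-1 > FINALIZE / BCAST
0 > MPI_BCAST
-2 > ABORT / BCAST
0
1 > MPI_FINALIZE
2 > MPI_FINALIZE
=========================================================
3 > MPI_FINALIZE
0 > MPI_BCAST
0 > MPI_FINALIZE
"""

calls = last_call_per_rank(sample)
print(calls)
# True when every rank's last reported call is MPI_FINALIZE.
print(all(call == "MPI_FINALIZE" for call in calls.values()))
```

For a run where only the master reaches MPI_FINALIZE, the same function reports MPI_BCAST for ranks 1 to 3.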

Abnormal termination ( FINALIZE / BCAST )

 pi is approximately: 3.1415926535899030
Error is : 0.0000000000001101
=========================================================
Enter the number of intervals: 
or Quit : 
Master / Slaves 
0 > FINALIZE / FINALIZE 
-1 > FINALIZE / BCAST 
-2 > ABORT / BCAST 
0 > MPI_BCAST
-1
1 > MPI_BCAST
2 > MPI_BCAST
3 > MPI_BCAST
=========================================================
0 > MPI_BCAST
0 > MPI_FINALIZE
jac-120
jconboy 12389 1 0 11:00 ? Ss 0:00 orted --bootproxy 1 --name 0.0.4 --num_procs 5 --vpid_start 0  
jconboy 12390 12389 89 11:00 ? R 3:03 \_ mpiPi
jac-121
jconboy 1152 1 0 11:00 ? Ss 0:00 orted --bootproxy 1 --name 0.0.3 --num_procs 5 --vpid_start 0 
jconboy 1153 1152 95 11:00 ? R 3:15 \_ mpiPi
jac-122
jconboy 12117 1 0 11:00 ? Ss 0:00 orted --bootproxy 1 --name 0.0.2 --num_procs 5 --vpid_start 0  
jconboy 12118 12117 99 11:00 ? R 3:26 \_ mpiPi
jac-123
jconboy 13825 2478 0 11:00 pts/5 S+ 0:00 | \_ mpirun -np 4 -host jac-123,jac-122,jac-121,jac-120 mpiPi

jconboy 13828 1 0 11:00 ? Ss 0:00 orted --bootproxy 1 --name 0.0.1 --num_procs 5 --vpid_start 0 
jconboy 13837 13828 0 11:00 ? S 0:00 \_ mpiPi

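Abnormal termination - the master calls MPI_FINALIZE while the slaves wait in MPI_BCAST. The ps listing shows the slave mpiPi processes on jac-120, jac-121 and jac-122 still running ( state R ), while the master mpiPi on jac-123 sleeps ( state S ).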
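Left-over processes can be picked out of a per-host ps listing like the one above. This is a rough sketch under stated assumptions: the column order ( user, pid, ppid, cpu, start, tty, state, time, command ) is read off the listing, a line holding a single word is taken to be a host name, and the name mpipi_states is hypothetical.

```python
import re

# A line holding one word that starts with a letter is treated as a host name.
HOST = re.compile(r"^[A-Za-z][\w.-]*$")

def mpipi_states(listing, program="mpiPi"):
    """Return [(host, pid, state)] for every process whose command is `program`."""
    host = None
    found = []
    for raw in listing.splitlines():
        line = raw.strip()
        if HOST.match(line):
            host = line
            continue
        fields = line.split(None, 8)   # 8 fixed columns, then the command
        if len(fields) < 9:
            continue                   # blank lines, "--" separators
        state = fields[6]
        command = fields[8].lstrip("|\\_ ")   # drop the process-tree decoration
        if command.split()[:1] == [program]:
            found.append((host, int(fields[1]), state))
    return found

# Two hosts taken from the listing above.
sample = r"""
jac-120
jconboy 12389 1 0 11:00 ? Ss 0:00 orted --bootproxy 1 --name 0.0.4 --num_procs 5 --vpid_start 0
jconboy 12390 12389 89 11:00 ? R 3:03 \_ mpiPi
jac-123
jconboy 13825 2478 0 11:00 pts/5 S+ 0:00 | \_ mpirun -np 4 -host jac-123,jac-122,jac-121,jac-120 mpiPi
jconboy 13828 1 0 11:00 ? Ss 0:00 orted --bootproxy 1 --name 0.0.1 --num_procs 5 --vpid_start 0
jconboy 13837 13828 0 11:00 ? S 0:00 \_ mpiPi
"""

for host, pid, state in mpipi_states(sample):
    print(host, pid, state)
```

The mpirun line is skipped even though mpiPi appears among its arguments, because only the first word of the command is compared.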

Abnormal termination ( ABORT / BCAST )

 pi is approximately: 3.1415926535899030
Error is : 0.0000000000001101
=========================================================
Enter the number of intervals:
or Quit :
Master / Slaves
0 > FINALIZE / FINALIZE
-1 > FINALIZE / BCAST
-2 > ABORT / BCAST
0 > MPI_BCAST
-2
2 > MPI_BCAST
1 > MPI_BCAST
3 > MPI_BCAST
=========================================================
0 > MPI_BCAST
0 > MPI_ABORT
[jac-123:14677] MPI_ABORT invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode -2
mpirun noticed that job rank 1 with PID 13442 on node jac-122 exited on signal 15 (Terminated).
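2 additional processes aborted (not shown)

Abnormal termination - the master calls MPI_ABORT while the slaves wait in MPI_BCAST; mpirun terminates the remaining processes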

Error conditions

a) Segv

=========================================================
1 > MPI_BCAST
Enter the number of intervals: 
or Quit : 
Master / Slaves 
0 > FINALIZE / FINALIZE 
-1 > FINALIZE / BCAST 
-2 > ABORT / BCAST 
2 > MPI_BCAST
3 > MPI_BCAST
1000000
Enter 0 < nerr < 1000000 to provoke segv 
or 0 > nerr > -1000000 to call mpi_abort ( slave )
mpi_finalize ( master ) 
10003
Proc 1 error value 10003
Proc 3 error value 10003
Proc 2 error value 10003
=========================================================
Proc 0 error value 10003
2 > Segv 
[jac-121:03432] *** Process received signal ***
[jac-121:03432] Signal: Segmentation fault (11)
[jac-121:03432] Signal code: Address not mapped (1)
[jac-121:03432] Failing at address: 0x7c7aeb0
[jac-121:03432] [ 0] [0x57f440]
[jac-121:03432] [ 1] mpiPi(MAIN__+0x177) [0x80493a7]
[jac-121:03432] [ 2] mpiPi(main+0x39) [0x804921d]
[jac-121:03432] [ 3] /lib/libc.so.6(__libc_start_main+0xdc) [0x860f2c]
[jac-121:03432] [ 4] mpiPi [0x80491e1]
[jac-121:03432] *** End of error message ***
0 > MPI_BCAST
3 > MPI_BCAST
1 > MPI_BCAST
mpirun noticed that job rank 0 with PID 14977 on node jac-123 exited on signal 15 (Terminated). 
3 additional processes aborted (not shown)
>

Segmentation fault ( signal 11 ) in process 2 on jac-121; mpirun then terminates the job

b) Abort

=========================================================
Enter the number of intervals: 
or Quit : 
Master / Slaves 
0 > FINALIZE / FINALIZE 
-1 > FINALIZE / BCAST 
-2 > ABORT / BCAST 
3 > MPI_BCAST
2 > MPI_BCAST
1 > MPI_BCAST
1000000
Enter 0 < nerr < 1000000 to provoke segv 
or 0 > nerr > -1000000 to call mpi_abort ( slave )
mpi_finalize ( master ) 
-10004
Proc 1 error value -10004
Proc 2 error value -10004
Proc 3 error value -10004
=========================================================
Proc 0 error value -10004
1 > MPI_BCAST
2 > MPI_BCAST
0 > MPI_BCAST
3 > MPI_ABORT 
[jac-120:03448] MPI_ABORT invoked on rank 3 in communicator MPI_COMM_WORLD with errorcode 1
mpirun noticed that job rank 0 with PID 15222 on node jac-123 exited on signal 15 (Terminated). 
2 additional processes aborted (not shown)
>

Process 3 calls MPI_ABORT during the loop; mpirun then terminates the job
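In the abort logs the rank named by mpirun need not be the rank that caused the failure: in log b) rank 3 calls MPI_ABORT, while mpirun reports rank 0 as killed by signal 15. A small sketch that extracts both facts from such a log is given below; the two message formats are copied from the logs on this page, other versions of the MPI runtime may word them differently, and the name failure_summary is hypothetical.

```python
import re

# "[host:pid] MPI_ABORT invoked on rank R in communicator C with errorcode E"
ABORT = re.compile(
    r"\[(?P<host>[\w.-]+):(?P<pid>\d+)\] MPI_ABORT invoked on rank (?P<rank>\d+) "
    r"in communicator (?P<comm>\S+) with errorcode (?P<code>-?\d+)"
)
# "mpirun noticed that job rank R with PID P on node N exited on signal S"
NOTICE = re.compile(
    r"mpirun noticed that job rank (?P<rank>\d+) with PID (?P<pid>\d+) "
    r"on node (?P<node>[\w.-]+) exited on signal (?P<signal>\d+)"
)

def failure_summary(log_text):
    """Return (abort, notice): the first match of each message as a dict, or None."""
    def first(pattern):
        m = pattern.search(log_text)
        if m is None:
            return None
        # Convert purely numeric fields (including negative error codes) to int.
        return {k: int(v) if v.lstrip("-").isdigit() else v
                for k, v in m.groupdict().items()}
    return first(ABORT), first(NOTICE)

# The last lines of log b).
sample = """\
3 > MPI_ABORT
[jac-120:03448] MPI_ABORT invoked on rank 3 in communicator MPI_COMM_WORLD with errorcode 1
mpirun noticed that job rank 0 with PID 15222 on node jac-123 exited on signal 15 (Terminated).
2 additional processes aborted (not shown)
"""

abort, notice = failure_summary(sample)
print("aborting rank:", abort["rank"], "errorcode:", abort["code"])
print("rank reported by mpirun:", notice["rank"], "signal:", notice["signal"])
```

For the segv log a) the first element is None, since no process calls MPI_ABORT there; the faulting process has to be read from the "Process received signal" block instead.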

c) Finalize

Enter the number of intervals: 
or Quit : 
Master / Slaves 
0 > FINALIZE / FINALIZE 
-1 > FINALIZE / BCAST 
-2 > ABORT / BCAST 
3 > MPI_BCAST
2 > MPI_BCAST
1000000
Enter 0 < nerr < 1000000 to provoke segv 
or 0 > nerr > -1000000 to call mpi_abort ( slave )
mpi_finalize ( master ) 
-10001
=========================================================
Proc 0 error value -10001
Proc 2 error value -10001
0 > MPI_BCAST
0 > MPI_FINALIZE 
Proc 1 error value -10001
Proc 3 error value -10001
1 > MPI_BCAST
3 > MPI_BCAST
2 > MPI_BCAST
^Z
Suspended
>kill %2
>mpirun: killing job...

mpirun noticed that job rank 0 with PID 15288 on node jac-123 exited on signal 15 (Terminated). 
3 additional processes aborted (not shown)
jac-120
jconboy 5004    1  0 11:32 ? Ss 0:00 orted --bootproxy 1 --name 0.0.4 --num_procs 5 --vpid_start 0
jconboy 5005 5004 88 11:32 ? R 0:23 \_ mpiPi
jac-121
jconboy 4906    1  0 11:32 ? Ss 0:00 orted --bootproxy 1 --name 0.0.3 --num_procs 5 --vpid_start 0
jconboy 4907 4906 91 11:32 ? R 0:25 \_ mpiPi
jac-122
jconboy 15395     1  0 11:32 ? Ss 0:00 orted --bootproxy 1 --name 0.0.2 --num_procs 5 --vpid_start 0 
jconboy 15396 15395 97 11:32 ? R 0:28 \_ mpiPi
jac-123
jconboy 15276  2478 0 11:32 pts/5 S+ 0:00 | \_ mpirun -np 4 -host jac-123,jac-122,jac-121,jac-120 mpiPi
--
jconboy 15279     1 0 11:32 ? Ss 0:00 orted --bootproxy 1 --name 0.0.1 --num_procs 5 --vpid_start 0 
jconboy 15288 15279 0 11:32 ? S 0:00 \_ mpiPi

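Job hangs after process 0 calls MPI_FINALIZE during the loop: the slaves remain in MPI_BCAST ( the ps listing shows their mpiPi processes in state R ), and the job was suspended and killed by hand ( ^Z, kill %2 )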