I've read MFE, sections 19.13.6 "The stat= and
errmsg= specifiers in synchronization statements"
and 19.14 "Program termination".

I'm still confused about the differences
between STOP and ERROR STOP, and between
the normal and error terminations.

This test program:

use, intrinsic :: iso_fortran_env, only: stat_stopped_image
implicit none
real :: r(-1:10,-1:10)[0:4,*]
integer :: lcob(2), ucob(2), img, nimgs, nco, errstat=0

img = this_image()
nimgs = num_images()

lcob = lcobound(r)
ucob = ucobound(r)

if (this_image() .eq. 1) then
  write (*,"(a,i0,a)") "Running on ", nimgs, " images."
  write (*,"(4(a,i0),a)") "Cobounds: [", &
    lcob(1), ":", ucob(1), ",", lcob(2), ":", ucob(2), "]"
  nco = (ucob(1)-lcob(1)+1) * (ucob(2)-lcob(2)+1)
  write (*,"(a,i0)") "Number of unique cosubscript sets: ", nco
  if (nimgs .ne. nco) stop "num_images .ne.&
    & the number of unique cosubscript sets, aborting."
end if

sync all (stat = errstat)
if (errstat .eq. stat_stopped_image) error stop stat_stopped_image 

write (*,*) "Image", img, "is continuing."

end

checks on image 1 if the number of images
equals the number of unique cosubscript sets.
If not, then STOP is executed.
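
The number of cosubscript sets in that check can be worked out by hand,
without coarrays. The following is only a minimal sketch. It assumes the
last upper cobound equals ceiling(num_images/5) with a lower cobound of 1,
which is how I read the sizing of the final codimension for [0:4,*].
The program name and the variables coext1, nimgs_list and ucob2 are
illustrative and not part of the test program.

```fortran
program cosubscript_count
  implicit none
  ! Extent of the first codimension, taken from the declaration [0:4,*]
  integer, parameter :: coext1 = 5
  ! Image counts of the two runs reported in this post
  integer, parameter :: nimgs_list(2) = [32, 16]
  integer :: i, nimgs, ucob2, nco

  do i = 1, size(nimgs_list)
    nimgs = nimgs_list(i)
    ! Assumed rule: last upper cobound = ceiling(nimgs/coext1), lower cobound 1
    ucob2 = (nimgs + coext1 - 1) / coext1
    ! Number of unique cosubscript sets is the product of the coextents
    nco = coext1 * ucob2
    write (*,"(3(a,i0))") "images=", nimgs, " ucobound(2)=", ucob2, " sets=", nco
  end do
end program cosubscript_count
```

Under that assumption the count equals num_images only when num_images
is a multiple of 5, so both 32 and 16 images trigger the STOP on image 1.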

Then all images execute SYNC ALL (STAT = errstat),
which, according to MFE, should have the
effect "of executing the sync memory statement".

As far as I understand, the STOP statement executed
on image 1 initiates normal termination.
Because the sync all statement uses the stat= specifier,
all images synchronise at this point, but termination
is not completed here.
Instead the program can proceed further, and now
ERROR STOP is executed, which initiates error
termination, which, again according to the book,
"causes the whole calculation to stop as soon as possible".

Is this the correct analysis?

With the Cray compiler, on 32 cores (processors), I get:

 STOP num_images .ne. the number of unique cosubscript sets, aborting.
 STOP
Running on 32 images.
Cobounds: [0:4,1:7]
Number of unique cosubscript sets: 35
aprun: Apid 6349317: Caught signal Terminated, sending to application
_pmiu_daemon(SIGCHLD): [NID 01590] [c13-0c0s4n2] [Wed Dec  4 14:24:59 2013] PE RANK 2 exit signal Terminated
Application 6349317 exit codes: 143
Application 6349317 resources: utime ~949s, stime ~6s

I presume the second "STOP" is written by
the error stop stat_stopped_image statement, right?

Anyway, what bothers me is that the program spent
all of the allocated 15 min (utime ~949s) before
exiting following error stop.
Why didn't it exit earlier?


The same program compiled with Intel ifort v.14,
and executed on 16 cores on a single node (shared
memory) gives:

Running on 16 images.
Cobounds: [0:4,1:4]
Number of unique cosubscript sets: 20
num_images .ne. the number of unique cosubscript sets, aborting.
 Image           2 is continuing.
 Image           4 is continuing.
 Image           5 is continuing.
 Image          10 is continuing.
 Image           8 is continuing.
 Image           9 is continuing.
 Image          11 is continuing.
 Image          16 is continuing.
 Image           3 is continuing.
 Image           6 is continuing.
 Image          13 is continuing.
 Image           7 is continuing.
 Image          15 is continuing.
 Image          12 is continuing.
 Image          14 is continuing.

and then it hangs forever.
This is not correct behaviour, right?

Many thanks

Anton