Hi Anton,
On 12/4/13 8:57 AM, Anton Shterenlikht wrote:
> I've read MFE, sections 19.13.6 "The stat= and
> errmsg= specifiers in synchronization statements"
> and 19.14 "Program termination".
>
> I'm still confused about the differences
> between STOP and ERROR STOP, and between
> the normal and error terminations.
>
> This test program:
>
> use, intrinsic :: iso_fortran_env, only: stat_stopped_image
> implicit none
> real :: r(-1:10,-1:10)[0:4,*]
> integer :: lcob(2), ucob(2), img, nimgs, nco, errstat=0
>
> img = this_image()
> nimgs = num_images()
>
> lcob = lcobound(r)
> ucob = ucobound(r)
>
> if (this_image() .eq. 1) then
> write (*,"(a,i0,a)") "Running on ", nimgs, " images."
> write (*,"(4(a,i0),a)") "Cobounds: [", &
> lcob(1), ":", ucob(1), ",", lcob(2), ":", ucob(2), "]"
> nco = (ucob(1)-lcob(1)+1) * (ucob(2)-lcob(2)+1)
> write (*,"(a,i0)") "Number of unique cosubscript sets: ", nco
> if (nimgs .ne. nco) stop "num_images .ne.&
> & the number of unique cosubscript sets, aborting."
> end if
>
> sync all (stat = errstat)
> if (errstat .eq. stat_stopped_image) error stop stat_stopped_image
>
> write (*,*) "Image", img, "is continuing."
>
> end
>
> checks on image 1 if the number of images
> equals the number of unique cosubscript sets.
> If not, then STOP is executed.
>
> Then all images execute SYNC ALL (STAT = errstat),
> which, according to MFE, should have the
> effect "of executing the sync memory statement".
>
> As far as I understood, the STOP statement issued
> on image 1 will initiate normal termination.
> Because the sync all statement uses the stat= specifier,
> all images synchronise at this point, but termination
> is not completed here.
> Instead the program can proceed further, and now
> ERROR STOP is executed, which initiates error
> termination, which, again according to the book,
> "causes the whole calculation to stop as soon as possible".
>
> Is this the correct analysis?
>
> With the Cray compiler, on 32 cores (processors), I get:
>
> STOP num_images .ne. the number of unique cosubscript sets, aborting.
> STOP
> Running on 32 images.
> Cobounds: [0:4,1:7]
> Number of unique cosubscript sets: 35
> aprun: Apid 6349317: Caught signal Terminated, sending to application
> _pmiu_daemon(SIGCHLD): [NID 01590] [c13-0c0s4n2] [Wed Dec 4 14:24:59 2013] PE RANK 2 exit signal Terminated
> Application 6349317 exit codes: 143
> Application 6349317 resources: utime ~949s, stime ~6s
>
> I presume the second "STOP" is written by
> error stop stat_stopped_image statement, right?
>
> Anyway, what bothers me is that the program spent
> all of the allocated 15 min (utime ~949s) before
> exiting following error stop.
> Why didn't it exit earlier?
I suspect what happened is that the images other than 1 made it past the
"is there a stopped image" check in the SYNC ALL statement before image
1 executed STOP. This is a race condition that needs to be handled
inside the compiler-supplied PGAS library. Your site has filed a bug
for this.
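
One possible way to sidestep the race in the meantime, offered only as a
sketch and not tested here against either compiler: num_images() and the
cobounds have the same values on every image, so each image can evaluate
the test itself and execute ERROR STOP. No image then depends on noticing
that image 1 has stopped. The program name and the stop code 1 are chosen
purely for illustration.

```fortran
program cobound_check
  implicit none
  real    :: r(-1:10,-1:10)[0:4,*]
  integer :: lcob(2), ucob(2), img, nimgs, nco

  img   = this_image()
  nimgs = num_images()
  lcob  = lcobound(r)
  ucob  = ucobound(r)

  ! Every image computes the same count, so every image reaches
  ! the same decision without waiting on image 1.
  nco = (ucob(1)-lcob(1)+1) * (ucob(2)-lcob(2)+1)

  if (nimgs /= nco) then
    ! Only image 1 prints the diagnostic, to avoid repeated output.
    if (img == 1) then
      write (*,"(a,i0,a,i0)") "num_images = ", nimgs, &
        ", unique cosubscript sets = ", nco
    end if
    ! ERROR STOP initiates error termination of all images.
    error stop 1
  end if

  sync all
  write (*,*) "Image", img, "is continuing."
end program cobound_check
```

Because ERROR STOP requests termination of all images, the trailing SYNC ALL
is reached only when the image count matches, and it no longer needs a
stat= specifier to detect a stopped image.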
Cheers,
Bill
>
>
> The same program compiled with Intel ifort v.14,
> and executed on 16 cores on a single node (shared
> memory) gives:
>
> Running on 16 images.
> Cobounds: [0:4,1:4]
> Number of unique cosubscript sets: 20
> num_images .ne. the number of unique cosubscript sets, aborting.
> Image 2 is continuing.
> Image 4 is continuing.
> Image 5 is continuing.
> Image 10 is continuing.
> Image 8 is continuing.
> Image 9 is continuing.
> Image 11 is continuing.
> Image 16 is continuing.
> Image 3 is continuing.
> Image 6 is continuing.
> Image 13 is continuing.
> Image 7 is continuing.
> Image 15 is continuing.
> Image 12 is continuing.
> Image 14 is continuing.
>
> and then it hangs forever.
> This is not correct behaviour, right?
>
> Many thanks
>
> Anton
>
--
Bill Long
Fortran Technical Support & Bioinformatics Software Development
voice: 651-605-9024
fax: 651-605-9142
Cray Inc./Cray Plaza, Suite 210/380 Jackson St./St. Paul, MN 55101