On 12/4/2013 9:57 AM, Anton Shterenlikht wrote:
> I've read MFE, sections 19.13.6 "The stat= and
> errmsg= specifiers in synchronization statements"
> and 19.14 "Program termination".
>
> I'm still confused about the differences
> between STOP and ERROR STOP, and between
> the normal and error terminations.
>
> This test program:
>
> use, intrinsic :: iso_fortran_env, only: stat_stopped_image
> implicit none
> real :: r(-1:10,-1:10)[0:4,*]
> integer :: lcob(2), ucob(2), img, nimgs, nco, errstat=0
>
> img = this_image()
> nimgs = num_images()
>
> lcob = lcobound(r)
> ucob = ucobound(r)
>
> if (this_image() .eq. 1) then
> write (*,"(a,i0,a)") "Running on ", nimgs, " images."
> write (*,"(4(a,i0),a)") "Cobounds: [", &
> lcob(1), ":", ucob(1), ",", lcob(2), ":", ucob(2), "]"
> nco = (ucob(1)-lcob(1)+1) * (ucob(2)-lcob(2)+1)
> write (*,"(a,i0)") "Number of unique cosubscript sets: ", nco
> if (nimgs .ne. nco) stop "num_images .ne.&
> & the number of unique cosubscript sets, aborting."
> end if
>
> sync all (stat = errstat)
> if (errstat .eq. stat_stopped_image) error stop stat_stopped_image
>
> write (*,*) "Image", img, "is continuing."
>
> end
>
> checks on image 1 if the number of images
> equals the number of unique cosubscript sets.
> If not, then STOP is executed.
>
> Then all images execute SYNC ALL (STAT = errstat),
> which, according to MFE, should have the
> effect "of executing the sync memory statement".
>
> As far as I understood, the STOP statement issued
> on image 1 will initiate normal termination.
> Because the sync all statement uses the stat= specifier,
> all images synchronise at this point, but termination
> is not completed here.
> Instead the program can proceed further, and now
> ERROR STOP is executed, which initiates error
> termination, which, again according to the book,
> "causes the whole calculation to stop as soon as possible".
>
> Is this the correct analysis?
>
> With Cray compiler, on 32 cores (processors), I get:
>
> STOP num_images .ne. the number of unique cosubscript sets, aborting.
> STOP
> Running on 32 images.
> Cobounds: [0:4,1:7]
> Number of unique cosubscript sets: 35
> aprun: Apid 6349317: Caught signal Terminated, sending to application
> _pmiu_daemon(SIGCHLD): [NID 01590] [c13-0c0s4n2] [Wed Dec 4 14:24:59 2013] PE RANK 2 exit signal Terminated
> Application 6349317 exit codes: 143
> Application 6349317 resources: utime ~949s, stime ~6s
>
> I presume the second "STOP" is written by
> error stop stat_stopped_image statement, right?
>
> Anyway, what bothers me is that the program spent
> all of the allocated 15 min (utime ~949s) before
> exiting following error stop.
> Why didn't it exit earlier?
>
>
> The same program compiled with Intel ifort v.14,
> and executed on 16 cores on a single node (shared
> memory) gives:
>
> Running on 16 images.
> Cobounds: [0:4,1:4]
> Number of unique cosubscript sets: 20
> num_images .ne. the number of unique cosubscript sets, aborting.
> Image 2 is continuing.
> Image 4 is continuing.
> Image 5 is continuing.
> Image 10 is continuing.
> Image 8 is continuing.
> Image 9 is continuing.
> Image 11 is continuing.
> Image 16 is continuing.
> Image 3 is continuing.
> Image 6 is continuing.
> Image 13 is continuing.
> Image 7 is continuing.
> Image 15 is continuing.
> Image 12 is continuing.
> Image 14 is continuing.
>
> and then it hangs forever.
> This is not correct behaviour, right?
>
> Many thanks
>
> Anton
$ ifort as.f90
Intel(R) Visual Fortran Intel(R) 64 Compiler XE for applications running
on Intel(R) 64, Version 14.0.1.139 Build 20131008
It appears either to fail to report the last image it started, or to
fail to finish starting the next image, once it has started images on
all available cores:
Running on 4 images.
Cobounds: [0:4,1:1]
Number of unique cosubscript sets: 5
num_images .ne. the number of unique cosubscript sets, aborting.
Image 3 is continuing.
Image 4 is continuing.
Image 2 is continuing.
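
If the intent is for every image to stop when the image count does not
match the cobounds, one option is to let every image make the check and
use ERROR STOP, which initiates error termination on all images, rather
than STOP on image 1 alone. The following is only a minimal sketch under
that assumption; it has not been run with either compiler mentioned in
this thread, and the program name check_images and the stop message are
chosen purely for illustration:

```fortran
program check_images
  implicit none
  real    :: r(-1:10,-1:10)[0:4,*]
  integer :: lcob(2), ucob(2), nimgs, nco

  nimgs = num_images()
  lcob  = lcobound(r)
  ucob  = ucobound(r)

  ! Number of cosubscript sets implied by the cobounds of r.
  nco = (ucob(1)-lcob(1)+1) * (ucob(2)-lcob(2)+1)

  ! Every image evaluates the same test, so no image is left waiting
  ! in SYNC ALL for an image that has already stopped.
  ! ERROR STOP initiates error termination of all images.
  if (nimgs /= nco) error stop "num_images /= number of cosubscript sets"

  sync all

  write (*,"(a,i0,a)") "Image ", this_image(), " is continuing."
end program check_images
```

With 5 or 10 images the check passes and each image prints its line;
with 4 images (as above) every image should reach the ERROR STOP.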
--
Tim Prince