Tim Prince wrote:
> On 12/4/2013 9:57 AM, Anton Shterenlikht wrote:
>> I've read MFE, sections 19.13.6 "The stat= and
>> errmsg= specifiers in synchronization statements"
>> and 19.14 "Program termination".
>>
>> I'm still confused about the differences
>> between STOP and ERROR STOP, and between
>> the normal and error terminations.
>>
>> This test program:
>>
>> use, intrinsic :: iso_fortran_env, only: stat_stopped_image
>> implicit none
>> real :: r(-1:10,-1:10)[0:4,*]
>> integer :: lcob(2), ucob(2), img, nimgs, nco, errstat=0
>>
>> img = this_image()
>> nimgs = num_images()
>>
>> lcob = lcobound(r)
>> ucob = ucobound(r)
>>
>> if (this_image() .eq. 1) then
>> write (*,"(a,i0,a)") "Running on ", nimgs, " images."
>> write (*,"(4(a,i0),a)") "Cobounds: [", &
>> lcob(1), ":", ucob(1), ",", lcob(2), ":", ucob(2), "]"
>> nco = (ucob(1)-lcob(1)+1) * (ucob(2)-lcob(2)+1)
>> write (*,"(a,i0)") "Number of unique cosubscript sets: ", nco
>> if (nimgs .ne. nco) stop "num_images .ne.&
>> & the number of unique cosubscript sets, aborting."
>> end if
>>
>> sync all (stat = errstat)
>> if (errstat .eq. stat_stopped_image) error stop stat_stopped_image
>>
>> write (*,*) "Image", img, "is continuing."
>>
>> end
>>
>> checks on image 1 if the number of images
>> equals the number of unique cosubscript sets.
>> If not, then STOP is executed.
>>
>> Then all images execute SYNC ALL (STAT = errstat),
>> which, according to MFE, should have the
>> effect "of executing the sync memory statement".
>>
>> As far as I understood, the STOP statement issued
>> on image 1 will initiate normal termination.
Yes. Image 1 stops.
>> Because the sync all statement uses the stat= specifier,
>> all images synchronise at this point, but termination
>> is not completed here.
>> Instead the program can proceed further, and now
>> ERROR STOP is executed, which initiates error
>> termination, which, again according to the book,
>> "causes the whole calculation to stop as soon as possible".
>>
>> Is this the correct analysis?
Yes.
>> With Cray compiler, on 32 cores (processors), I get:
>>
>> STOP num_images .ne. the number of unique cosubscript sets, aborting.
>> STOP
>> Running on 32 images.
>> Cobounds: [0:4,1:7]
>> Number of unique cosubscript sets: 35
>> aprun: Apid 6349317: Caught signal Terminated, sending to application
>> _pmiu_daemon(SIGCHLD): [NID 01590] [c13-0c0s4n2] [Wed Dec 4 14:24:59
>> 2013] PE RANK 2 exit signal Terminated
>> Application 6349317 exit codes: 143
>> Application 6349317 resources: utime ~949s, stime ~6s
>>
>> I presume the second "STOP" is written by the
>> error stop stat_stopped_image statement, right?
Yes. The standard is vague about exactly what is output on a stop, so I
think this output conforms to the standard.
>> Anyway, what bothers me is that the program spent
>> all of the allocated 15 min (utime ~949s) before
>> exiting following error stop.
>> Why didn't it exit earlier?
Good question. Tell Cray.
>> The same program compiled with Intel ifort v.14,
>> and executed on 16 cores on a single node (shared
>> memory) gives:
>>
>> Running on 16 images.
>> Cobounds: [0:4,1:4]
>> Number of unique cosubscript sets: 20
>> num_images .ne. the number of unique cosubscript sets, aborting.
>> Image 2 is continuing.
>> Image 4 is continuing.
>> Image 5 is continuing.
>> Image 10 is continuing.
>> Image 8 is continuing.
>> Image 9 is continuing.
>> Image 11 is continuing.
>> Image 16 is continuing.
>> Image 3 is continuing.
>> Image 6 is continuing.
>> Image 13 is continuing.
>> Image 7 is continuing.
>> Image 15 is continuing.
>> Image 12 is continuing.
>> Image 14 is continuing.
>>
>> and then it hangs forever.
>> This is not correct behaviour, right?
Yes, I agree. I have the Intel compiler on a 4-core machine. I added a
write statement immediately after the sync all, which told me that
errstat was set to zero. This is a bug, I think. I will tell Intel.
Best wishes,
John Reid.
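
The diagnostic described in the reply (a write statement placed immediately after the sync all) could look roughly like the sketch below. This is an illustration only, not the code actually used: the program name, the messages and the output format are assumptions. It relies on the Fortran 2008 rule that SYNC ALL with a stat= specifier should set the variable to stat_stopped_image when another image has already initiated normal termination.

```fortran
program check_sync_stat
  ! Hypothetical reduced test case; names and messages are illustrative.
  use, intrinsic :: iso_fortran_env, only: stat_stopped_image
  implicit none
  integer :: errstat

  errstat = 0

  ! Image 1 initiates normal termination; the other images carry on.
  if (this_image() == 1) stop "image 1 stopping"

  sync all (stat = errstat)

  ! Diagnostic: report what SYNC ALL stored in errstat on each image.
  write (*,"(a,i0,a,i0,a,i0)") "Image ", this_image(), &
    ": errstat = ", errstat, &
    ", stat_stopped_image = ", stat_stopped_image

  ! Only reached with a matching status if the stopped image was detected.
  if (errstat == stat_stopped_image) error stop "a stopped image was detected"
end program check_sync_stat
```

If every surviving image prints errstat = 0 when run on more than one image, the stopped image was not reported, which matches the behaviour described for the Intel compiler above.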