On Dec 1, 2014, at 7:56 AM, John Reid <[log in to unmask]> wrote:
> Anton Shterenlikht wrote:
>> I think I might have asked this question about a year ago.
>> There has been at least one new version of both the Cray
>> and Intel compilers since then.
>> I'm still not sure what behaviour is correct.
>>
>> The program:
>>
>> use, intrinsic :: iso_fortran_env
>> implicit none
>> integer :: errstat=0
>> if ( this_image() .eq. 1 ) stop "kuku"
>> sync all ( stat=errstat )
>> if ( errstat .eq. stat_stopped_image) write (*,*) "mumu"
>> end
>>
>>
>> My understanding from MFE sec. 19.13.6
>> and FDIS (10-007r1) sec 8.5.7 par 2 is that
>> the correct behaviour of this program is:
>>
>> 1. Image 1 initiates normal termination
>> 2. At "sync all (stat=errstat)", errstat
>> becomes defined with "stat_stopped_image"
>> on all images except image 1.
>> 3. Hence all images except image 1 must output "mumu".
Yes, this is what should happen. At least for the Cray compiler, a bug report has already been filed for this issue.
The implementation challenge here is that image 2 can start executing SYNC ALL before image 1 has executed STOP. Initially, therefore, image 2 is not aware that image 1 will stop and not take part in the synchronization. Image 2 needs to wait for a stop signal to arrive, or to periodically query whether another image has stopped. This inquiry needs some sort of time-out check, so that after a period of time image 2 gives up and returns a status indicating a time-out. A reasonable value for that wait time would depend on the nature of the program, so the user probably needs a way to set it, such as an environment variable.
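The waiting strategy described above can be sketched as a polling loop with a time-out. This is a minimal sketch under stated assumptions, not how the Cray or Intel runtimes actually work: the module, the WAIT_* codes and the SYNC_TIMEOUT_SECONDS environment variable are all hypothetical, and the two probe functions stand in for whatever queries a real runtime would make about the other image.

```fortran
! Sketch of a time-out-guarded wait, as a SYNC ALL runtime might use it.
! All names here are hypothetical; no real runtime API is implied.
module sync_wait_sketch
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  private
  public :: probe, wait_for_peer, read_timeout
  public :: WAIT_OK, WAIT_STOPPED, WAIT_TIMEOUT

  ! Hypothetical outcome codes for the wait.
  integer, parameter :: WAIT_OK = 0       ! peer reached the sync point
  integer, parameter :: WAIT_STOPPED = 1  ! peer has initiated termination
  integer, parameter :: WAIT_TIMEOUT = 2  ! gave up waiting

  ! A query the runtime can make about the peer image.
  abstract interface
    logical function probe()
    end function probe
  end interface

contains

  ! Time-out in seconds from the hypothetical variable SYNC_TIMEOUT_SECONDS,
  ! or the fallback if it is unset, too long, or not a number.
  real function read_timeout(fallback) result(secs)
    real, intent(in) :: fallback
    character(len=32) :: buf
    integer :: length, status, ios
    secs = fallback
    call get_environment_variable("SYNC_TIMEOUT_SECONDS", buf, length, status)
    if (status == 0 .and. length > 0) then
      read (buf(1:length), *, iostat=ios) secs
      if (ios /= 0) secs = fallback
    end if
  end function read_timeout

  ! Poll until the peer arrives, is seen to have stopped, or the time-out
  ! expires. This busy-waits; a real runtime would sleep or back off.
  integer function wait_for_peer(arrived, stopped, timeout) result(code)
    procedure(probe) :: arrived, stopped
    real, intent(in) :: timeout
    integer(int64) :: start, now, rate
    call system_clock(start, rate)
    do
      if (arrived()) then
        code = WAIT_OK
        return
      end if
      if (stopped()) then
        code = WAIT_STOPPED
        return
      end if
      ! rate <= 0 means no clock is available: give up after one poll.
      if (rate <= 0) then
        code = WAIT_TIMEOUT
        return
      end if
      call system_clock(now)
      if (real(now - start) / real(rate) >= timeout) then
        code = WAIT_TIMEOUT
        return
      end if
    end do
  end function wait_for_peer

end module sync_wait_sketch
```

In the SYNC ALL case from this thread, WAIT_STOPPED is the outcome that would correspond to STAT_STOPPED_IMAGE. WAIT_TIMEOUT has no named counterpart in the standard, so it would presumably be reported as some other processor-dependent STAT= value.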
Cheers,
Bill
>>
>> The critical point, for me at least, is that
>> with the use of (stat=errstat) the programmer can
>> avoid error termination and continue until
>> the end of the program on all images.
>>
>> Am I wrong?
>
> No, I think you are right and both compilers have bugs. This is what the standard says (8.5.7):
>
> If the STAT= specifier appears in a SYNC ALL or SYNC IMAGES statement and execution of one of these statements involves synchronization with an image that has initiated termination, the variable becomes defined with the value of the constant STAT_STOPPED_IMAGE (13.8.2.24) in the intrinsic module ISO_FORTRAN_ENV (13.8.2), and the effect of executing the statement is otherwise the same as that of executing the SYNC MEMORY statement.
>
> Best wishes,
>
> John Reid.
>
>>
>> However, with these two compilers I see:
>>
>> *************************
>> Intel 15.0.0 20140723:
>>
>> kuku
>>
>> Then nothing.
>> After waiting a minute, terminated with CTRL/C.
>>
>>
>> *************************
>> Cray 8.3.3:
>>
>>
>> STOP kuku
>> PE 2: ERROR: at least one image in current team is stopped (at or around line 5 in $main_() from file /home3/e347/e347/mexas/z.f90)
>> PE 3: ERROR: at least one image in current team is stopped (at or around line 5 in $main_() from file /home3/e347/e347/mexas/z.f90)
>> Application 12075703 is crashing. ATP analysis proceeding...
>>
>> ATP Stack walkback for Rank 2 starting:
>> [log in to unmask]:113
>> [log in to unmask]:242
>> [log in to unmask]:5
>> __pgas_sync_all@0x40921f
>> libpgas::Interface_Impl<false>::__pgas_sync_all(void*) const@0x4096e0
>> libpgas::Log::error(char const*, ...)@0x403dff
>> [log in to unmask]:92
>> [log in to unmask]:42
>> ATP Stack walkback for Rank 2 done
>> Process died with signal 6: 'Aborted'
>> Forcing core dumps of ranks 2, 1
>>
>>
>> Which of the behaviours is correct, if any?
>>
>> Thanks
>>
>> Anton
>>
Bill Long [log in to unmask]
Fortran Technical Support & voice: 651-605-9024
Bioinformatics Software Development fax: 651-605-9142
Cray Inc./ Cray Plaza, Suite 210/ 380 Jackson St./ St. Paul, MN 55101