On Dec 2, 2014, at 3:19 AM, Anton Shterenlikht wrote:
> I wanted to check whether I understand the
> logic of TS 18508 regarding failed and stalled
> images.
>
> My understanding is that trying to communicate
> with a failed image will make the invoking
> image stalled (sec. 5.9). Is that correct?
>
It depends on what you mean by “communicate”. If image X executes a statement that has a STAT specifier (such as SYNC ALL) and image Y in the team has failed, then the STAT value returned on X will indicate a failed image (unless there is also a stopped image in the team, in which case STAT_STOPPED_IMAGE wins). If the SYNC ALL statement does not have a STAT specifier and there is a failed image in the team, the program aborts.

There is one case that has no syntax for detecting a failed image (or other problems that could be reported through a STAT): simple references or definitions, such as u = w[F] where image F has failed. It is for this case that the separate STALLED class is used. If image S tries to reference or define a variable on image F using the [ ] notation, and image F has failed, then image S becomes stalled.
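
For illustration, the STAT case might look like the following minimal sketch. It assumes a compiler that provides STAT_FAILED_IMAGE and STAT_STOPPED_IMAGE in ISO_FORTRAN_ENV; the program name and the messages are made up for the example.

```fortran
program sync_all_with_stat
  ! Assumption: the compiler supplies the failed-image constants
  ! in the intrinsic module ISO_FORTRAN_ENV.
  use, intrinsic :: iso_fortran_env, only: stat_failed_image, stat_stopped_image
  implicit none
  integer :: istat

  ! Because STAT= is present, a failed image in the team is reported
  ! through istat rather than aborting the program.
  sync all (stat=istat)

  if (istat == stat_stopped_image) then
    ! A stopped image takes precedence over a failed one.
    print *, 'image', this_image(), ': a stopped image was detected'
  else if (istat == stat_failed_image) then
    print *, 'image', this_image(), ': a failed image was detected'
  else if (istat /= 0) then
    print *, 'image', this_image(), ': other error, stat =', istat
  end if
end program sync_all_with_stat
```

A plain reference such as u = w[F] has no place to put a STAT= specifier, which is why it cannot be handled this way.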
> To prevent this one can use e.g. FAILED_IMAGES
> to know if any images have failed, and then
> avoid communicating with those (sec. 7.6.14).
> Is that correct?
That is a valid approach, though it would probably involve significant overhead if you did that before every reference. FAILED_IMAGES is mainly useful in the part of the code that is evaluating how (or if) to recover from a detected failure - i.e. code that would be executed rarely. If you wanted to check on the state of a particular image, for example before a loop that involved many references or a large transfer to that image, the IMAGE_STATUS function is a better option.
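
A rough sketch of the IMAGE_STATUS check before a loop of remote references, assuming a compiler that implements IMAGE_STATUS and STAT_FAILED_IMAGE. The array size, the variable names, and the choice of partner image are hypothetical, and an image could still fail after the check has passed.

```fortran
program check_before_transfer
  ! Assumption: IMAGE_STATUS and STAT_FAILED_IMAGE are available.
  use, intrinsic :: iso_fortran_env, only: stat_failed_image
  implicit none
  integer, parameter :: n = 1000     ! hypothetical transfer size
  real    :: w(n)[*]                 ! coarray that is read remotely
  real    :: u(n)                    ! local destination
  integer :: partner, i, istat

  w = real(this_image())
  sync all (stat=istat)              ! STAT= so a failed image does not abort the run

  ! Hypothetical partner: the next image, wrapping around to image 1.
  partner = merge(1, this_image() + 1, this_image() == num_images())

  ! One status query before the loop instead of one per reference.
  if (image_status(partner) == stat_failed_image) then
    print *, 'image', this_image(), ': partner', partner, 'has failed, skipping'
  else
    do i = 1, n
      u(i) = w(i)[partner]
    end do
  end if
end program check_before_transfer
```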
>
> It is not clear whether communication with a
> stalled image is possible. Is the idea that
> all coarray data can be copied from a stalled
> image to some reserve image, similar to A.1.2?
A stalled image does not execute any further statements, at least until the other images reach the END TEAM statement, at which point the stalled image can come back to life. Again, the answer depends on the scope of “communicate”. If other images try to execute a SYNC ALL, for example, the stalled image will not participate. On the other hand, remote reference and definition of variables on a stalled image should succeed, including atomic modifications. Operationally, a stalled image is similar to a STOPPED image. Remote reference and definition still work, but the stalled/stopped image is not executing program statements, so it will not execute statements that are supposed to be executed collectively, except for END TEAM (for stalled) or STOP/END PROGRAM (for stopped).
Cheers,
Bill
>
> Thanks
>
> Anton
Bill Long
Fortran Technical Support & voice: 651-605-9024
Bioinformatics Software Development fax: 651-605-9142
Cray Inc./ Cray Plaza, Suite 210/ 380 Jackson St./ St. Paul, MN 55101