Hi Anton,
On Apr 10, 2015, at 7:24 AM, Anton Shterenlikht <[log in to unmask]> wrote:
> This is a relatively large coarray/MPI program.
> Running on ARCHER, Cray XC30, archer.ac.uk.
>
> The program apparently deadlocks at >1000 cores.
>
> I understand a standard conforming coarray program
> can never deadlock.
> Is that correct?
Well, not exactly. A properly written one should not deadlock. But you could intentionally write a code that hangs. For example, image 1 executes SYNC IMAGES (2) and image 2 never executes a corresponding SYNC IMAGES statement. Image 1 will wait “forever” for image 2. Similarly, you can hang an MPI program by having one rank call MPI_Recv with no rank calling a corresponding MPI_Send.
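For instance, a minimal sketch of the coarray case (hypothetical code, just to show the mechanics; assumes at least 2 images):

   program hang
      implicit none
      integer :: flag[*] = 0
      if ( this_image() == 1 ) then
         sync images ( 2 )       ! waits for a matching SYNC IMAGES on image 2
      else if ( this_image() == 2 ) then
         spin: do                ! spin on a flag that no image ever sets, so
            sync memory          ! the matching SYNC IMAGES below is never
            if ( flag == 1 ) exit spin  ! reached
         end do spin
         sync images ( 1 )       ! never executed; image 1 waits "forever"
      end if
   end program hang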
>
> Is this statement still true in a coarray/MPI program?
The last time I looked, the MPI Forum had not updated its rules to cover interaction with coarrays. However, people have been mixing coarrays and MPI for many years. The basic rule is to write the code in “phases” that are either all MPI or all coarrays, ending each MPI phase with MPI_Barrier and each coarray phase with SYNC ALL. Each side knows nothing about the other’s traffic: MPI_Barrier, for example, would not normally perform memory syncs for coarray operations, since MPI has no information about them.
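A sketch of that phase discipline (the particular calls and names here are only for illustration, assuming the standard MPI Fortran bindings):

   subroutine phases( buf, n )
      use mpi
      implicit none
      integer, intent( in ) :: n
      real, intent( inout ) :: buf(n)
      real, save :: ca[*]
      integer :: ierr
      ! --- MPI phase: MPI communication only ---
      call MPI_Allreduce( MPI_IN_PLACE, buf, n, MPI_REAL, MPI_SUM, &
                          MPI_COMM_WORLD, ierr )
      call MPI_Barrier( MPI_COMM_WORLD, ierr )  ! close the MPI phase
      ! --- coarray phase: coarray references only ---
      if ( this_image() == 1 ) ca = buf(1)
      sync all                           ! image 1's definition now visible
      if ( this_image() /= 1 ) buf(1) = ca[1]
      sync all                           ! close the coarray phase
   end subroutine phases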
>
> The complete program is too large to reproduce here,
> but the key fragment is:
>
> sync all
> call sub( a , b , c , d )
> write (*,*) "image:", this_image(), "sub done"
> sync all
> write (*,*) "image:: ", this_image(), "passed sync all 2"
>
> All arguments to subroutine sub are INTENT(IN).
> Subroutine sub reads a coarray variable defined
> in the previous segment. Subroutine sub does not
> update this coarray variable. I believe this means
> the program is standard conforming (as far as coarray
> rules are concerned).
> Is that correct?
So far, it looks OK.
>
> The program works as expected at low core counts,
> and most times at 1500 cores. It never works at >10k cores.
> At runtime I get "image ... sub done" output from
> the vast majority of images, >99%, but not all of them.
> Then the program seems to stall until the queue time expires.
> No images get to the second write statement, so
> this means the program is stalling at the second SYNC ALL.
Usually, if a code is OK at 1000 images but hangs at much larger image counts, there is something other than a standard violation happening at large scale.
>
> The full text of subroutine sub is:
>
> subroutine sub( origin, rot, bcol, bcou )
> real( kind=rdef ), intent( in ) :: &
> origin(3), & ! origin of the "box" cs, in FE cs
> rot(3,3), & ! rotation tensor *from* FE cs *to* CA cs
> bcol(3), & ! lower phys. coords of the coarray on image
> bcou(3) ! upper phys. coords of the coarray on image
> integer :: errstat, i, j, nimgs, nelements
> real( kind=cgca_pfem_iwp ) :: cen_ca(3) ! 3D case only
> nimgs = num_images()
> allocate( lcentr( 0 ), stat=errstat )
> if ( errstat .ne. 0 ) error stop &
> "ERROR: cgca_pfem_cenc: cannot allocate( lcentr )"
> images: do i=1, nimgs
> nelements = size( cgca_pfem_centroid_tmp[i]%r, dim=2 )
> elements: do j = 1, nelements
> cen_ca = matmul( rot, cgca_pfem_centroid_tmp[i]%r(:,j) - origin )
This statement throws up big red flags. You have an outer loop that runs i = 1..num_images() more or less concurrently on every image, with each iteration referencing image i. Thus, for i = 1, every image in the program will be trying to access image 1 at the same time. For i = 2, all the images suddenly pounce on image 2. This sort of construct does not scale well. For a large number of images, it can create considerable congestion of the network at the point where the target image is attached.
Note that you could create this same sort of congestion using MPI. This is a general parallel programming consideration. Try to avoid creating hot spots.
If the algorithm allows (it appears so here), it is better to spread the accesses out more uniformly. Options include executing the iterations of the images: loop in a random order that is different on each image, or offsetting the start of the loop so it runs this_image()+1..num_images() with wrap-around to 1..this_image() at the end, as sketched below.
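The offset variant of your loop might look like this (a sketch only; k is a new integer loop counter, everything else as declared in your subroutine):

   images: do k = 1, nimgs
      ! start at this_image()+1 and wrap around, so every target image
      ! is still visited exactly once, but no two images start on the
      ! same target
      i = mod( this_image() + k - 1, nimgs ) + 1
      nelements = size( cgca_pfem_centroid_tmp[i]%r, dim=2 )
      elements: do j = 1, nelements
         cen_ca = matmul( rot, cgca_pfem_centroid_tmp[i]%r(:,j) - origin )
         if ( all( cen_ca .ge. bcol ) .and. all( cen_ca .le. bcou ) ) &
            lcentr = (/ lcentr, mcen( i, j, cen_ca ) /)
      end do elements
   end do images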
> if ( all( cen_ca .ge. bcol ) .and. all( cen_ca .le. bcou ) ) &
> lcentr = (/ lcentr, mcen( i, j, cen_ca ) /)
> end do elements
> end do images
> end subroutine sub
>
> There are several module variables, but you can see that
> there is only a single coarray var, that is being read,
> not written. So I think the coarray segment rules are not
> violated.
No problem on that front.
>
> I wanted to confirm here that my understanding of the
> segment rules was correct and the fragment is indeed
> standard conforming.
>
> I wonder if the use of MPI in other parts of the program
> can have any effect on this seeming deadlock behaviour?
Probably not, especially if there is no issue at 1500 images.
Cheers,
Bill
>
> Any other advice?
>
> Does this sound like I need to submit a problem report?
>
> Thanks
>
> Anton
Bill Long [log in to unmask]
Fortran Technical Support & voice: 651-605-9024
Bioinformatics Software Development fax: 651-605-9142
Cray Inc./ Cray Plaza, Suite 210/ 380 Jackson St./ St. Paul, MN 55101