Hi Fortran experts,
I have a question for those versed in compiler cache optimizations. I
have a specific matrix product to compute that looks like a stencil
operation in PDE solvers. I have at least two ways of doing it, and I am
wondering which is faster in terms of Fortran compiler optimization in
memory access and usage. I am using Lahey Fortran 95.
Assume we have arrays all of size L*L, called H, V and R. The last one
is the result of this operation I need to perform, the rest are inputs.
Let Tx and Ty be a similar temporary arrays.
I. First option:
This option computes R in two steps, both of which are stencil-like
operations:
1. Compute temporary arrays in something like (NOT strict Fortran):
FORALL i and j:
Tx(i,j) = H(i,j)*[ V(i,j) - V(i,j+1) ]
Ty(i,j) = H(i,j)*[ V(i+1,j) - V(i,j) ]
END FORALL
2. Now compute the result R in a stencil-like operation using Tx and
Ty (NOTE: We could put Tx and Ty into a single array, say of size 2N*N
if this can help the compiler set the stride constants right):
FORALL i and j:
R(i,j) = Tx(i,j-1) - Tx(i,j) + Ty(i,j) - Ty(i-1,j)
END FORALL
This approach has the disadvantage of needing the temporary arrays, but
these can of course be eliminated by merging the two FORALL's (is this
better for speed?):
FORALL i and j:
R(i,j) = H(i,j)*[ V(i+1,j) + V(i,j+1) - 2*V(i,j) ] + H(i,j-1)*[
V(i,j-1) - V(i,j) ] - H(i-1,j)*[ V(i,j) - V(i-1,j) ]
END FORALL
I did not do this above to clarify the process better (it is clear,
right?) and also, I will need two or more temprorary arrays or more
anyway in the rest of the algorithm, so that is not a problem (also, it
is better for ease of use in the rest of the algorithm to have the two
FORALL's separate). To me it seems like something compilers should be
able to optimize well?
II. Second Option
This approach merges the two FORALL's automatically and uses a slightly
different kind of memeory acces. It is the preferred choice in an
irregular case (when not all i and j are allowed and certain terms above
appear or dissappear irregularly), but in the regular case I am not
sure:
FORALL i and j:
! This is like Tx above:
temp = H(i,j)*[ V(i,j) - V(i,j+1) ]
R(i,j) = R(i,j) - temp
R(i+1,j) = R(i+1,j) + temp
! This is like Ty above:
temp = H(i,j)*[ V(i+1,j) - V(i,j) ]
R(i,j) = R(i,j) - temp
R(i,j+1) = R(i,j+1) + temp ! SOME signs MAY be wrong, I am too
tired to think carefully
END FORALL
In a sense, the two approaches compute the the first FORALL in the same
way, but approach the second differently (in terms of the specific
network optimization algorihtm I am working on, one corresponds to
traversing the network via edges, the other via the nodes).
So, which approach is preffered for full matrices H, V and R? They seem
almost identical to me!
Thanks a lot,
Aleksandar
--
_____________________________________________
Aleksandar Donev
[log in to unmask]
Physics Department
Michigan State University
East Lansing, MI 48825
(517) 432-6770
_____________________________________________
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|