This is an interesting question, and there is one aspect that isn't
addressed in this response. The reference to MATMUL almost certainly
results in a call to a library procedure that isn't inlined. I've
always believed (I don't know where I picked this up, or if it's even
true) that procedure references usually end up trashing cache contents
because there is actually a lot of stuff that goes on behind the scenes
during procedure calls and returns (saving/restoring the stack, etc.).
Could someone comment on what impact procedure references have on cache
contents? I'd hope that for pure intrinsic procedures such as the one
above it would be minimal, but it's not clear to me that this is the case.

A lot depends on processor architecture, F90 implementation, and
particulars of the declarations of actual and dummy arguments.
Modern-day processors often have separate caches for instructions
(I-cache) and data (D-cache), at least for a couple levels. Calling a
procedure will surely execute a different set of instructions and
therefore interact with the I-cache. So it depends on the size of the
resident code of the caller, the resident code of the callee, and the
size of the I-cache. Indeed, there is also some data involved with
saving registers on the stack and such, but a procedure call/return
does not "save the stack"; it pushes/pops the stack. There are
typically under 50 registers in need of saving (at, say, 8 bytes each,
that is a few hundred bytes) and caches are much larger (on-chip often
tens of KB, on-board caches often reaching into the MB range). I.e.,
the D-cache effects of a procedure call should be minimal.

There are some cases of F90 procedure calls that require
copy-in/copy-out, which will play havoc with the D-cache. E.g., an
assumed shape actual (not guaranteed sequence associated) to an
explicit shape dummy (requires sequence association). The other
direction, explicit shape actual to assumed shape dummy, often requires
building a descriptor. This descriptor is generally small (dozens of
bytes) and lives on the stack (as opposed to requiring heap
alloc/free). I.e., the descriptor perturbs the D-cache only
minimally.
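
To make the two directions concrete, here is a minimal sketch. All
names (call_styles, scale_explicit, scale_assumed, middle) are made up
for illustration, and whether a temporary is actually created in the
first case is up to the compiler; the comments describe what an
implementation will typically do, not what any particular one is
guaranteed to do.

```fortran
module call_styles
  implicit none
contains
  ! Explicit-shape dummy: the callee assumes contiguous,
  ! sequence-associated storage of n elements.
  subroutine scale_explicit(n, x, factor)
    integer, intent(in)    :: n
    real,    intent(inout) :: x(n)
    real,    intent(in)    :: factor
    x = factor * x
  end subroutine scale_explicit

  ! Assumed-shape dummy: the callee receives a descriptor
  ! (base address, bounds, strides) rather than raw storage.
  subroutine scale_assumed(x, factor)
    real, intent(inout) :: x(:)
    real, intent(in)    :: factor
    x = factor * x
  end subroutine scale_assumed

  ! x is assumed shape here, so it may be strided. Passing it on to an
  ! explicit-shape dummy is the case where a compiler will typically
  ! pack x into a contiguous temporary (copy-in), make the call, and
  ! unpack the result afterwards (copy-out).
  subroutine middle(x)
    real, intent(inout) :: x(:)
    call scale_explicit(size(x), x, 2.0)
  end subroutine middle
end module call_styles

program copy_in_out_demo
  use call_styles
  implicit none
  real    :: a(10)
  integer :: i

  a = [(real(i), i = 1, 10)]

  ! Strided section -> assumed shape -> explicit shape:
  ! the likely copy-in/copy-out path.
  call middle(a(1:10:2))

  ! Strided section -> assumed shape only: a descriptor with
  ! stride 2 is enough, so no temporary should be needed.
  call scale_assumed(a(2:10:2), 3.0)

  print *, a
end program copy_in_out_demo
```

The odd-indexed elements go through the explicit-shape path and the
even-indexed ones through the descriptor path; the results are the
same either way, only the traffic through the D-cache differs.
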
Now for MATMUL... Depends a lot on the compiler and the run-time. If
I were writing the runtime (I don't, but I've gotten within sneezing
distance) for when the compiler doesn't inline the MATMUL, I'd have it
take three descriptors (one for target, one each for the two arrays).
This would avoid all copy-in/copy-out. If I really wanted to go wild,
I might even have a version with explicit shape dummies that the
compiler would call when the actuals are all sequence associated, thus
invoking whatever optimizations can be done when sequence association
is guaranteed and without the copy-in/copy-out overhead.
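
As a rough sketch of that two-entry-point idea (not how any actual
runtime does it): the names matmul_desc and matmul_seq are
hypothetical, the choice between them is made at run time here with
the Fortran 2008 IS_CONTIGUOUS intrinsic rather than by the compiler
at the call site, and shape conformance checks are left out for
brevity.

```fortran
module matmul_sketch
  implicit none
contains
  ! Descriptor-based entry point: three assumed-shape dummies, so
  ! strided actuals are handled without copy-in/copy-out.
  subroutine matmul_desc(c, a, b)
    real, intent(out) :: c(:,:)
    real, intent(in)  :: a(:,:), b(:,:)
    integer :: i, j, k

    ! Assumption: dispatch at run time; a compiler could instead pick
    ! the entry point statically when it knows the actuals are
    ! sequence associated.
    if (is_contiguous(a) .and. is_contiguous(b) .and. is_contiguous(c)) then
      call matmul_seq(size(a,1), size(a,2), size(b,2), c, a, b)
      return
    end if

    c = 0.0
    do j = 1, size(b,2)
      do k = 1, size(a,2)
        do i = 1, size(a,1)
          c(i,j) = c(i,j) + a(i,k) * b(k,j)
        end do
      end do
    end do
  end subroutine matmul_desc

  ! Explicit-shape entry point: storage is known to be contiguous,
  ! which is where blocking, vectorization, etc. would go.
  subroutine matmul_seq(m, n, p, c, a, b)
    integer, intent(in)  :: m, n, p
    real,    intent(out) :: c(m,p)
    real,    intent(in)  :: a(m,n), b(n,p)
    integer :: i, j, k

    c = 0.0
    do j = 1, p
      do k = 1, n
        do i = 1, m
          c(i,j) = c(i,j) + a(i,k) * b(k,j)
        end do
      end do
    end do
  end subroutine matmul_seq
end module matmul_sketch

program matmul_sketch_demo
  use matmul_sketch
  implicit none
  real    :: big(4,6), a(2,3), b(3,2), c(2,2)
  integer :: i

  big = reshape([(real(i), i = 1, 24)], [4, 6])
  a   = big(1:2, 1:3)
  b   = reshape([(real(i), i = 1, 6)], [3, 2])

  ! Whole arrays: takes the explicit-shape path.
  call matmul_desc(c, a, b)
  if (maxval(abs(c - matmul(a, b))) > 1.0e-5) error stop "contiguous case"

  ! Strided section: stays on the descriptor path, no packing needed.
  call matmul_desc(c, big(1:4:2, 1:6:2), b)
  if (maxval(abs(c - matmul(big(1:4:2, 1:6:2), b))) > 1.0e-5) &
    error stop "strided case"

  print *, "both paths agree with the intrinsic MATMUL"
end program matmul_sketch_demo
```
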