>
> Hi,
>
> A question to C. Coats first: Did you ever investigate the reason behind the
> performance degradation? Usually it is in the way you wrote the interface
> and the way you set up the variables you passed to it--it might be a copy
> overhead, or something of the sort. In general it can often be "blamed" on
> the programmer, not the standard...
You will note that I blamed compiler "quality of implementation" and
not the standard itself. But 0 for 4 on QOI is a major problem.
In this particular case, the formal and actual arguments were
dimensioned identically, by PARAMETERs--in fact, it is a cut-and-paste
job from one place to the other--something like the following (actually,
the dimension PARAMETERs were in INCLUDE files)

    SUBROUTINE VDIFF
        ...
        INTEGER, PARAMETER:: NLAYS = 31
        INTEGER, PARAMETER:: NVARS = 58
        REAL A( NLAYS )
        REAL B( NVARS, NLAYS )
        REAL C( NLAYS )
        REAL Y( NLAYS )
        ...
        CALL TRI( A, B, C, Y )
        ...
    CONTAINS
        SUBROUTINE TRI( A, B, C, Y )
            INTEGER, PARAMETER:: NLAYS = 31
            INTEGER, PARAMETER:: NVARS = 58
            REAL, INTENT( IN ):: A( NLAYS )
            REAL, INTENT( IN ):: B( NVARS, NLAYS )
            REAL, INTENT( IN ):: C( NLAYS )
            REAL, INTENT( INOUT ):: Y( NLAYS )
            ...
        END SUBROUTINE TRI
    END SUBROUTINE VDIFF

For the F77 version, SUBROUTINE TRI was separately-compiled,
stand-alone, and of course without INTENT clauses. And the call
was implemented by pass-by-reference in that case.
This is a case for which the compiler *really* *ought* to be able to
recognize that copy-in/copy-out is NOT necessary -- but the evidence
(and the machine code) indicates that none of these compilers did.
In fact (SUBROUTINE TRI being rather short), I really had expected the
compilers to implement this version by in-lining, achieving slightly
*better* performance than the F77 version did. (Manual inlining later
demonstrated a 5-10% speedup over the range of platforms involved...)
And, as Nick MacLaren has been pointing out this past week over on
"comp.arch", it was clear twenty years ago that computational cost
would shortly (relative to the Eighties) be dominated by memory
access time. It behooves the Fortran compiler writer to recognize
possibilities for pass-by-reference (and for inlining), so as to
minimize the use of copy-in/copy-out implementations of subroutine
calls.
And there is a facility I have _pleaded_ with F90 compiler writers
to provide me -- properly annotated listings, indicating the call
mechanism used for each argument. So far, I haven't received any
positive response.
Many of you will recall the output of Cray listings, with loopmarks
indicating vectorization (and vectorization-type), parallelization,
etc. I would like to extend this idea further, to provide call
mechanism annotation so that I can see when the compiler is performing
unnecessary memory traffic. I *don't* want to have to deal with PRAGMAs
that force the compiler to use particular mechanisms when I tell it to
do so; if I had to do that, I might as well be writing C ;-(
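
Absent such listings, one rough way to see at run time whether a call went through a temporary is to compare the address of the actual argument with the address the callee sees. The following is only a sketch under stated assumptions: it relies on ISO_C_BINDING (Fortran 2003, so not available to a pure F90 compiler), and the names CALL_CHECK, DUMMY_ADDR and SHOW_CALL_MECHANISM are hypothetical, invented for this illustration. It is not the TRI/VDIFF case above, and the TARGET attribute it needs may itself influence which mechanism a compiler chooses.

```fortran
module call_check
  ! Hypothetical helper module, for illustration only.
  use iso_c_binding, only: c_loc, c_intptr_t
  implicit none
  private
  public :: nlays, dummy_addr
  integer, parameter :: nlays = 31
contains
  ! Return the address of the first element of the explicit-shape
  ! dummy argument, as seen from inside the callee.
  function dummy_addr( y ) result( addr )
    real, intent( in ), target :: y( nlays )
    integer( c_intptr_t ) :: addr
    addr = transfer( c_loc( y( 1 ) ), addr )
  end function dummy_addr
end module call_check

program show_call_mechanism
  use iso_c_binding, only: c_loc, c_intptr_t
  use call_check, only: nlays, dummy_addr
  implicit none
  real, target :: y( nlays )
  real, target :: b( 2, nlays )
  integer( c_intptr_t ) :: actual

  y = 0.0
  b = 0.0

  ! Contiguous actual argument: pass-by-reference is possible, so the
  ! callee would normally see the caller's own storage.
  actual = transfer( c_loc( y( 1 ) ), actual )
  print *, 'contiguous actual, same address: ', dummy_addr( y ) == actual

  ! Strided section passed to an explicit-shape dummy: a compiler
  ! generally has to build a contiguous temporary (copy-in).
  actual = transfer( c_loc( b( 1, 1 ) ), actual )
  print *, 'strided section, same address:   ', dummy_addr( b( 1, : ) ) == actual
end program show_call_mechanism
```

What each line prints depends on the compiler and its options; a mismatch on the first line would suggest copy-in for an argument that did not need it.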
fwiw.
Carlie J. Coats, Jr.
MCNC Environmental Programs phone: (919)248-9241
North Carolina Supercomputing Center fax: (919)248-9245
3021 Cornwallis Road P. O. Box 12889
Research Triangle Park, N. C. 27709-2889 USA
"My opinions are my own, and I've got *lots* of them!"