Dear All,
This subject appears on the list with some regularity. I didn't
include this posting in one of the more recent threads because I ask
a question here and I would really be happy to get some answers.
The subject in two sentences:
What is the best way to write array operations in F90:
- use intrinsic operators and procedures?
- use BLAS, esp. computer vendor optimized BLAS?
- write good old DO loops?
How does the answer to the above depend on the way the array was
created?
As I have been curious about this for some time, I wrote a short
program which compares array operations with some BLAS level 1
procedures. The operands are all the same size (~8 times larger
than my cache) but created in different ways: static definition,
allocatable and pointer.
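
For illustration, the three kinds of operand could be declared as in
the sketch below. This is not the benchmark program itself: the
program name, the array names and the size n are assumptions made
here only to show the three declarations side by side.

```fortran
program array_kinds
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1000000   ! hypothetical size, not the benchmark's

  real(dp)              :: xs(n)      ! static: size fixed at compile time
  real(dp), allocatable :: xa(:)      ! allocatable: sized at run time
  real(dp), pointer     :: xp(:)      ! pointer: target allocated at run time

  allocate(xa(n))
  allocate(xp(n))

  ! The same whole-array operation applied to each kind of operand.
  xs = 1.0_dp
  xa = 1.0_dp
  xp = 1.0_dp

  print *, sum(xs), sum(xa), sum(xp)

  deallocate(xa)
  deallocate(xp)
end program array_kinds
```

All three arrays hold the same data and are contiguous; only the
information available to the compiler at compile time differs.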
I used the Sun Performance Library BLAS and the Sun f90 compiler
with the '-fast' optimization option.
These are the timing results (averaged over ~100 repetitions):
Operation |            BLAS             |            Array            |
          | Static  | Alloc   | Pointer | Static  | Alloc   | Pointer |
----------+---------+---------+---------+---------+---------+---------+
COPY / =  | 0.1286  | 0.1335  | 0.6910  | 0.1280  | 0.1331  | 0.2851  |
Dot prod  | 0.0861  | 0.0943  | 0.6516  | 0.0952  | 0.1000  | 0.1054  |
AXPY/=*+  | 0.1273  | 0.1400  | 0.6906  | 0.1300  | 0.1362  | 0.3646  |
SCAL/=*   | 0.0881  | 0.0889  | 0.3647  | 0.0896  | 0.0945  | 0.0940  |
ASUM/..   | 0.1308  | 0.1326  | 0.4099  | 0.3483  | 0.3666  | 0.3755  |
The operations are BLAS procedures DCOPY, DDOT, etc. The array
operations are simple array expressions doing exactly the same
thing. If anyone's interested, I may post the program.
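
As a rough guide to the rows of the table, the sketch below pairs each
array expression with the BLAS level 1 call it is meant to match. The
BLAS calls are shown as comments so that the program compiles without
a BLAS library; uncommenting them requires linking one. The variable
names are hypothetical, and the exact expressions used in the
benchmark (in particular the one for ASUM) are assumed here.

```fortran
program blas1_vs_array
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 5          ! tiny size, for illustration only
  real(dp) :: x(n), y(n), a, d, s
  integer  :: i

  x = (/ (real(i, dp), i = 1, n) /)
  y = 0.0_dp
  a = 2.0_dp

  y = x                    ! COPY:  call dcopy(n, x, 1, y, 1)
  d = dot_product(x, y)    ! DOT:   d = ddot(n, x, 1, y, 1)
  y = y + a * x            ! AXPY:  call daxpy(n, a, x, 1, y, 1)
  x = a * x                ! SCAL:  call dscal(n, a, x, 1)
  s = sum(abs(x))          ! ASUM:  s = dasum(n, x, 1)  (assumed equivalent)

  print *, d, s
end program blas1_vs_array
```

The last expression is the one most likely to need a temporary,
because abs(x) is an array-valued result that is then reduced by sum.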
Judging by the stability of these results in the last repetitions,
I'd say the table is accurate in all but the last digit. So BLAS
DDOT *is* more efficient than intrinsic DOT_PRODUCT for static and
allocatable arrays.
All arrays are contiguous. All are passed as a whole (no sections,
esp. irregular). I see no reason for such a huge difference between
allocatable arrays and pointers.
The most important (I think) observations to make are:
1) static arrays are better than allocatables, which in turn
are better than pointers,
2) BLAS operations on pointers incur a huge penalty in all cases,
3) array operations on pointers sometimes incur a large penalty,
4) complex array operations (i.e., ones that cause creation of
a temporary) are more expensive than the corresponding BLAS
calls, if the arguments are not pointers.
Could someone explain to me why this table looks like this? Does it
have to be so bad for BLAS/pointers?
My guess so far is the following:
- when pointer arguments are passed to BLAS routines (declared with
assumed size arguments), they are copied on entry and exit.
- unnecessary creation of temporaries slows down calculation of some
array expressions.
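
The first guess can be made concrete with the sketch below. It is only
an illustration of the situation, not a measurement: my_scal is a
hypothetical stand-in for a BLAS routine with an assumed size dummy
argument, and whether a given compiler really copies a contiguous
pointer argument is an assumption that would have to be checked.

```fortran
module scal_mod
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Hypothetical stand-in for a BLAS level 1 routine. The dummy x is
  ! assumed size, so the routine expects contiguous storage.
  subroutine my_scal(n, a, x)
    integer,  intent(in)    :: n
    real(dp), intent(in)    :: a
    real(dp), intent(inout) :: x(*)
    integer :: i
    do i = 1, n
      x(i) = a * x(i)
    end do
  end subroutine my_scal
end module scal_mod

program pointer_to_assumed_size
  use scal_mod
  implicit none
  integer, parameter :: n = 4
  real(dp), allocatable :: xa(:)
  real(dp), target      :: buf(2*n)
  real(dp), pointer     :: xp(:)

  ! Allocatable actual argument: known to be contiguous.
  allocate(xa(n))
  xa = 1.0_dp
  call my_scal(n, 2.0_dp, xa)

  ! Pointer to a strided section: the elements are not contiguous, so
  ! a contiguous temporary has to be passed and copied back.
  buf = 1.0_dp
  xp => buf(1:2*n:2)
  call my_scal(n, 2.0_dp, xp)

  ! Pointer to a whole array: contiguous at run time, but the compiler
  ! cannot tell that from the declaration of xp alone.
  xp => buf
  call my_scal(2*n, 2.0_dp, xp)

  print *, xa
  print *, buf
  deallocate(xa)
end program pointer_to_assumed_size
```

Because a pointer may be associated with a section like the strided
one above, a compiler that does not test for contiguity at run time
may copy on every call, which would be consistent with the guess.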
And what about DO loops? I haven't experimented with them yet. Any advice?
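
A DO loop experiment could be set up along the lines of the sketch
below, which times an explicit loop against the equivalent array
expression for the AXPY case. The size n, the repetition count and
all names are assumptions chosen for illustration; no results are
implied.

```fortran
program do_loop_vs_array
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1000000    ! hypothetical size
  integer, parameter :: reps = 100     ! hypothetical repetition count
  real(dp), allocatable :: x(:), y(:)
  real(dp) :: a
  integer  :: i, r, t0, t1, rate

  allocate(x(n), y(n))
  x = 1.0_dp
  a = 0.5_dp

  ! Explicit DO loop version of y = y + a*x.
  y = 0.0_dp
  call system_clock(t0, rate)
  do r = 1, reps
    do i = 1, n
      y(i) = y(i) + a * x(i)
    end do
  end do
  call system_clock(t1)
  ! y(n) is printed so that the result of the loop is actually used.
  print *, 'DO loop  :', real(t1 - t0, dp) / real(rate, dp) / real(reps, dp), y(n)

  ! Array expression version of the same operation.
  y = 0.0_dp
  call system_clock(t0)
  do r = 1, reps
    y = y + a * x
  end do
  call system_clock(t1)
  print *, 'Array op :', real(t1 - t0, dp) / real(rate, dp) / real(reps, dp), y(n)

  deallocate(x, y)
end program do_loop_vs_array
```

The same pattern can be repeated with pointer and static operands to
fill in a third column group of the table.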
Regards,
Artur Swietanowski
----------------------------------------------------------------------
Artur Swietanowski
Institut fuer Statistik, Operations Research und Computerverfahren,
Universitaet Wien, Universitaetsstr. 5, A-1010 Wien, Austria
tel. +43 (1) 407 63 55 - 120 fax +43 (1) 406 41 59
----------------------------------------------------------------------