Further progress on "why is my mac so slow".
It turns out that the benchmark I sent round previously was completely
screwed by the use of
#define IMAX 1E8
This is a double not an int so each time round the loop the loop counter
was being cast to a double. Changing the code to
long IMAX = 1E8;
brings the mac into line (slightly faster).
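
As a minimal sketch of the difference (the function names here are made up for illustration, and how much the conversion costs will depend on compiler and optimization level), the two forms of the loop look like this:

```c
/* Bound held as a double, as with "#define IMAX 1E8": the comparison
   i < imax converts the integer counter to double each time round. */
long count_double_bound(double imax)
{
    long n = 0;
    for (long i = 0; i < imax; i++) {
        n++;
    }
    return n;
}

/* Bound held as a long, as with "long IMAX = 1E8;": the double constant
   is converted once at initialisation and the loop test is integer-only. */
long count_long_bound(long imax)
{
    long n = 0;
    for (long i = 0; i < imax; i++) {
        n++;
    }
    return n;
}
```

Both functions count the same number of iterations; only the type used in the loop test differs.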
I next tried a "real world" test: KAPPA Fourier runs twice as fast on my linux
box as on my mac.
Further profiling of KAPPA Fourier indicated two results:
* Using memcpy or memmove is 50% faster than the equivalent
DO I = 1, NMAX
OUT(I) = IN(I)
END DO
On linux memcpy/memmove takes the same time as the normal loop.
The memcpy on OSX is hand-crafted assembler but it looks like
linux glibc is straight C code.
* Almost all the time was spent extracting columns for the FFT.
This really kills the G5 since it always tries to prefetch the row.
I can't see any way of making this faster since the columns are
needed.
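
The three access patterns above can be sketched in C as follows. This is only an illustration: the function names are hypothetical, the element type is assumed to be float, and the strided routine stands in for the column extraction (reading one element every "stride" elements), not for the actual KAPPA code.

```c
#include <stddef.h>
#include <string.h>

/* Element-by-element copy, the C equivalent of the DO loop. */
void copy_loop(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i];
    }
}

/* The same copy handed to the C library in one call. */
void copy_memmove(const float *in, float *out, size_t n)
{
    memmove(out, in, n * sizeof *out);
}

/* Strided extraction: gathers "count" elements that sit "stride"
   elements apart, starting at "offset". Each read lands in a different
   part of memory, which is the access pattern that defeats prefetching. */
void extract_strided(const float *image, size_t stride, size_t offset,
                     size_t count, float *out)
{
    for (size_t i = 0; i < count; i++) {
        out[i] = image[offset + i * stride];
    }
}
```

The first two routines produce identical output and differ only in who does the copying; the third cannot be turned into a single memmove because its source elements are not contiguous.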
The recommended flags of -malign-natural and -falign-loops=16 have
minuscule effects.
Unfortunately, the biggest effect is the jumping around in a data array
and there is little I can think of to solve it.
The first issue can be "solved" though. My plan is to optimize VEC_xTOx
in PRM to call memmove. This should have no effect on systems that
implement memmove as a for loop but will have a big effect on the G5.
To that end I've done the following:
* Removed any calls to COPAR/COPAD (mainly kappa fourier it turns out)
and replaced with a call to VEC_xTOx.
* Replaced the loop in CCG1_COPA and CCG1_COPS with calls to VEC_xTOx
* I'm intending to remove copar.f and copad.f from kalibs since they
aren't used. copy2d.f is also unused.
Next up is to sort out VEC_xTOx, followed by profiling some more to find
out which explicit loops in kappa can be replaced with a VEC_ call.
Am I allowed to use "generic" code in PRM?
It seems that many loops are over 2-D image copies. I can optimize that
to a certain extent (using VEC_ or a wrapper around VEC_ if the number of
elements overloads an INTEGER). Unfortunately I can't think of an obvious
name for the function. It can't be VEC_ since it isn't vectorized
and the existing COPY2D in Kaplibs probably shouldn't stay in kaplibs
since we don't want a big kaplibs dependency for someone to do an array
copy. I was thinking of calling it ARR_xTOx and putting it in PRM
(IMG_xTOx would be confusing for a 2-D image optimization).
Something like:
ARR_xTOx( USEBAD, NDIMS, DIMS, LBND, UBND, INARR, OUTARR, IERR, NERR,
STATUS)
(with USEBAD, IERR and NERR simply there to match the prm interface for
VEC_). Then if LBND is all ones and UBND == DIMS then this is a vec_
copy, else it's a manual copy of a subarray (it's actually a VEC_
copy if the first NDIMS-1 dims match DIMS).
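
A rough C sketch of that dispatch logic follows. Everything in it is an assumption made for illustration: the name arr_copy_float, the ARR_MAXDIM limit, the float element type, Fortran-order storage with 1-based inclusive bounds, and an output array that receives the subarray packed contiguously. USEBAD, IERR, NERR and STATUS are left out.

```c
#include <stddef.h>
#include <string.h>

#define ARR_MAXDIM 7   /* hypothetical limit on NDIMS */

/* Copies the subarray lbnd..ubnd of "in" (shape dims, first axis
   fastest) into "out". Returns the number of elements copied, or 0 if
   the arguments are invalid. */
size_t arr_copy_float(int ndims, const int dims[], const int lbnd[],
                      const int ubnd[], const float *in, float *out)
{
    if (ndims < 1 || ndims > ARR_MAXDIM) return 0;
    for (int d = 0; d < ndims; d++) {
        if (lbnd[d] < 1 || ubnd[d] > dims[d] || lbnd[d] > ubnd[d]) return 0;
    }

    /* Count the leading dimensions that are taken in full. */
    int k = 0;
    while (k < ndims && lbnd[k] == 1 && ubnd[k] == dims[k]) k++;

    /* A contiguous run spans the full leading dimensions times the
       extent of the first partial one. If k >= ndims-1 the whole
       subarray is a single run, i.e. the plain vector copy case. */
    size_t run = 1;
    for (int d = 0; d < k; d++) run *= (size_t)dims[d];
    int first_outer = k;
    if (k < ndims) {
        run *= (size_t)(ubnd[k] - lbnd[k] + 1);
        first_outer = k + 1;
    }

    size_t stride[ARR_MAXDIM];
    int idx[ARR_MAXDIM];
    stride[0] = 1;
    for (int d = 1; d < ndims; d++) stride[d] = stride[d - 1] * (size_t)dims[d - 1];
    for (int d = 0; d < ndims; d++) idx[d] = lbnd[d];

    size_t copied = 0;
    for (;;) {
        size_t off = 0;
        for (int d = 0; d < ndims; d++) off += (size_t)(idx[d] - 1) * stride[d];
        memmove(out + copied, in + off, run * sizeof *out);
        copied += run;

        /* Step the remaining outer indices like an odometer. */
        int d = first_outer;
        while (d < ndims && ++idx[d] > ubnd[d]) {
            idx[d] = lbnd[d];
            d++;
        }
        if (d >= ndims) break;
    }
    return copied;
}
```

For a 4x3 input, bounds (1,1) to (4,3) or (1,2) to (4,3) each collapse to one memmove, while (2,2) to (3,3) becomes two runs of two elements.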
--
Tim Jenness
JAC software
http://www.jach.hawaii.edu/~timj