Further progress on "why is my mac so slow".

It turns out that the benchmark I sent round previously was completely 
screwed by the use of

   #define IMAX  1E8

This is a double, not an int, so each time round the loop the loop counter 
was being converted to a double for the comparison. Changing the code to

   long IMAX = 1E8;

brings the mac into line (slightly faster).
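
As a minimal sketch of the difference (the function names are made up for
illustration and are not from the original benchmark), the two loop forms
look roughly like this:

```c
/* Bound is a double, as with "#define IMAX 1E8": the integer counter
   is converted to double for every comparison. */
long count_with_double_bound(double bound)
{
    long n = 0;
    for (long i = 0; i < bound; i++)   /* i is promoted to double here */
        n++;
    return n;
}

/* Bound is a long, as with "long IMAX = 1E8;": the comparison stays
   in integer arithmetic. */
long count_with_long_bound(long bound)
{
    long n = 0;
    for (long i = 0; i < bound; i++)
        n++;
    return n;
}
```

Both return the same count; only the type used in the loop test differs.
An optimizing compiler may well remove either loop entirely, so this shows
the type issue rather than being a usable benchmark.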

I next tried a "real world" test: KAPPA Fourier runs twice as fast on my 
linux box as on my mac.

Further profiling of KAPPA Fourier indicated two results:

   * On the mac, using memcpy or memmove is 50% faster than the
     equivalent loop:

       DO I = 1, NMAX
         OUT(I) = IN(I)
       END DO

     On linux memcpy/memmove takes the same time as the normal loop.
     The memcpy on OSX is hand-crafted assembler but it looks like
     linux glibc is straight C code.

   * Almost all the time was spent extracting columns for the FFT.
     This really kills the G5 since it always tries to prefetch the row.
     I can't see any way of making this faster since the columns are
     needed.

The recommended flags of -malign-natural and -falign-loops=16 have 
minuscule effects.

Unfortunately, the biggest effect is the jumping around in a data array 
and there is little I can think of to solve it.
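
To illustrate the access pattern, here is a rough C sketch (extract_column
is a hypothetical name, not a KAPPA routine). It assumes a row-major image
as in C; Fortran stores arrays the other way round, so in the Fortran code
the strided direction is the opposite one, but the effect is the same:

```c
#include <stddef.h>

/* Copy one column of a row-major nrows x ncols image into out.
   Successive reads are ncols elements apart, so each read lands in a
   different part of memory instead of walking along a row. */
void extract_column(const double *image, size_t nrows, size_t ncols,
                    size_t col, double *out)
{
    for (size_t r = 0; r < nrows; r++)
        out[r] = image[r * ncols + col];
}
```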

The first issue can be "solved" though. My plan is to optimize VEC_xTOx
in PRM to call memmove. This should have no effect on systems that 
implement memmove as a for loop but will have a big effect on the G5.
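
The idea, sketched in C with made-up function names (this is not the PRM
code itself), is simply to swap the element-by-element loop for one
library call:

```c
#include <stddef.h>
#include <string.h>

/* Element-by-element copy, the equivalent of the Fortran DO loop. */
void copy_loop(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i];
}

/* Same copy done as a single memmove call; n is a count of elements,
   so it is scaled by the element size to get bytes. */
void copy_memmove(const float *in, float *out, size_t n)
{
    memmove(out, in, n * sizeof *in);
}
```

The two give identical results; any speed difference comes entirely from
how the platform implements memmove.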

To that end I've done the following:

  * Removed any calls to COPAR/COPAD (mainly kappa fourier it turns out)
    and replaced with a call to VEC_xTOx.

  * Replaced the loop in CCG1_COPA and CCG1_COPS with calls to VEC_xTOx

  * I'm intending to remove copar.f and copad.f from kaplibs since they
    aren't used. copy2d.f is also unused.

Next up is to sort out VEC_xTOx, followed by profiling some more to find 
out which explicit loops in kappa can be replaced with a call to a VEC_ 
routine.

Am I allowed to use "generic" code in PRM?

It seems that many loops are over 2-D image copies. I can optimize that 
to a certain extent (using VEC_ or a wrapper around VEC_ if the number of 
elements overflows an INTEGER). Unfortunately I can't think of an obvious 
name for the function. It can't be VEC_ since it isn't vectorized
and the existing COPY2D in Kaplibs probably shouldn't stay in kaplibs
since we don't want a big kaplibs dependency for someone to do an array 
copy. I was thinking of calling it ARR_xTOx and putting it in PRM 
(IMG_xTOx would be confusing for a 2-D image optimization).

Something like:

   ARR_xTOx( USEBAD, NDIMS, DIMS, LBND, UBND, INARR, OUTARR, IERR, NERR,
             STATUS)

(with USEBAD, IERR and NERR simply there to match the PRM interface for 
VEC_). Then if LBND is all ones and UBND == DIMS this is a plain VEC_
copy, else it's a manual copy of a subarray (it's actually still a VEC_
copy if the first NDIMS-1 dims match DIMS, since the region is then
contiguous in memory).
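
A rough C sketch of that logic follows. Everything in it is an assumption
for illustration: the name arr_copy_sub, the ARR_MAXDIM limit, a single
double type, an output array that holds the subarray packed contiguously,
and Fortran ordering (first dimension varies fastest) with 1-based bounds.
USEBAD, IERR, NERR and STATUS are left out.

```c
#include <stddef.h>
#include <string.h>

#define ARR_MAXDIM 7   /* hypothetical limit on the number of dimensions */

/* Copy the region lbnd..ubnd of `in` (shape dims) into `out`, packed.
   Assumes 1 <= ndims <= ARR_MAXDIM and lbnd[d] <= ubnd[d] <= dims[d].
   Returns the number of elements copied. */
size_t arr_copy_sub(int ndims, const size_t *dims, const size_t *lbnd,
                    const size_t *ubnd, const double *in, double *out)
{
    size_t stride[ARR_MAXDIM];   /* element stride of each input dim  */
    size_t idx[ARR_MAXDIM];      /* current 1-based index, outer dims */
    size_t copied = 0;
    int full = 0;
    int d;

    stride[0] = 1;
    for (d = 1; d < ndims; d++)
        stride[d] = stride[d - 1] * dims[d - 1];

    /* Count the leading dimensions that the region spans completely. */
    while (full < ndims && lbnd[full] == 1 && ubnd[full] == dims[full])
        full++;

    /* Dimension k is the last one inside a contiguous run: all dims
       below it are fully spanned, and its own range is contiguous. */
    int k = (full < ndims) ? full : ndims - 1;
    size_t run = stride[k] * (ubnd[k] - lbnd[k] + 1);

    for (d = k + 1; d < ndims; d++)
        idx[d] = lbnd[d];

    for (;;) {
        /* Offset of the start of this run in the input array. */
        size_t off = stride[k] * (lbnd[k] - 1);
        for (d = k + 1; d < ndims; d++)
            off += stride[d] * (idx[d] - 1);

        memmove(out + copied, in + off, run * sizeof *in);
        copied += run;

        /* Step the outer indices like an odometer. */
        for (d = k + 1; d < ndims; d++) {
            if (idx[d] < ubnd[d]) {
                idx[d]++;
                break;
            }
            idx[d] = lbnd[d];
        }
        if (d == ndims)
            break;   /* every outer index wrapped round: finished */
    }
    return copied;
}
```

When the first NDIMS-1 dimensions are fully spanned there are no outer
dimensions left to step through, so the whole region goes in one memmove
call, which is the VEC_ case described above. Counts are held in size_t
here, which sidesteps the INTEGER overflow concern in this sketch.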

-- 
Tim Jenness
JAC software
http://www.jach.hawaii.edu/~timj