Further progress on "why is my mac so slow".

It turns out that the benchmark I sent round previously was completely
screwed by the use of

  #define IMAX 1E8

This is a double, not an int, so each time round the loop the loop
counter was being cast to a double. Changing the code to

  long IMAX = 1E8;

brings the mac into line (slightly faster).

I next tried "real world". KAPPA Fourier runs twice as fast on my linux
box as on my mac. Further profiling of KAPPA Fourier indicated two
results:

 * On the mac, using memcpy or memmove is 50% faster than the
   equivalent

      DO I = 1, NMAX
         OUT(I) = IN(I)
      END DO

   On linux memcpy/memmove takes the same time as the normal loop. The
   memcpy on OSX is hand-crafted assembler but it looks like linux
   glibc is straight C code.

 * Almost all the time was spent extracting columns for the FFT. This
   really kills the G5 since it always tries to prefetch the row. I
   can't see any way of making this faster since the columns are
   needed.

The recommended flags of -malign-natural and -falign-loops=16 have
minuscule effects. Unfortunately, the biggest effect is the jumping
around in a data array and there is little I can think of to solve it.

The first issue can be "solved" though. My plan is to optimize VEC_xTOx
in PRM to call memmove. This should have no effect on systems that
implement memmove as a for loop but will have a big effect on the G5.

To that end I've done the following:

 * Removed any calls to COPAR/COPAD (mainly kappa fourier, it turns
   out) and replaced them with a call to VEC_xTOx.
 * Replaced the loop in CCG1_COPA and CCG1_COPS with calls to VEC_xTOx.
 * I'm intending to remove copar.f and copad.f from kaplibs since they
   aren't used. copy2d.f is also unused.

Next up is to sort out VEC_xTOx, followed by profiling some more to
find out which explicit loops in kappa can be replaced with a call to
VEC_.

Am I allowed to use "generic" code in PRM?

It seems that many loops are over 2-D image copies.
I can optimize that to a certain extent (using VEC_ or a wrapper around
VEC_ if the number of elements overflows an INTEGER). Unfortunately I
can't think of an obvious name for the function. It can't be VEC_ since
it isn't vectorized, and the existing COPY2D in kaplibs probably
shouldn't stay in kaplibs since we don't want a big kaplibs dependency
for someone to do an array copy.

I was thinking of calling it ARR_xTOx and putting it in PRM (IMG_xTOx
would be confusing for a 2-D image optimization). Something like:

  ARR_xTOx( USEBAD, NDIMS, DIMS, LBND, UBND, INARR, OUTARR,
            IERR, NERR, STATUS )

(with USEBAD, IERR and NERR simply there to match the PRM interface for
VEC_). If LBND is all ones and UBND == DIMS then this is a VEC_ copy;
otherwise it's a manual copy of a subarray (it's actually a VEC_ copy
if the first NDIMS-1 dims match DIMS).

-- 
Tim Jenness
JAC software
http://www.jach.hawaii.edu/~timj
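
For illustration, the ARR_xTOx logic described above could be sketched
in C roughly as follows. This is only a sketch under assumptions, not
PRM code: the name arr_copy_sub, the ARR_MAXDIM limit, the restriction
to float (REAL) data and the packed layout of the output array are all
hypothetical, and USEBAD, IERR, NERR and STATUS are left out. Arrays
are taken to be in Fortran order with 1-based inclusive bounds.

```c
#include <stddef.h>
#include <string.h>

#define ARR_MAXDIM 7   /* assumed maximum number of axes */

/* Hypothetical helper: copy the sub-array lbnd..ubnd of `in` (shape
   `dims`, first axis varying fastest) into `out`, which is assumed to
   be packed with shape ubnd - lbnd + 1. Requires ndims <= ARR_MAXDIM. */
void arr_copy_sub(int ndims, const size_t dims[],
                  const size_t lbnd[], const size_t ubnd[],
                  const float *in, float *out)
{
    size_t stride[ARR_MAXDIM];  /* element stride of each axis of `in` */
    size_t idx[ARR_MAXDIM];     /* current 1-based position */
    size_t run = 1;             /* elements per contiguous block */
    int k = 0;                  /* axes 0..k-1 are covered by one block */
    int i;

    stride[0] = 1;
    for (i = 1; i < ndims; i++)
        stride[i] = stride[i - 1] * dims[i - 1];

    /* Leading axes that are copied in full merge into one block. */
    while (k < ndims && lbnd[k] == 1 && ubnd[k] == dims[k]) {
        run *= dims[k];
        k++;
    }
    /* The first partially covered axis is still contiguous. */
    if (k < ndims) {
        run *= ubnd[k] - lbnd[k] + 1;
        k++;
    }

    for (i = 0; i < ndims; i++)
        idx[i] = lbnd[i];

    for (;;) {
        size_t off = 0;
        for (i = 0; i < ndims; i++)
            off += (idx[i] - 1) * stride[i];

        /* One block copy per contiguous run. */
        memmove(out, in + off, run * sizeof(*in));
        out += run;

        /* Step the remaining outer axes like an odometer. */
        for (i = k; i < ndims; i++) {
            if (idx[i] < ubnd[i]) {
                idx[i]++;
                break;
            }
            idx[i] = lbnd[i];
        }
        if (i == ndims)
            break;              /* every outer axis has wrapped: done */
    }
}
```

When LBND is all ones and UBND == DIMS, or when only the last axis is
partially covered, the loop body runs once and the whole copy is a
single memmove, which is the VEC_ case. Otherwise there is one memmove
per contiguous run.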