On 1/4/2011 11:28 AM, Greenberg, Naomi wrote:
> Bill,
> I inherited some code that was optimized for a Cray vector machine. There
> originally were very large computational loops that worked over large
> arrays. The person who vectorized it (very successfully) broke up the large
> loop into a series of small simple clear vectorizable loops that computed
> partials and then combined them later. He sized the partial arrays based on
> a vector size (R(nvec,2,3), with loops going in nvec chunks). My questions
> now are 1) what should nvec be set for an Itanium-based linux system or
> other 32 bit system (Intel compiler)? Is there a way to compute this
> automatically? 2) Is this the best way to do this still? Can I be hurt by it
> on other machines? Again, it's not just the loop counters that are sized,
> it's the actual data structure sizes also.
>
I would question why you would try to optimize for old Itanium (64-bit)
or 32-bit systems at this late date.
The most important points for SSE-based systems are, where possible, to
set up the data structures so that the loops begin on 32-byte-aligned
data, to give the loops a length which is a suitable multiple of the
effective unrolling factor chosen by the compiler (vector register width
included; typically 8 or 16 elements in total), and to strike a suitable
compromise between cache locality and loops long enough to get up to
speed. The cache locality question is highly dependent on both the
application and the platform.
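
As a rough sketch of the alignment and length points above, written in C
rather than in the original code's language: pad the chunk length to a
multiple of the unrolling factor and allocate the partial arrays on an
aligned boundary. ALIGN_BYTES, CHUNK_MULT, NSLABS and the helper names
are hypothetical, and the value 16 is only an assumption taken from the
"8 or 16" range; the right value has to be measured for each compiler
and platform.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical tuning constants; measure on the target machine. */
#define ALIGN_BYTES 32u  /* start-of-array alignment */
#define CHUNK_MULT  16u  /* assumed effective unroll factor, in elements */
#define NSLABS      6u   /* 2*3 slabs, mirroring R(nvec,2,3) */

/* Round n up to the next multiple of mult. */
static size_t round_up(size_t n, size_t mult)
{
    assert(mult > 0);
    return ((n + mult - 1) / mult) * mult;
}

/* Return nonzero if p sits on an align-byte boundary. */
static int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p % align) == 0;
}

/*
 * Allocate NSLABS contiguous slabs of nvec doubles each. nvec must already
 * be a multiple of CHUNK_MULT, so every slab starts on an ALIGN_BYTES
 * boundary when the base does. The caller releases the block with free().
 */
static double *alloc_partials(size_t nvec)
{
    assert(nvec > 0 && nvec % CHUNK_MULT == 0);
    /* The total size is a multiple of ALIGN_BYTES, as C11 aligned_alloc expects. */
    return aligned_alloc(ALIGN_BYTES, NSLABS * nvec * sizeof(double));
}
```

For example, a requested chunk of 1000 elements becomes
round_up(1000, CHUNK_MULT) = 1008, and alloc_partials(1008) then returns
six slabs that each start on a 32-byte boundary. Whether the padding
pays for itself still depends on the cache locality trade-off mentioned
above.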
--
Tim Prince