[log in to unmask] said:
> integer, dimension(1000) :: A, B, I
> ...
> A=B(I) ! Gather
> A(I)=B ! Scatter
[...]
> On a shared memory machine, knowing that I is a permutation would
> exclude read conflicts and thus facilitate optimization greatly.
I might be missing some fine point here, but apart from that I'd call
the above operations permutations rather than gather/scatter, and I do
not quite understand which read conflicts could be optimized away to
any great extent.
There must be several people out there who have been around longer than
the cache-coherence protocols in use today, and they can surely tell
stories about how hard things were in the good old days. My admittedly
ignorant view, having been brought up with cache-coherent parallel
machines (or machines with no cache at all), is that the penalty for
reading the same position in memory, as opposed to different positions,
is in this case negligible.
I could see a delay if the same memory location were read repeatedly on,
e.g., vector computers like the Cray, which use memory banks. However,
the same delay occurs when two reads target addresses that are distant
in linear memory but located in the same bank, because
mod(address(I1)-address(I2),#banks)==0. Similar issues presumably occur
in more traditional memory architectures with interleaved memory, but
the problem there is not one of addressing the same memory location.
Thus a compiler cannot do a better job even if it knows that the array
I is a permutation.
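The bank argument can be spelled out with a toy model; the bank count of
16 and word-granular addressing are assumptions for illustration only:

```python
NBANKS = 16  # assumed bank count, for illustration

def bank(addr):
    # Interleaved memory: consecutive words go to consecutive banks.
    return addr % NBANKS

# Reading the same location twice hits the same bank ...
assert bank(5) == bank(5)
# ... but so do two reads that are far apart in linear memory,
# whenever mod(addr1 - addr2, NBANKS) == 0:
assert bank(5) == bank(5 + 7 * NBANKS)
```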
More broadly speaking, A=B(I) is (in my opinion) preferable to A(I)=B,
since the former allows the compiler to generate code that is free from
false sharing. False sharing is the true enemy in SMP programming, and
it has many disguises.
The form A(I)=B will perform well or badly depending not on the code
that the compiler generates, but rather on the contents of the array I.
The same binary code will run fast if I is such that each CPU writes to
local cache lines that are not invalidated by writes from other CPUs,
and slowly if that is not the case. Just as in the serial case, it will
run fast if the code generated by the compiler and the contents of I
happen to lead to a high ratio of cache hits in both A and B.
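Which cache line an element lands on, and hence whether two CPUs end up
invalidating each other, is pure address arithmetic. A toy model (the
64-byte line and 4-byte integer are assumed sizes, and the index split
between CPUs is invented for illustration):

```python
LINE_BYTES = 64   # assumed cache-line size
ELEM_BYTES = 4    # assumed integer size

def cache_line(index):
    # Cache line holding array element number `index` (0-based).
    return (index * ELEM_BYTES) // LINE_BYTES

# If I gives CPU 1 the indices 0..15 and CPU 2 the indices 16..31,
# the writes of A(I)=B touch disjoint lines: no false sharing.
cpu1 = {cache_line(i) for i in range(0, 16)}
cpu2 = {cache_line(i) for i in range(16, 32)}
assert cpu1.isdisjoint(cpu2)

# If instead I interleaves the indices between the CPUs, the very
# same binary code makes both CPUs write into every line involved:
# heavy false sharing.
cpu1 = {cache_line(i) for i in range(0, 32, 2)}
cpu2 = {cache_line(i) for i in range(1, 32, 2)}
assert cpu1 == cpu2
```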
That said, there are of course exceptional cases where an array
assignment can be sped up if the contents of the array I have certain
properties. The most important, I'd guess, is when I is monotonic with
equal spacing between its values. The assignment can then be turned
into memory-copy instructions, which may be more efficient. Different
compilers detect this to differing extents, but hiding such a pattern
behind indirect addressing such as A(I) will probably fool most of
them -- though I am only speculating.
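A stride pattern hidden in I is equivalent to an array section, so the
check a compiler would have to make amounts to recognizing the stride.
A sketch in Python (0-based; the function name is mine):

```python
def as_slice(I):
    """Return (start, stop, step) if I is monotonic with equal
    spacing, else None.  A gather A = B(I) with such an I is just
    the section B(start:stop:step) and could become a block copy."""
    if len(I) < 2:
        return None
    step = I[1] - I[0]
    if step == 0 or any(b - a != step for a, b in zip(I, I[1:])):
        return None
    stop = I[-1] + (1 if step > 0 else -1)
    return (I[0], stop, step)

B = list(range(100))
I = list(range(3, 50, 2))        # monotonic, equally spaced
s = as_slice(I)
assert s is not None
assert [B[i] for i in I] == B[slice(*s)]   # gather == section copy
assert as_slice([4, 1, 7, 2]) is None      # a general I does not qualify
```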
In the case of distributed-memory computing (such as the HPF model) it
is trickier, and performance will depend on the accessibility of the
remote values of A and B from earlier calculations and on their use
later in the code flow.
integer, dimension(1000) :: A, B, I
integer, dimension(100) :: IDX, TMPDATA
...
TMPDATA=A(IDX) ! Gather
B(IDX)=2*TMPDATA ! Scatter
A=3 ! Broadcast (whole-array assignment, not a scatter)