On Saturday 22 February 2003 12:11, Nils Smeds wrote:
> [log in to unmask] said:
> > No production Windows or linux has dealt with the possibilities of
> > task scheduling on HyperThreaded MPI nodes, or using all the logical
> > processors without excessive message passing overhead.
>
> As far as I know, Linux schedules on hyper-threaded Xeon processors in
> the later 2.4 kernels. I am not entirely sure which exact version
> introduced it.
Basic HT scheduling came with kernel 2.4.17. I didn't want to rehash the
speculation on how much might be gained with the refinements in the 2.5 and
later kernels, only to point out that all of the current OSes leave some likely
improvements untouched. Major distros often discard new features of
development kernels anyway. Windows versions leave us more in the dark as to
which scheduling considerations are invoked, and which are ignored.
> In my experience, hyper-threading is of limited gain
> for most traditional Fortran codes as they typically target the
> scientific computing domain with intense utilization of floating point
> computations, caches and the memory hierarchy. Sharing the same CPU
> package reduces, for example, the memory bandwidth available to each
> logical processor, along with other important performance parameters,
> when both logical processors are used for the application.
>
> On the other hand, the second logical processor can be ready at hand for
> taking care of the asynchronous part of handling the MPI communication.
>
> I would expect, although I have no personal experience, that using N
> physical hyper-threaded CPUs would yield better performance than using
> N non-hyper-threaded CPUs, in particular when N approaches the number
> of available CPUs. Trying to use all logical processors for numerically
> intensive applications will likely not be efficient unless the
> computations are structured so that they are not affected by the sharing
> of the common caches.
>
> /Nils
You are correct that cache sharing is among the limits on the performance
gain from increased parallelism with HT, and that cache blocking parameters
optimize differently when both logical processors are active.
As I understand it, current linux kernels do not permit the logical
processors to access each other's data in L1 cache, in order to alleviate the
"64K" aliasing and resulting cache evictions. I don't know the net effect in
cases of false sharing. If the processors are permitted to see each other's
L2 data, the L1 exclusion may not be so detrimental. False sharing between
logical processors seems to have a severe performance effect, where both
processors write into the same 64-byte line, or one reads and the other
writes to the same 128-byte line. As far as I know, this problem is the same
regardless of MP programming model (MPI, OpenMP, ...).
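
To make the false sharing concrete, here is a minimal sketch (mine, not
something from this thread) in C with pthreads: two threads each increment
their own counter, first with the counters packed into what should be the
same 64-byte line, then with a pad that pushes them onto separate lines. The
64-byte figure and the pthreads setup are just my assumptions for a P4/linux
box; the interesting part is comparing the two timings, assuming the
scheduler puts the two threads on different logical processors.

#include <pthread.h>
#include <stdio.h>
#include <sys/time.h>

#define ITERS 100000000L
#define LINE  64   /* assumed cache line size, per the 64-byte case above */

struct unpadded { volatile long a; volatile long b; };                 /* same line   */
struct padded   { volatile long a; char pad[LINE]; volatile long b; }; /* split lines */

static struct unpadded u;
static struct padded   p;

static double seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

/* each thread spins incrementing the one counter it was handed */
static void *bump(void *arg)
{
    volatile long *c = arg;
    long i;
    for (i = 0; i < ITERS; i++)
        (*c)++;
    return NULL;
}

/* one writer in a second thread, one in the main thread, timed together */
static double race(volatile long *a, volatile long *b)
{
    pthread_t t;
    double t0 = seconds();
    pthread_create(&t, NULL, bump, (void *)b);
    bump((void *)a);
    pthread_join(t, NULL);
    return seconds() - t0;
}

int main(void)
{
    printf("counters on one line   : %.2f s\n", race(&u.a, &u.b));
    printf("counters on split lines: %.2f s\n", race(&p.a, &p.b));
    return 0;
}

If the padded case runs much faster, that's the false-sharing penalty in
isolation.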
I was pointing out that MPI brings along the specific problem of additional
message-passing overhead if the additional logical processors are treated as
a facility for supporting additional MPI processes. In my experience,
'mpirun -np 2' on a single-CPU P4 increases throughput by about 10% over
-np 1, but that gain doesn't hold up when scaling to a large cluster with
simple interconnects. There seems to be a tacit acknowledgement that Windows
clusters may have more need than linux for proprietary fast interconnects; of
course that factors into the relative cost-effectiveness.
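
For anyone who wants to put a number on that overhead, here is a rough
ping-pong sketch (again mine, not from the thread) using only standard MPI
calls: ranks 0 and 1 bounce a small message back and forth and rank 0
reports the average round trip. Running it with 'mpirun -np 2' confined to
the two logical processors of one HT package, and again across two physical
CPUs or two nodes, gives a feel for where the extra message-passing cost
shows up. The message size and repetition count are arbitrary choices.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int reps = 10000, count = 16;   /* 16 doubles = 128-byte payload */
    double buf[16] = {0};
    double t0, t1;
    int rank, i;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);          /* start both ranks together */
    t0 = MPI_Wtime();
    for (i = 0; i < reps; i++) {
        if (rank == 0) {                  /* rank 0 sends first, then waits */
            MPI_Send(buf, count, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, count, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {           /* rank 1 echoes the message back */
            MPI_Recv(buf, count, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(buf, count, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("average round trip: %g microseconds\n",
               (t1 - t0) / reps * 1e6);

    MPI_Finalize();
    return 0;
}

Compile with mpicc and launch with 'mpirun -np 2' as above.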
--
Tim Prince