This has come up before, but I'm not sure exactly what the conclusion was, so...
We've got some shiny new X5650s, 2 CPUs per node, 6 cores per CPU, plus
hyperthreading. Benchmarking these shows the hyperthreading is pretty effective
- going from 12 to 14 simultaneous runs gains an extra 16.8% HS06 in total, and
from 14 to 16 an extra 11.8%. (Beyond that the gains diminish, and the
increased contention for I/O would likely outweigh any remaining increase.)
Consequently, we intend initially to run 16 jobs per node and see how that
performs in practice.
This creates a bit of a headache in publishing. My understanding is that the
Glue Schema Usage doc (
https://twiki.cern.ch/twiki/pub/LCG/WLCGCommonComputingReadinessChallenges/WLCG_GlueSchemaUsage-1.8.pdf
- is this the latest?) explicitly forbids publishing 'overallocations':
"LogicalCPUs - defined as the “Total number of cores/hyperthreaded CPUs in the
SubCluster” In other words, LogicalCPUs counts the number of computing units
seen by the OS on the WNs. Sites typically configure one job slot per logical
CPU, but some sites allow more than this ... The 1.3 GLUE schema does not allow
such an over allocation to be published explicitly." and "System
administrators MUST set this variable to the value of the “Total number of
cores/hyperthreaded CPUs in the SubCluster"
As I see it we have several options. If we publish the number of cores (not
hyperthreads) and the appropriate HS06, the size and composition of the cluster
will be accurate, but our accounting would be somewhat overstated (as the HS06
per job would be lower with 16 jobs than it would be with 12).
Conversely, if we publish the number of hyperthreads and the appropriate HS06,
the size and composition of the cluster will also be accurate (for
hyperthreading), but our accounting would be significantly understated.
If we publish the number of cores, but the HS06 for 16 simultaneous jobs, then
the composition of the cluster and the accounting is accurate, but the size of
the cluster isn't.
The logical thing to do would seem to be to publish the number of jobs run per
CPU (8, in this instance) as cores/hyperthreads and the appropriate HS06, in
which case the total size of the cluster and the accounting would be accurate
(when we're full of jobs anyway), but the physical composition would not be
(which seems least important to me). But as mentioned, the schema seems to
forbid this.
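To make the trade-offs concrete, here's a rough sketch of the arithmetic,
assuming a purely hypothetical 120 HS06 total for a node running 12 jobs (the
16.8% and 11.8% increments are the measured ones above; everything else is
illustrative):

```python
# Hypothetical per-node figures; only the percentage gains are measured.
h12 = 120.0            # assumed: total HS06 with 12 simultaneous runs
h14 = h12 * 1.168      # +16.8% measured going from 12 to 14 runs
h16 = h14 * 1.118      # +11.8% measured going from 14 to 16 runs

jobs = 16              # job slots actually run per node
actual_per_job = h16 / jobs

# Publish 12 cores with the 12-job HS06: per-job power overstated.
per_core_12 = h12 / 12
# Publish 24 hyperthreads with the 16-job HS06: per-job power understated.
per_thread_24 = h16 / 24
# Publish 16 slots (8 per "CPU") with the 16-job HS06: per-job power matches.
per_slot_16 = h16 / jobs

# The per-slot option is the only one that matches what a job actually gets.
assert per_thread_24 < actual_per_job < per_core_12
assert per_slot_16 == actual_per_job
print(f"actual per-job HS06:     {actual_per_job:.2f}")
print(f"publish 12 cores:        {per_core_12:.2f} per slot (overstated)")
print(f"publish 24 hyperthreads: {per_thread_24:.2f} per slot (understated)")
print(f"publish 16 slots:        {per_slot_16:.2f} per slot")
```

With these assumed numbers the per-slot figures come out around 9.8 (actual),
10.0 (cores) and 6.5 (hyperthreads), which is the over/understatement
described above.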
So... what are we supposed to do?
--
Robert Fay [log in to unmask]
System Administrator office: 220
High Energy Physics Division tel (int): 43396
Oliver Lodge Laboratory tel (ext): +44 (0)151 794 3396
University of Liverpool http://www.liv.ac.uk/physics/hep/