Peter Grandi wrote:
> I had some bizarre issue, ... but these two lines:
>
> /opt/glite/lib64
> /opt/lib4c/lib64
>
Weird, eh? Can't help there :(
>> The value set for BLAH_JOBID_PREFIX is cream02_
>> Error: BLAH_JOBID_PREFIX must be 6 chars long, begin with 'cr' and terminate with '_'. The other 3 characters must be alpha-numeric...
>> special incantations in the static LDIF files directory?
>>
We have two CREAM CEs. BLAH_JOBID_PREFIX is cream_ on the first and
cr002_ on the second. When a job runs on our PBS, its "Job Id" has the
usual format ([0-9]+\.hostname), but the "Name" field is
cream_497376351 or cr002_563871294, depending on which CE it came from.
This data ends up in the accounting logs (which are shared between all
CEs). When APEL runs on (say) the cream_ CE, it needs to divide the
records by where they came from; it somehow uses BLAH_JOBID_PREFIX to
determine which job came from which CE, and tots them up accordingly. No
"splitting" etc. is required this way. No special incantations.
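A minimal Python sketch of the prefix rule and the per-CE tally follows.
It is not BLAH's or APEL's actual code (their internals are not shown
here); the regex is just a reading of the error message quoted above,
and the function names are hypothetical.

```python
import re
from collections import Counter

# Rule as stated in the error message: 6 characters, starts with "cr",
# ends with "_", and the 3 characters in between are alphanumeric.
PREFIX_RE = re.compile(r"cr[A-Za-z0-9]{3}_")


def is_valid_prefix(prefix):
    """Return True if prefix satisfies the rule quoted in the error."""
    return PREFIX_RE.fullmatch(prefix) is not None


def count_jobs_by_prefix(job_names, prefixes):
    """Count job names per CE prefix (hypothetical helper, not APEL code)."""
    counts = Counter()
    for name in job_names:
        for prefix in prefixes:
            if name.startswith(prefix):
                counts[prefix] += 1
                break
    return counts


if __name__ == "__main__":
    for p in ("cream_", "cr002_", "cream02_"):
        print(p, is_valid_prefix(p))
    names = ["cream_497376351", "cr002_563871294"]
    print(count_jobs_by_prefix(names, ["cream_", "cr002_"]))
```

Under this reading, cream_ and cr002_ pass, while cream02_ fails because
it is 8 characters long.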
> Given that I have 3 CEs (2 LCG, one
> CREAM), all of them running jobs on the same Torque cluster,
> what's the best way to publish this setup with BDII?
Here are a few things I have learned. The original document (which is
stale) is still here:
http://map2.ph.liv.ac.uk/2010/03/22/capacity-publishing-and-accounting/
Background: We have one cluster used by three CEs (lcg-CE and 2 x Cream
CEs). We need to publish the true power of the cluster, and the
reference power for accounting (we don't want to use sub-clustering). We
have a heterogeneous cluster. Each worker node applies some scaling
factor (0..1) to make them all appear as if they have the same power
(for accounting). We use the HEP-SPEC06 power of one core of an E5620,
which we rated at 15.22. All the usage figures that come out of worker
nodes (whatever their type) are corrected to that baseline.
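As a rough illustration of that correction, here is a small Python
sketch. It assumes the scaling factor is simply the node's per-core
HEP-SPEC06 rating divided by the 15.22 reference; that formula, the
function names, and the 7.61 rating in the usage note are assumptions
for illustration, not something taken from our batch configuration.

```python
REFERENCE_HS06_PER_CORE = 15.22  # baseline from the text: one E5620 core


def scaling_factor(node_hs06_per_core):
    """Factor in (0, 1] that maps a node's time onto the reference core.

    Assumption: factor = node power / reference power.
    """
    if not 0 < node_hs06_per_core <= REFERENCE_HS06_PER_CORE:
        raise ValueError("expected a rating in (0, reference]")
    return node_hs06_per_core / REFERENCE_HS06_PER_CORE


def scaled_seconds(raw_seconds, node_hs06_per_core):
    """Time the same work would have taken on a reference core."""
    return raw_seconds * scaling_factor(node_hs06_per_core)
```

For example, an hour (3600 s) on a hypothetical core rated at 7.61 would
be accounted as about 1800 s of reference-core time.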
Yaim Variables:
The configuration at our site contains two definitions like these:
CE_PHYSCPU=142
CE_LOGCPU=568
Also, there is this variable:
CE_OTHERDESCR=Cores=4,Benchmark=14.59-HEP-SPEC06
All our systems have 4 cores. I'm told you can put a real-number average
in here if you have various types. “Benchmark” is another HEP-SPEC06
value for our cores: an estimate of the power of one “average/typical”
logical CPU. To get it, find the HEP-SPEC06 rating for each type of
system in your cluster and multiply it by the number of systems of that
type, then add the totals up. Divide by the CE_LOGCPU value, giving the
average power of a single core. This is the value that goes in the
Benchmark= field.
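The averaging above can be sketched in Python as follows. The inventory
in the usage note is invented purely to show the arithmetic (it is not
our cluster), and the function name is hypothetical.

```python
def average_benchmark(node_types):
    """Return (HEP-SPEC06 per logical CPU, total logical CPUs).

    node_types is a list of tuples:
        (number_of_systems, logical_cpus_per_system, hs06_per_system)
    """
    total_hs06 = sum(n * hs06 for n, _cores, hs06 in node_types)
    total_logcpu = sum(n * cores for n, cores, _hs06 in node_types)
    return total_hs06 / total_logcpu, total_logcpu


if __name__ == "__main__":
    # Made-up example: 10 systems rated 64.0 and 5 systems rated 40.0,
    # all with 4 logical CPUs each.
    benchmark, logcpu = average_benchmark([(10, 4, 64.0), (5, 4, 40.0)])
    print(f"CE_LOGCPU={logcpu}")
    print(f"Benchmark={benchmark:.2f}-HEP-SPEC06")
```

With that made-up inventory the total is 840 HEP-SPEC06 over 60 logical
CPUs, so the Benchmark value would be 14.00.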
I also want to make sure I can calculate the right amount of CPU used by
any particular job, via the accounting logs and the scaled times. The
relevant configuration variables for that are:
CE_SI00=3648
CE_CAPABILITY="CPUScalingReferenceSI00=3805 Share=atlas:63 Share=lhcb:25 glexec"
CE_SI00 is used to publish the physical computing power. To work it
out, take the Benchmark value (14.59) and convert it into “bogoSI00”
(don't ask!) by dividing it by 4 and multiplying the result by 1000,
giving 3648 on our cluster (3647.5, rounded up). This is the physical
power of one core.
Next, to get the accounting right, we need to publish the
CPUScalingReferenceSI00 value in CE_CAPABILITY. This value is used by
APEL to work out how much CPU has been provided to a job. The value is
the reference value (15.22) converted into bogoSI00, giving 3805 for
our cluster.
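The two conversions can be checked with a few lines of Python. This is
only a sketch: the function name is hypothetical, and rounding halves
upward is an assumption chosen because it reproduces the 3648 above.
Decimal is used so that 3647.5 is rounded exactly rather than subject to
floating-point error.

```python
from decimal import Decimal, ROUND_HALF_UP


def hs06_to_bogo_si00(hs06):
    """Convert a HEP-SPEC06 value (passed as a string) to bogoSI00.

    Divide by 4, multiply by 1000, then round to a whole number
    (assumption: halves round up).
    """
    value = Decimal(hs06) / 4 * 1000
    return int(value.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


if __name__ == "__main__":
    print("CE_SI00 =", hs06_to_bogo_si00("14.59"))
    print("CPUScalingReferenceSI00 =", hs06_to_bogo_si00("15.22"))
```

Run as a script, this gives 3648 for the Benchmark value and 3805 for
the reference value, matching the settings shown earlier.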
Finally, we solve the problem of double/triple capacity counting by
setting CE_PHYSCPU=0 and CE_LOGCPU=0 on all CEs but one.
I think that works. No one complains, anyhow. I just hope that helps a bit.
Steve
--
Steve Jones
System Administrator office: 220
High Energy Physics Division tel (int): 42334
Oliver Lodge Laboratory tel (ext): +44 (0)151 794 2334
University of Liverpool http://www.liv.ac.uk/physics/hep/