On Fri, 28 Jan 2005, Maarten Litmaath wrote:

> It turns out that both RBs were using the same BDII, which got overloaded
> due to a large number of job submissions with complex JDLs each requiring
> a lot of BDII searches.  The upshot is that there is a backlog of hours...
> We will assign each RB a private BDII node and let you know ASAP.

Hello,

Part of the BDII configuration is a 'cache' size for the slapd process.
It turns out that the default list of sites has grown to the point where
it just exceeds that limit. As a result, each query to the BDII suddenly
takes significantly longer than before.

The number of queries that the broker makes to the BDII per job depends on
whether the job's JDL includes any attributes for which the broker has to
query the 'GlueSubCluster' record, e.g.
'GlueHostApplicationSoftwareRunTimeEnvironment' or
'GlueHostBenchmarkSI00'. If the SubCluster record is needed, the number of
queries is large: one for each sub cluster (although it may be worth
noting that the queries are all done on one connection). Together with the
slower responses, this meant that some jobs were taking several minutes to
match.
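
To get a rough feel for how this adds up, here is a minimal Python sketch.
It is only an illustration, not how the broker is implemented: the function
names are made up, the attribute list contains just the two examples named
above (it is not complete), and the sub cluster count and per-query time
are numbers you would have to supply yourself.

```python
# Attributes named in this message as needing a GlueSubCluster lookup.
# Illustrative only; this is not a complete list.
SUBCLUSTER_ATTRIBUTES = (
    "GlueHostApplicationSoftwareRunTimeEnvironment",
    "GlueHostBenchmarkSI00",
)


def needs_subcluster_queries(jdl_text):
    """Return True if the JDL text mentions one of the listed attributes."""
    return any(attr in jdl_text for attr in SUBCLUSTER_ATTRIBUTES)


def estimated_match_seconds(jdl_text, n_subclusters, seconds_per_query):
    """Rough estimate of the time spent on sub cluster queries for one job.

    Assumes one query per sub cluster when the SubCluster record is
    needed, and counts nothing otherwise.
    """
    if not needs_subcluster_queries(jdl_text):
        return 0.0
    return n_subclusters * seconds_per_query
```

With a few hundred sub clusters, even a modest slowdown per query is
enough to push the matching time for such a job into minutes.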

We've just increased the limits here and the backlog on lxn1177 & lxn1188
is now clearing. We estimate it will take about 2 hours for the backlog on
lxn1188 to fully clear.

Everyone who runs a BDII is advised to reset the values to:

cachesize 30000
dbcachesize 30000000

in the configuration files /opt/lcg/etc/lcg-bdii-read-slapd.conf &
/opt/lcg/etc/lcg-bdii-write-slapd.conf. Of course the default values in
the distribution will be changed in future releases.
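
If you want to check a node quickly, something along these lines should
do. It is a minimal sketch, not an official tool: the function names are
made up, and it assumes each directive sits on its own line as
"name value" and treats a missing directive as being below the
recommended value.

```python
# Values recommended in this message.
RECOMMENDED = {"cachesize": 30000, "dbcachesize": 30000000}

# The two configuration files named in this message.
CONF_FILES = (
    "/opt/lcg/etc/lcg-bdii-read-slapd.conf",
    "/opt/lcg/etc/lcg-bdii-write-slapd.conf",
)


def read_cache_settings(conf_text):
    """Collect the cachesize/dbcachesize values found in slapd.conf text."""
    found = {}
    for line in conf_text.splitlines():
        parts = line.split()
        # Commented-out lines start with '#', so they never match here.
        if len(parts) >= 2 and parts[0] in RECOMMENDED:
            try:
                found[parts[0]] = int(parts[1])
            except ValueError:
                pass  # ignore values that are not plain integers
    return found


def settings_below_recommended(conf_text):
    """Return the directive names that are missing or set too low."""
    found = read_cache_settings(conf_text)
    return sorted(
        name for name, want in RECOMMENDED.items()
        if found.get(name, 0) < want
    )


if __name__ == "__main__":
    for path in CONF_FILES:
        try:
            with open(path) as handle:
                low = settings_below_recommended(handle.read())
        except OSError as err:
            print("%s: cannot read (%s)" % (path, err))
            continue
        print("%s: %s" % (path, ", ".join(low) if low else "ok"))
```

Run on the BDII node, it prints "ok" for each file whose two values are at
least the ones above, and otherwise lists the directives to raise.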

Yours,
David

--
-------------------------------------------------------------------------
David Smith       e-mail: [log in to unmask]        tel: +41 22 76 74462
Address: D. Smith, CERN G06610, Bat 28 R-007, 1211 Geneva 23, Switzerland
-------------------------------------------------------------------------