On Fri, 28 Jan 2005, Maarten Litmaath wrote:

> It turns out that both RBs were using the same BDII, which got overloaded
> due to a large number of job submissions with complex JDLs each requiring
> a lot of BDII searches. The upshot is that there is a backlog of hours...
> We will assign each RB a private BDII node and let you know ASAP.

Hello,

Part of the BDII configuration is a 'cache' size for the slapd process. It
turns out that the default list of sites has grown to the point where it just
exceeds that limit. As a result, each query to the BDII now takes
significantly longer than before.

The number of queries that the broker makes to the BDII per job depends on
whether the job's JDL includes any attributes for which the broker has to
query the 'GlueSubCluster' record, e.g.
'GlueHostApplicationSoftwareRunTimeEnvironment' or 'GlueHostBenchmarkSI00'.
If the SubCluster record is needed, the number of queries is large: one for
each sub cluster (although it may be worth noting that the queries are all
done on one connection). Combined with the slower response, this meant that
some jobs were taking several minutes to match.

We've just increased the limits here, and the backlog on lxn1177 & lxn1188 is
now clearing. We estimate it will take about 2 hours for the backlog on
lxn1188 to fully clear.

Everyone who runs a BDII is advised to reset the values to:

    cachesize 30000
    dbcachesize 30000000

in the configuration files:

    /opt/lcg/etc/lcg-bdii-read-slapd.conf
    /opt/lcg/etc/lcg-bdii-write-slapd.conf

Of course, the default values in the distribution will be changed for the
future.

Yours,
David

--
-------------------------------------------------------------------------
David Smith                                       tel: +41 22 76 74462
Address: D. Smith, CERN G06610, Bat 28 R-007, 1211 Geneva 23, Switzerland
-------------------------------------------------------------------------