Dear List,
for quite a long time now we have been suffering from seemingly random,
intermittent replica-management (rm) failures in sft2 here in Budapest. A few
weeks ago it was announced that one of the CERN SEs used in sft2 had proven
unstable, and that this had probably been the cause of the rm failures. The
situation improved somewhat for a short time, but the failures are occurring
again. Frederic Schaer from CIC On Duty recently suggested (thanks, Frederic)
that the rm failures might be caused by an unstable BDII. I have now looked
into my BDII's update cycle and found the following strange symptoms:
Installed software: bdii-3.3.7-1_sl3, lcg-yaim-2.6.0-7
The BDII has a default configuration as generated by yaim.
The BDII lifecycle looks like this (one full cycle, times in seconds):
0-15: about 160 zombie bdii-update processes pop up gradually, CPU usage is
somewhat higher than usual, but still below 40%
15-30: these zombie processes hang around, CPU low (<10%)
30-35: the zombie processes gradually disappear, CPU still low
35-60: the original bdii-update process hogs the CPU at 99.9%
60-120: no zombies, the original bdii-update sleeps, CPU low again
I gathered this from multiple 'measurements' with ps and top, so the times
are approximate, of course. After the 1-minute pause, the whole cycle
starts anew.
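
In case somebody wants to compare with their own BDII, the sampling could be scripted roughly as below. This is only a sketch, not the exact commands I used: the function names count_zombies and sample_zombies are made up for illustration, and it assumes that the zombies show up in ps with state Z and the command name bdii-update.

```shell
#!/bin/sh
# Count zombie processes (state starting with Z) whose command name
# starts with the given prefix. Expects "STAT COMMAND" lines on stdin,
# e.g. from: ps -eo stat=,comm=
count_zombies() {
    awk -v name="$1" '$1 ~ /^Z/ && index($2, name) == 1 { n++ } END { print n + 0 }'
}

# Print "<time> <zombie count>" once per interval.
# $1 = number of samples, $2 = seconds between samples.
sample_zombies() {
    samples=$1
    interval=$2
    i=0
    while [ "$i" -lt "$samples" ]; do
        n=$(ps -eo stat=,comm= | count_zombies bdii-update)
        echo "$(date +%H:%M:%S) $n"
        sleep "$interval"
        i=$((i + 1))
    done
}
```

Calling sample_zombies 120 1 would cover roughly one full cycle as estimated above; the CPU figures still have to be read from top separately.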
The problem:
During these tests I also ran an ldapsearch query against the BDII from a
machine on the same LAN with the following parameters:
$ ldapsearch -xL -s one -l 15 -h BDII_HOST -p 2170 \
    -b 'mds-vo-name=local,o=grid' 'modifyTimestamp'
These are the parameters used by the GSTAT BDII-test, as described on
http://goc.grid.sinica.edu.tw/gstat/filter_help.html#BDIINode_Perf
I ran the queries during all the above described phases with the following
results:
- during the 'zombie-popup' phase, the query is slower than usual, but
completes successfully
- during the 'CPU-hog' phase (bdii-update with 99.9% usage) the queries
are very slow and time out before finishing
- otherwise the queries complete instantaneously
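
To line the query results up with the update cycle, the query can be wrapped in a small timing helper. Again this is only a sketch: timed_run and probe_bdii are names I made up, and I am assuming that ldapsearch returns a non-zero exit status when the query fails or hits the 15-second limit.

```shell
#!/bin/sh
# Run a command, discard its output, and print
# "<start epoch> <elapsed seconds> <exit status>".
timed_run() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1 && status=0 || status=$?
    end=$(date +%s)
    echo "$start $((end - start)) $status"
}

# Hypothetical wrapper: the same query as above against host $1.
probe_bdii() {
    timed_run ldapsearch -xL -s one -l 15 -h "$1" -p 2170 \
        -b 'mds-vo-name=local,o=grid' 'modifyTimestamp'
}
```

Running probe_bdii BDII_HOST in a loop every few seconds gives one line per query, and the slow or failed ones can then be matched by timestamp against the phases listed above.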
I guess my problem is caused by the second case: during this timeframe the
machine is so unresponsive that the queries might not complete at all. This
is probably when I get the occasional BDII errors in GSTAT, and possibly the
replica-management errors in sft2.
My questions:
- is this behaviour considered normal? Do others also see the large number
of zombie bdii-update processes and the close to 100% CPU usage
afterwards?
- can it really cause my sft-rm failures?
As a side note, I should mention that the BDII machine is an Athlon 1600XP
with 512 MB of RAM. Since no swap is in use right now, I suspect that memory
is not the problem. Any hints are appreciated,
Cheers
Szabolcs