Dear All
Sick: lcgbdii, an SL5.3 64-bit VM site-bdii that keeps hanging/freezing.
Has latest kernel & bdii software:
bdii-5.0.6-1.noarch
glite-BDII-3.2.8-0.sl5.x86_64
glite-yaim-bdii-4.0.5-1.noarch
DEBUG is on; the bdii hangs every few minutes; the last thing logged
before hang is always "Modify New Entries":
2010-04-30 14:47:07,734: [DEBUG] Sorting Add Keys
2010-04-30 14:47:07,734: [DEBUG] Writing ldif_add to disk
2010-04-30 14:47:07,734: [DEBUG] Adding New Entries
2010-04-30 14:47:07,750: [DEBUG] Logging Errors
2010-04-30 14:47:07,751: [WARNING] dn: o=glue
2010-04-30 14:47:07,751: [WARNING] ldapadd: Already exists (68)
2010-04-30 14:47:07,751: [WARNING] dn: glue2groupid=resource,o=glue
2010-04-30 14:47:07,751: [WARNING] ldapadd: Already exists (68)
2010-04-30 14:47:07,751: [DEBUG] Writing ldif_modify to disk
2010-04-30 14:47:07,752: [DEBUG] Modify New Entries
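A rough sketch for checking the "always the same last line" claim across the whole log rather than by eye. It assumes every line starts with a timestamp in the format shown above; the 60-second threshold and the function names are made up for illustration.

```python
from datetime import datetime

def parse_line(line):
    """Return (timestamp, message) for one log line, or None if it does not parse."""
    try:
        stamp, rest = line.split(": ", 1)
        return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f"), rest.strip()
    except ValueError:
        return None

def find_stalls(lines, min_gap_seconds=60.0):
    """List (last message before the gap, gap in seconds) for every long silence."""
    stalls = []
    previous = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip anything that is not a timestamped log line
        if previous is not None:
            gap = (parsed[0] - previous[0]).total_seconds()
            if gap >= min_gap_seconds:
                stalls.append((previous[1], gap))
        previous = parsed
    return stalls

# Usage (the path is whatever file bdii-update logs to on your host):
#   with open(path_to_bdii_update_log) as handle:
#       for message, gap in find_stalls(handle):
#           print(gap, message)
```

If the hang really always follows "Modify New Entries", every reported message should be that line.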
At that point the host is running slapd and bdii-update; bdii-update has
spawned an ldapmodify, which has spawned another ldapmodify. strace on that
process shows:
Process 25580 attached - interrupt to quit
write(2, "ldap_result: Can't contact LDAP "..., 44) = 44
write(3, "0\202\1\f\2\1Hf\202\1\5\4\201\203gluevoviewlocalid="..., 272) = 272
poll([{fd=3, events=POLLIN|POLLPRI|POLLERR|POLLHUP}], 1, 100) = 0 (Timeout)
poll([{fd=3, events=POLLIN|POLLPRI|POLLERR|POLLHUP}], 1, 100) = 0 (Timeout)
and top shows slapd has gone mad:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
25329 edguser 15 0 723m 80m 74m S 82.9 2.7 2:31.90 slapd
netstat shows 102 connections in CLOSE_WAIT. Is that a lot? Is that what's
causing it to freeze/hang?
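For reference, CLOSE_WAIT means the remote end has closed the connection and the local process has not yet closed its own socket, so a growing pile of them on the slapd port would be consistent with slapd being stuck rather than being the cause by itself. A small sketch for tracking the count over time follows. It assumes the usual six-column "netstat -tan" layout and that slapd listens on port 2170 (the usual BDII port; adjust if yours differs); the function name is made up.

```python
from collections import Counter

def count_tcp_states(netstat_output, local_port=2170):
    """Count TCP connection states for sockets bound to local_port.

    Expects text in the layout printed by `netstat -tan`:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State
    """
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue  # header lines and non-TCP sockets
        local_address, state = fields[3], fields[5]
        if local_address.rsplit(":", 1)[-1] == str(local_port):
            states[state] += 1
    return states

# Usage:
#   import subprocess
#   text = subprocess.check_output(["netstat", "-tan"]).decode()
#   print(count_tcp_states(text))
```

Logging this once a minute alongside the bdii-update log would show whether CLOSE_WAIT climbs before each hang or only afterwards.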
We also have testbdii, another SL5.3 VM, on a completely different VM host.
Interestingly, when lcgbdii is disabled (service bdii stop;
service network stop;) and testbdii takes over lcgbdii's name+IP,
it shows exactly the same behaviour: freeze/hang after a few minutes.
Changed back & forth a few times - consistent. Whichever VM is acting as
lcgbdii freezes/hangs after a few minutes.
So it's not that particular VM, since the other one does it too.
(Current kludge for the hang: a cron job runs "service bdii restart" on
lcgbdii every 3 min; SAM tests still fail now & then and the site
"vanishes"... bad.)
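A slightly less blunt variant of the kludge would be to restart only when a query actually fails or times out, so a healthy bdii is not bounced every 3 minutes. The sketch below is untested against a real bdii and rests on assumptions: the probe command (an anonymous base search on o=grid at port 2170), the 10-second timeout and the function names are illustrative only, and it needs Python 3, which SL5 does not ship by default.

```python
import subprocess
import sys

def probe_ok(command, timeout_seconds=10):
    """Return True if command exits with status 0 within timeout_seconds."""
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,  # the child is killed if it runs longer
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0

def watchdog(probe_command, restart_command, timeout_seconds=10):
    """Run restart_command only if the probe fails; return True if restarted."""
    if probe_ok(probe_command, timeout_seconds):
        return False
    sys.stderr.write("probe failed or timed out, restarting\n")
    subprocess.call(restart_command)
    return True

# Usage from cron (probe arguments are an assumption, adjust to your setup):
#   watchdog(
#       ["ldapsearch", "-x", "-LLL", "-H", "ldap://localhost:2170",
#        "-b", "o=grid", "-s", "base"],
#       ["service", "bdii", "restart"],
#   )
```

This does not fix anything; it only narrows the window in which the site "vanishes" and records when the hangs actually happen.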
Any ideas for debugging this? One suggestion was that, if the cause was VM
disk performance, we should mount it over NFS back to the VM host. That's
been done; no difference.
Next: give up on VMs & buy hardware? But if a 2nd VM also hangs when it
takes on the lcgbdii name+IP, maybe the problem is not the VM per se. But what?
Other sites run SL5 VM site-bdiis with 100% success, 0 problems.
Rollout's advice was to use RPMs that are not yet in production:
> Laurence Field proposes you try upgrading openldap as done at CERN:
> https://koji.afroditi.hellasgrid.gr/koji/buildinfo?buildID=348
>
> To make use of the new openldap you also need to upgrade the bdii rpm:
> https://savannah.cern.ch/patch/?3888
I'm uneasy about using non-production-ready software.
Advice welcome.