There are a lot of bdii-fwd processes hanging around ... i am not sure
whether it is significant, but ps is reporting a lot of white space in
the command:

> lcgbdii 20836 0.0 0.1 7980 4440 pts/1 S 13:00 0:00 bdii-fwd [192.16.186.252:37581 <-- 127.0.0.1:2172] \n

i count 94 space characters between the closing bracket and the
end-of-line. The commands all seem to be padded out to 207 characters,
whereas other commands aren't:

selects only bdii-fwd commands:

>>>> (stat,out)=commands.getstatusoutput('ps uaxwwwww | grep bdii-fwd')
>>>> lines=out.split('\n')
>>>> for l in lines:
> ...     print l[:61], len(l)
> ...
> lcgbdii 12951  0.0 0.1  7816  4368 pts/1 S 12:40 0:0 207
> lcgbdii 20797  0.0 0.1  7980  4440 pts/1 S 12:59 0:0 207
> lcgbdii 20836  0.0 0.1  7980  4440 pts/1 S 13:00 0:0 207
> lcgbdii 20849  0.0 0.1  7980  4384 pts/1 S 13:00 0:0 207
> lcgbdii 20853  0.0 0.1  7980  4384 pts/1 S 13:00 0:0 207
> lcgbdii 21420  0.0 0.1  7980  4384 pts/1 S 13:01 0:0 207
> lcgbdii 21425  0.0 0.1  7980  4384 pts/1 S 13:01 0:0 207
> lcgbdii 21430  0.0 0.1  7980  4384 pts/1 S 13:01 0:0 207
> lcgbdii 21437  0.0 0.1  7980  4384 pts/1 S 13:01 0:0 207
> lcgbdii 21447  0.0 0.1  7980  4384 pts/1 S 13:01 0:0 207
> lcgbdii 21455  0.0 0.1  7980  4384 pts/1 S 13:01 0:0 207
> lcgbdii 21494  0.0 0.1  7980  4384 pts/1 S 13:02 0:0 207
> lcgbdii 21495  0.0 0.1  7980  4384 pts/1 S 13:02 0:0 207
> lcgbdii 21498  0.0 0.1  7980  4384 pts/1 S 13:02 0:0 207
> root    24538  0.0 0.0  4196   984 pts/1 S 13:11 0:0 106
> root    24540  0.0 0.0  3680   664 pts/1 S 13:11 0:0  76

selects any lcg command:

>>>> (stat,out)=commands.getstatusoutput('ps uaxwwwww | grep lcg')
>>>> lines=out.split('\n')
>>>> for n in range(12):
> ...     l = lines[n]
> ...     print l[:61], len(l)
> ...
> root     2382  0.0 0.0  4264  1668 ?     S Oct05 1:4 186
> root     2384  0.0 0.0  4260  1660 ?     S Oct05 1:4 192
> root     2422  0.0 0.0  4256  1660 ?     S Oct05 1:4 194
> root     2547  0.0 0.0  4264  1668 ?     S Oct05 1:5 195
> lcgbdii 21817  0.0 0.0  4192  1000 ?     S Oct05 0:0 207
> lcgbdii 21850  0.0 0.0  5544  2156 ?     S Oct05 0:0 155
> lcgbdii 12345 15.7 0.6 24096 20516 pts/1 S 12:40 5:1 130
> lcgbdii 12951  0.0 0.1  7816  4368 pts/1 S 12:40 0:0 207
> lcgbdii 20797  0.0 0.1  7980  4440 pts/1 S 12:59 0:0 207
> lcgbdii 20836  0.0 0.1  7980  4440 pts/1 S 13:00 0:0 207
> lcgbdii 20849  0.0 0.1  7980  4384 pts/1 S 13:00 0:0 207
> lcgbdii 20853  0.0 0.1  7980  4384 pts/1 S 13:00 0:0 207

Jeff Templon wrote:
> Hi
>
> what about these lines in bdii.conf?? the second one looks very suspicious
>
> BDII_UPDATE_URL=http://grid-deployment.web.cern.ch/grid-deployment/gis/lcg2-bdii/dteam/lcg2-all-sites.conf
>
> BDII_UPDATE_LDIF=http://
>
> JT
>
> Jeff Templon wrote:
>
>> Hi
>>
>> very strange; i checked the conf file and the NS is configured to use
>> bosheks (the RB machine itself) as BDII.
>>
>> if i do an ldapsearch to bosheks
>>
>> ldapsearch -h bosheks.nikhef.nl -p 2170 -x -b "o=grid"
>>
>> the command prints a bit of header information and then just stops.
>> command prompt doesn't come back, and neither does the ream of site
>> information. this would certainly cause the WM to slow down.
>>
>> Really strange because I typed it twice, the first time it ran fine,
>> the second time it hung.
>>
>> a /etc/init.d/bdii restart
>>
>> appears not to help much, and the log files don't either ...
>>
>> from bdii-fwd.log:
>>
>>> 20051108_124543 Now forwarding to port 2173 (genNr 3)
>>> 20051108_124543 Reaped process 15186 (genNr 2)
>>> 20051108_124544 [Connect from 192.16.186.252:36107]
>>> 20051108_124544 [Connecting to localhost...done]
>>> 20051108_124544 Forked process 15821 -> 2173
>>> 20051108_124546 Reaped process 15821 (genNr 3)
>>> 20051108_124717 Now forwarding to port 2171 (genNr 4)
>>> 20051108_124717 [Connect from 192.16.186.252:36291]
>>> 20051108_124717 [Connecting to localhost...done]
>>> 20051108_124717 Forked process 16436 -> 2171
>>> 20051108_124722 Reaped process 16436 (genNr 4)
>>> 20051108_124944 Now forwarding to port 2172 (genNr 5)
>>> 20051108_124944 [Connect from 192.16.186.252:36653]
>>> 20051108_124944 [Connecting to localhost...done]
>>> (
>>
>> and from bdii.log
>>
>>> TRIUMF-GC-LCG2: ldap_bind: Can't contact LDAP server
>>> ru-Novgorod-NOVSU-LCG2: ldap_bind: Can't contact LDAP server
>>> obsARMuk: ldap_bind: Can't contact LDAP server
>>> Time for searches: 34 s
>>> Time to sort: 1 s
>>> Time to update DB: 4 s
>>> Grabbing port 2170 for 2173
>>> Tue Nov 8 12:50:29 CET 2005
>>
>> hmm the obsARMuk looks a bit suspicious.
>>
>> I will keep looking, any help appreciated.
>>
>> JT
>>
>> David Smith wrote:
>>
>>> On Tue, 8 Nov 2005, Jeff Templon wrote:
>>>
>>>> our RB (bosheks) has been acting rather strangely today. from the
>>>> WM daemon log file:
>>>
>>> [...]
>>>
>>> Hi Jeff,
>>>
>>> In principle the Timed out message means that there was a very slow
>>> response from the BDII. (The query and response are normally on the 1
>>> second timescale). Which BDII does bosheks use? I notice that the
>>> problem seemed to appear at about 10:39 this morning and seems to have
>>> gone again since about 11:25. Do you know of any changes on the BDII
>>> between those times?
>>>
>>> Yours,
>>> David
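
In the bdii-fwd.log excerpt quoted in this thread, each "Forked process N -> PORT"
line is followed by a "Reaped process N" line. One way to check whether the
lingering bdii-fwd processes are forwarders that were never reaped is to pair
those lines up. The following is only a rough sketch: it assumes the whole log
follows the format of the excerpt, the function name unreaped_forwarders is
made up for illustration, and the sample text is a deliberately shortened copy
of the excerpt (in the real log, process 16436 was reaped).

```python
import re

# Patterns assumed from the bdii-fwd.log excerpt in this thread.
FORKED = re.compile(r"Forked process (\d+) -> (\d+)")
REAPED = re.compile(r"Reaped process (\d+)")


def unreaped_forwarders(log_text):
    """Return {pid: port} for forked processes with no later 'Reaped' line."""
    alive = {}
    for line in log_text.splitlines():
        m = FORKED.search(line)
        if m:
            alive[int(m.group(1))] = int(m.group(2))
            continue
        m = REAPED.search(line)
        if m:
            # A reap for a pid forked before the log window is simply ignored.
            alive.pop(int(m.group(1)), None)
    return alive


# Shortened illustration only: the final "Reaped" line is cut off on purpose.
sample = """\
20051108_124543 Reaped process 15186 (genNr 2)
20051108_124544 Forked process 15821 -> 2173
20051108_124546 Reaped process 15821 (genNr 3)
20051108_124717 Forked process 16436 -> 2171
"""

print(unreaped_forwarders(sample))  # {16436: 2171}
```

Any PIDs reported this way could then be compared against the PIDs in the ps
listing to see whether the 207-character bdii-fwd entries are unreaped children.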