There are a lot of bdii-fwd processes hanging around ... I am not sure
whether it is significant, but ps is reporting a lot of whitespace at
the end of the command:
> lcgbdii 20836 0.0 0.1 7980 4440 pts/1 S 13:00 0:00 bdii-fwd [192.16.186.252:37581 <-- 127.0.0.1:2172] \n
I count 94 space characters between the closing bracket and the
end-of-line. The bdii-fwd command lines all seem to be padded out to 207
characters, whereas other commands aren't.
This selects only the bdii-fwd commands:
>>>> (stat,out)=commands.getstatusoutput('ps uaxwwwww | grep bdii-fwd')
>>>> lines=out.split('\n')
>>>> for l in lines:
> ... print l[:61], len(l)
> ...
> lcgbdii 12951 0.0 0.1 7816 4368 pts/1 S 12:40 0:0 207
> lcgbdii 20797 0.0 0.1 7980 4440 pts/1 S 12:59 0:0 207
> lcgbdii 20836 0.0 0.1 7980 4440 pts/1 S 13:00 0:0 207
> lcgbdii 20849 0.0 0.1 7980 4384 pts/1 S 13:00 0:0 207
> lcgbdii 20853 0.0 0.1 7980 4384 pts/1 S 13:00 0:0 207
> lcgbdii 21420 0.0 0.1 7980 4384 pts/1 S 13:01 0:0 207
> lcgbdii 21425 0.0 0.1 7980 4384 pts/1 S 13:01 0:0 207
> lcgbdii 21430 0.0 0.1 7980 4384 pts/1 S 13:01 0:0 207
> lcgbdii 21437 0.0 0.1 7980 4384 pts/1 S 13:01 0:0 207
> lcgbdii 21447 0.0 0.1 7980 4384 pts/1 S 13:01 0:0 207
> lcgbdii 21455 0.0 0.1 7980 4384 pts/1 S 13:01 0:0 207
> lcgbdii 21494 0.0 0.1 7980 4384 pts/1 S 13:02 0:0 207
> lcgbdii 21495 0.0 0.1 7980 4384 pts/1 S 13:02 0:0 207
> lcgbdii 21498 0.0 0.1 7980 4384 pts/1 S 13:02 0:0 207
> root 24538 0.0 0.0 4196 984 pts/1 S 13:11 0:0 106
> root 24540 0.0 0.0 3680 664 pts/1 S 13:11 0:0 76
This selects any lcg command (first 12 lines only):
>>>> (stat,out)=commands.getstatusoutput('ps uaxwwwww | grep lcg')
>>>> lines=out.split('\n')
>>>> for n in range(12):
> ... l = lines[n]
> ... print l[:61], len(l)
> ...
> root 2382 0.0 0.0 4264 1668 ? S Oct05 1:4 186
> root 2384 0.0 0.0 4260 1660 ? S Oct05 1:4 192
> root 2422 0.0 0.0 4256 1660 ? S Oct05 1:4 194
> root 2547 0.0 0.0 4264 1668 ? S Oct05 1:5 195
> lcgbdii 21817 0.0 0.0 4192 1000 ? S Oct05 0:0 207
> lcgbdii 21850 0.0 0.0 5544 2156 ? S Oct05 0:0 155
> lcgbdii 12345 15.7 0.6 24096 20516 pts/1 S 12:40 5:1 130
> lcgbdii 12951 0.0 0.1 7816 4368 pts/1 S 12:40 0:0 207
> lcgbdii 20797 0.0 0.1 7980 4440 pts/1 S 12:59 0:0 207
> lcgbdii 20836 0.0 0.1 7980 4440 pts/1 S 13:00 0:0 207
> lcgbdii 20849 0.0 0.1 7980 4384 pts/1 S 13:00 0:0 207
> lcgbdii 20853 0.0 0.1 7980 4384 pts/1 S 13:00 0:0 207
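
For anyone repeating this check on a newer Python: the commands module used above exists only in Python 2. Below is a minimal sketch of the same measurement, assuming Python 3 and assuming that the trailing-space count per line is the interesting number. The names padding_report and live_report are made up for illustration.

```python
import subprocess


def padding_report(ps_output):
    """Return (total_length, trailing_spaces, text) for each line of ps output."""
    report = []
    for line in ps_output.split("\n"):
        stripped = line.rstrip(" ")
        # Trailing padding is the difference between raw and stripped length.
        report.append((len(line), len(line) - len(stripped), stripped))
    return report


def live_report(pattern="bdii-fwd"):
    """Run ps with the same flags as above and print padding for matching lines."""
    out = subprocess.run(
        ["ps", "uaxwwwww"], capture_output=True, text=True
    ).stdout
    for total, pad, text in padding_report(out):
        if pattern in text:
            print(total, pad, text[:61])
```

Calling live_report() on the affected host should show directly whether the 207-character lines are mostly padding, without counting spaces by hand.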
Jeff Templon wrote:
> Hi
>
> what about these lines in bdii.conf?? the second one looks very suspicious
>
> BDII_UPDATE_URL=http://grid-deployment.web.cern.ch/grid-deployment/gis/lcg2-bdii/dteam/lcg2-all-sites.conf
>
> BDII_UPDATE_LDIF=http://
>
> JT
>
> Jeff Templon wrote:
>
>> Hi
>>
>> very strange; i checked the conf file and the NS is configured to use
>> bosheks (the RB machine itself) as BDII.
>>
>> if i do an ldapsearch to bosheks
>>
>> ldapsearch -h bosheks.nikhef.nl -p 2170 -x -b "o=grid"
>>
>> the command prints a bit of header information and then just stops.
>> command prompt doesn't come back, and neither does the ream of site
>> information. this would certainly cause the WM to slow down.
>>
>> Really strange because I typed it twice, the first time it ran fine,
>> the second time it hung.
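
To catch the hang without losing the terminal, the same ldapsearch could be wrapped in a timeout. This is a sketch only, assuming Python 3; run_with_timeout is a hypothetical helper and the 10 second limit is an arbitrary choice, not a recommended value.

```python
import subprocess

# Same query as typed by hand above.
LDAPSEARCH = ["ldapsearch", "-h", "bosheks.nikhef.nl", "-p", "2170",
              "-x", "-b", "o=grid"]


def run_with_timeout(cmd, timeout_s=10.0):
    """Run cmd and return (finished, stdout).

    finished is False when the command had to be killed after timeout_s
    seconds; stdout then holds whatever was captured before the kill.
    """
    try:
        done = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired as exc:
        partial = exc.stdout or ""
        # Depending on the Python version the partial output may be bytes.
        if isinstance(partial, bytes):
            partial = partial.decode(errors="replace")
        return False, partial
    return True, done.stdout
```

Running run_with_timeout(LDAPSEARCH) repeatedly would show how often the query hangs, and the partial output would show whether the hang always happens after the header.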
>>
>> a /etc/init.d/bdii restart
>>
>> appears not to help much, and the log files don't either ...
>>
>> from bdii-fwd.log:
>>
>>> 20051108_124543 Now forwarding to port 2173 (genNr 3)
>>> 20051108_124543 Reaped process 15186 (genNr 2)
>>> 20051108_124544 [Connect from 192.16.186.252:36107]
>>> 20051108_124544 [Connecting to localhost...done]
>>> 20051108_124544 Forked process 15821 -> 2173
>>> 20051108_124546 Reaped process 15821 (genNr 3)
>>> 20051108_124717 Now forwarding to port 2171 (genNr 4)
>>> 20051108_124717 [Connect from 192.16.186.252:36291]
>>> 20051108_124717 [Connecting to localhost...done]
>>> 20051108_124717 Forked process 16436 -> 2171
>>> 20051108_124722 Reaped process 16436 (genNr 4)
>>> 20051108_124944 Now forwarding to port 2172 (genNr 5)
>>> 20051108_124944 [Connect from 192.16.186.252:36653]
>>> 20051108_124944 [Connecting to localhost...done]
>>> (
>>
>>
>>
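
The excerpt above pairs each "Forked process N" line with a later "Reaped process N" line. Here is a rough sketch, assuming the log keeps exactly that wording, for listing forked PIDs that never show up as reaped (these might correspond to lingering bdii-fwd processes, but that is a guess). unreaped_pids is a made-up helper name.

```python
import re

# Matches e.g. "20051108_124544 Forked process 15821 -> 2173"
#          and "20051108_124546 Reaped process 15821 (genNr 3)"
EVENT = re.compile(r"^(\S+)\s+(Forked|Reaped) process (\d+)")


def unreaped_pids(log_lines):
    """Return PIDs that were forked but not reaped afterwards, in fork order."""
    outstanding = {}
    for line in log_lines:
        match = EVENT.match(line.strip())
        if not match:
            continue  # connection messages, port changes, truncated lines
        timestamp, kind, pid = match.group(1), match.group(2), int(match.group(3))
        if kind == "Forked":
            outstanding[pid] = timestamp
        else:
            # A reap with no matching fork in the excerpt is simply ignored.
            outstanding.pop(pid, None)
    return list(outstanding)
```

Feeding it the whole log, for example unreaped_pids(open("bdii-fwd.log")), gives a list that can be compared against the PIDs that ps reports.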
>> and from bdii.log
>>
>>> TRIUMF-GC-LCG2: ldap_bind: Can't contact LDAP server
>>> ru-Novgorod-NOVSU-LCG2: ldap_bind: Can't contact LDAP server
>>> obsARMuk: ldap_bind: Can't contact LDAP server
>>> Time for searches: 34 s
>>> Time to sort: 1 s
>>> Time to update DB: 4 s
>>> Grabbing port 2170 for 2173
>>> Tue Nov 8 12:50:29 CET 2005
>>
>>
>>
>> hmm the obsARMuk looks a bit suspicious.
>>
>> I will keep looking, any help appreciated.
>>
>> JT
>>
>> David Smith wrote:
>>
>>> On Tue, 8 Nov 2005, Jeff Templon wrote:
>>>
>>>
>>>> our RB (bosheks) has been acting rather strangely today. from the
>>>> WM daemon log file:
>>>
>>>
>>>
>>> [...]
>>>
>>> Hi Jeff,
>>>
>>> In principle the Timed out message means that there was a very slow
>>> response from the BDII. (The query and response are normally on the 1
>>> second timescale.) Which BDII does bosheks use? I notice that the
>>> problem seemed to appear at about 10:39 this morning and seems to have
>>> gone again since about 11:25. Do you know of any changes on the BDII
>>> between those times?
>>>
>>> Yours,
>>> David
>>>