There are a lot of bdii-fwd processes hanging around. I am not sure 
whether it is significant, but ps is reporting a lot of white space in 
the command:

> lcgbdii  20836  0.0  0.1  7980 4440 pts/1    S    13:00   0:00 bdii-fwd [192.16.186.252:37581 <-- 127.0.0.1:2172]                                                                                              \n

I count 94 space characters between the closing bracket and the 
end-of-line.  The bdii-fwd commands all seem to be padded out to 207 
characters, whereas other commands aren't:

selects only bdii-fwd commands:

> >>> (stat,out)=commands.getstatusoutput('ps uaxwwwww | grep bdii-fwd')
> >>> lines=out.split('\n')
> >>> for l in lines:
> ...     print l[:61], len(l)
> ... 
> lcgbdii  12951  0.0  0.1  7816 4368 pts/1    S    12:40   0:0 207
> lcgbdii  20797  0.0  0.1  7980 4440 pts/1    S    12:59   0:0 207
> lcgbdii  20836  0.0  0.1  7980 4440 pts/1    S    13:00   0:0 207
> lcgbdii  20849  0.0  0.1  7980 4384 pts/1    S    13:00   0:0 207
> lcgbdii  20853  0.0  0.1  7980 4384 pts/1    S    13:00   0:0 207
> lcgbdii  21420  0.0  0.1  7980 4384 pts/1    S    13:01   0:0 207
> lcgbdii  21425  0.0  0.1  7980 4384 pts/1    S    13:01   0:0 207
> lcgbdii  21430  0.0  0.1  7980 4384 pts/1    S    13:01   0:0 207
> lcgbdii  21437  0.0  0.1  7980 4384 pts/1    S    13:01   0:0 207
> lcgbdii  21447  0.0  0.1  7980 4384 pts/1    S    13:01   0:0 207
> lcgbdii  21455  0.0  0.1  7980 4384 pts/1    S    13:01   0:0 207
> lcgbdii  21494  0.0  0.1  7980 4384 pts/1    S    13:02   0:0 207
> lcgbdii  21495  0.0  0.1  7980 4384 pts/1    S    13:02   0:0 207
> lcgbdii  21498  0.0  0.1  7980 4384 pts/1    S    13:02   0:0 207
> root     24538  0.0  0.0  4196  984 pts/1    S    13:11   0:0 106
> root     24540  0.0  0.0  3680  664 pts/1    S    13:11   0:0 76
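
For anyone repeating this with a current Python (the commands module used above exists only in Python 2), here is a minimal sketch of the same measurement. The helper names trailing_padding and padding_report are my own, not part of any BDII tool, and the code only counts the spaces at the end of each line; it says nothing about why they are there.

```python
import subprocess


def trailing_padding(line):
    """Return the number of space characters at the end of a line."""
    return len(line) - len(line.rstrip(" "))


def padding_report(ps_output, needle="bdii-fwd"):
    """Return (pid, total length, trailing spaces) for each matching ps line."""
    report = []
    for line in ps_output.split("\n"):
        if needle not in line:
            continue
        fields = line.split()
        if len(fields) < 2:
            # Not a normal ps line (user, pid, ...); skip it.
            continue
        report.append((fields[1], len(line), trailing_padding(line)))
    return report


if __name__ == "__main__":
    # Same ps invocation as in the session above; run it on the BDII host.
    status, out = subprocess.getstatusoutput("ps uaxwwwww")
    for pid, length, pad in padding_report(out):
        print(pid, length, pad)
```

Filtering in Python rather than through grep keeps the grep process itself out of the listing.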

selects any lcg command:

> >>> (stat,out)=commands.getstatusoutput('ps uaxwwwww | grep lcg')
> >>> lines=out.split('\n')
> >>> for n in range(12):
> ...     l = lines[n]
> ...     print l[:61], len(l)
> ... 
> root      2382  0.0  0.0  4264 1668 ?        S    Oct05   1:4 186
> root      2384  0.0  0.0  4260 1660 ?        S    Oct05   1:4 192
> root      2422  0.0  0.0  4256 1660 ?        S    Oct05   1:4 194
> root      2547  0.0  0.0  4264 1668 ?        S    Oct05   1:5 195
> lcgbdii  21817  0.0  0.0  4192 1000 ?        S    Oct05   0:0 207
> lcgbdii  21850  0.0  0.0  5544 2156 ?        S    Oct05   0:0 155
> lcgbdii  12345 15.7  0.6 24096 20516 pts/1   S    12:40   5:1 130
> lcgbdii  12951  0.0  0.1  7816 4368 pts/1    S    12:40   0:0 207
> lcgbdii  20797  0.0  0.1  7980 4440 pts/1    S    12:59   0:0 207
> lcgbdii  20836  0.0  0.1  7980 4440 pts/1    S    13:00   0:0 207
> lcgbdii  20849  0.0  0.1  7980 4384 pts/1    S    13:00   0:0 207
> lcgbdii  20853  0.0  0.1  7980 4384 pts/1    S    13:00   0:0 207  
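
To see whether the lingering bdii-fwd processes were ever cleaned up, one option is to pair the "Forked process" and "Reaped process" lines in bdii-fwd.log (an excerpt is quoted further down). This is only a sketch: the function name unreaped_pids is hypothetical, the line format is assumed from that excerpt, and PID reuse within one log is ignored.

```python
import re

# Patterns assumed from the bdii-fwd.log excerpt quoted in this thread.
FORKED = re.compile(r"Forked process (\d+)")
REAPED = re.compile(r"Reaped process (\d+)")


def unreaped_pids(log_lines):
    """Return PIDs that were forked but never reaped, in fork order."""
    pending = []
    for line in log_lines:
        forked = FORKED.search(line)
        if forked:
            pending.append(forked.group(1))
            continue
        reaped = REAPED.search(line)
        if reaped and reaped.group(1) in pending:
            pending.remove(reaped.group(1))
    return pending
```

Any PID this returns that also shows up in the ps listing above would be a forwarder that is still waiting on its connection.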



Jeff Templon wrote:
> Hi
> 
> what about these lines in bdii.conf?? the second one looks very suspicious
> 
> BDII_UPDATE_URL=http://grid-deployment.web.cern.ch/grid-deployment/gis/lcg2-bdii/dteam/lcg2-all-sites.conf 
> 
> BDII_UPDATE_LDIF=http://
> 
>                 JT
> 
> Jeff Templon wrote:
> 
>> Hi
>>
>> very strange; i checked the conf file and the NS is configured to use 
>> bosheks (the RB machine itself) as BDII.
>>
>> if i do an ldapsearch to bosheks
>>
>>    ldapsearch -h bosheks.nikhef.nl -p 2170 -x -b "o=grid"
>>
>> the command prints a bit of header information and then just stops. 
>> command prompt doesn't come back, and neither does the ream of site 
>> information.  this would certainly cause the WM to slow down.
>>
>> Really strange because I typed it twice, the first time it ran fine, 
>> the second time it hung.
>>
>> a /etc/init.d/bdii restart
>>
>> appears not to help much, and the log files don't either ...
>>
>> from bdii-fwd.log:
>>
>>> 20051108_124543 Now forwarding to port 2173 (genNr 3)
>>> 20051108_124543 Reaped process 15186 (genNr 2)
>>> 20051108_124544 [Connect from 192.16.186.252:36107]
>>> 20051108_124544 [Connecting to localhost...done]
>>> 20051108_124544 Forked process 15821 -> 2173
>>> 20051108_124546 Reaped process 15821 (genNr 3)
>>> 20051108_124717 Now forwarding to port 2171 (genNr 4)
>>> 20051108_124717 [Connect from 192.16.186.252:36291]
>>> 20051108_124717 [Connecting to localhost...done]
>>> 20051108_124717 Forked process 16436 -> 2171
>>> 20051108_124722 Reaped process 16436 (genNr 4)
>>> 20051108_124944 Now forwarding to port 2172 (genNr 5)
>>> 20051108_124944 [Connect from 192.16.186.252:36653]
>>> 20051108_124944 [Connecting to localhost...done]
>>> (
>>
>>
>>
>> and from bdii.log
>>
>>> TRIUMF-GC-LCG2: ldap_bind: Can't contact LDAP server
>>> ru-Novgorod-NOVSU-LCG2: ldap_bind: Can't contact LDAP server
>>> obsARMuk: ldap_bind: Can't contact LDAP server
>>> Time for searches: 34 s
>>> Time to sort: 1 s
>>> Time to update DB: 4 s
>>> Grabbing port 2170 for 2173
>>> Tue Nov  8 12:50:29 CET 2005 
>>
>>
>>
>> hmm the obsARMuk looks a bit suspicious.
>>
>> I will keep looking, any help appreciated.
>>
>>                 JT
>>
>> David Smith wrote:
>>
>>> On Tue, 8 Nov 2005, Jeff Templon wrote:
>>>
>>>
>>>> our RB (bosheks) has been acting rather strangely today.  from the 
>>>> WM daemon log file:
>>>
>>>
>>>
>>> [...]
>>>
>>> Hi Jeff,
>>>
>>> In principle the Timed out message means that there was a very slow
>>> response from the BDII. (The query and response are normally on the 1
>>> second timescale). Which BDII does bosheks use? I notice that the
>>> problem seemed to appear at about 10:39 this morning and seems to have
>>> gone again since about 11:25.  Do you know of any changes on the BDII
>>> between those times?
>>>
>>> Yours,
>>> David
>>>