Hi,
On Mon, 28 Nov 2005, Gordon, JC (John) wrote:
> It's not just Irish sites. I see UK sites which fail one or two
> sft-lcg-rm tests, then pass again without any action by the site. Our
> diagnosis is that a failure of the information service causes this.
>
> Does no-one else see this?
I've been fighting this issue for almost half a year now. There is a fix
for the bdii-update startup script (thanks Maarten), and it did improve
the situation for me, but the problem is still there. During a fraction
of a BDII update cycle the BDII node is very unresponsive. As was said
already, this problem is hard to track down because network latency might
also be involved. Note, however, that we have our own BDII node here, so
WN->BDII latency is definitely not the issue in our case.
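One way to put a number on this is to repeat a query and count how many
runs reach the 15 s timeout mentioned in the quoted message below. This is
only a minimal sketch, assuming a POSIX shell and a `date` that supports
`+%s`; the helper name `count_slow` is hypothetical and not part of
lcg-utils.

```shell
#!/bin/sh
# Hypothetical helper (not part of lcg-utils): run a command RUNS times and
# print how many runs took at least LIMIT seconds of wall-clock time.
# Usage: count_slow RUNS LIMIT_SECONDS COMMAND [ARGS...]
count_slow() {
    runs=$1
    limit=$2
    shift 2
    slow=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # date +%s has one-second resolution, so timings are approximate.
        start=$(date +%s)
        "$@" >/dev/null 2>&1
        end=$(date +%s)
        if [ $((end - start)) -ge "$limit" ]; then
            slow=$((slow + 1))
        fi
        i=$((i + 1))
    done
    echo "$slow"
}

# Example, using the query Stephen ran in the quoted thread (needs network
# access to the BDII, so it is left commented out here):
# count_slow 50 15 ldapsearch -x -h lcg-bdii.cern.ch -p 2170 \
#     -b 'mds-vo-name=giAITie,mds-vo-name=local,o=grid'
```

Running it against a local BDII and against lcg-bdii.cern.ch from the same
node should help separate BDII unresponsiveness from network latency.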
As for the better error messages in LCG-2_6_0, I have failed so far to
spot any of them (yes, it still says 'invalid argument').
Szabolcs
> John
>
>> -----Original Message-----
>> From: LHC Computer Grid - Rollout
>> [mailto:[log in to unmask]] On Behalf Of
>> Maarten Litmaath
>> Sent: 28 November 2005 13:18
>> To: [log in to unmask]
>> Subject: Re: [LCG-ROLLOUT] sites Failing SFT lcg-rm tests
>>
>> Stephen Childs wrote:
>>
>>> Maarten Litmaath wrote:
>>>
>>> > Might your sites be suffering from the 15s query timeout in lcg-utils?
>>> > How good is the connectivity to lcg-bdii.cern.ch?
>>> >
>>> Could you give me a sample ldapsearch string that is representative of
>>> what the lcg-utils do to test this?
>>
>> ldapsearch -x -h lcg-bdii.cern.ch:2170 -b o=grid \
>> '(&(GlueServiceType=*)(GlueServiceAccessControlRule=dteam))'
>>
>>> I just ran the following command:
>>>
>>> ldapsearch -x -h lcg-bdii.cern.ch -p 2170 -b \
>>> 'mds-vo-name=giAITie,mds-vo-name=local,o=grid'
>>>
>>> 50 times from one of our slower sites and it seems as if there are
>>> occasions when it takes >1 minute to get this information. (However a
>>> quick check at a couple of other sites didn't show such long times.)
>>> If the problem is at the CERN end, it might explain why the RM
>>> failures happen intermittently?
>>
>> I will have a look at lcg-bdii, but if the trouble is mostly
>> with the Irish sites, I suspect there is something clunky in
>> the old middleware or there is a connectivity problem.
>>
>