

Hi John,

From France, we have also noticed that most sft-lcg-* failures appear 
and then disappear without any action by the sites. This problem has 
already been raised at various weekly operations meetings, and I think 
that Sven was the last one to complain about it, two weeks ago.

In my view, there are two problems to solve here:
- Make the reason for an SFT failure clearer and, in particular, point at the 
real source of the failure rather than systematically implying that the 
failure comes from the site. This would certainly require clearer handling 
of middleware errors.
- Improve the robustness of the lcg-* commands... of course.


Gordon, JC (John) wrote:

>It's not just Irish sites. I see UK sites which fail one or two tests
>sft-lcg-rm then pass again without any action by the site. Our diagnosis
>is that it is a failure to reach the information service which causes this.
>Does no-one else see this?
>>-----Original Message-----
>>From: LHC Computer Grid - Rollout 
>>[mailto:[log in to unmask]] On Behalf Of 
>>Maarten Litmaath
>>Sent: 28 November 2005 13:18
>>To: [log in to unmask]
>>Subject: Re: [LCG-ROLLOUT] sites Failing SFT lcg-rm tests
>>Stephen Childs wrote:
>>>Maarten Litmaath wrote:
>>> > Might your sites be suffering from the 15s query timeout in lcg-utils?
>>> > How good is the connectivity to
>>> >
>>>Could you give me a sample ldapsearch string that is representative of 
>>>what the lcg-utils do to test this?
>>ldapsearch -x -h -b o=grid \
>>     '(&(GlueServiceType=*)(GlueServiceAccessControlRule=dteam))'
>>>I just ran the following command:
>>>ldapsearch -x -h -p 2170 -b 
>>>50 times from one of our slower sites and it seems as if there are 
>>>occasions when it takes >1 minute to get this information. (However a 
>>>quick check at a couple of other sites didn't show such long times.) 
>>>If the problem is at the CERN end, it might explain why the RM 
>>>failures happen intermittently?
>>I will have a look at lcg-bdii, but if the trouble is mostly 
>>with the Irish sites, I suspect there is something clunky in 
>>the old middleware or there is a connectivity problem.

Grid Computing Team Member
IN2P3/CNRS Computing Centre - Lyon (FRANCE)
Tel. +33 | Fax. +33 | e-mail: [log in to unmask]