Grid Admin wrote:
> Hi Sérgio and all,
>
> My site (IEETA) is having the same problem. I do not have much info to
> share (still working on it), but I would like to bring this ticket into
> the discussion. Please take a look:
>
> https://gus.fzk.de/ws/ticket_info.php?ticket=45357 (
> https://savannah.cern.ch/bugs/?46043 )
>
> Is it the same problem?
>
> Maybe it isn't, but still I agree with Fotios Georgatos:
>
> "IMHO,
> The SAM tests is a great tool to catch site errors but, it is not yet
> mature enough to be the source of Grid Site Reliability metrics, because
> its sensors have to be more elaborate about when and how they trigger.
> In particular, we miss a "network sanity" sensor(s) so that we know
> where to look at.
> (Current deployment is as if the TCP/IP stack never fails...)
> I believe this is inadequate, at a moment where sites are asked to sign
> SLAs...
> ...and it also has a high impact on sysadmins time, as a whole.
>
> Let's get this forward, because the SAM tests can be a great tool for all."
Couldn't agree more; see for example
https://gus.fzk.de/ws/ticket_info.php?ticket=46735. Unfortunately,
because of the impact on sysadmins' time and the enormous time and
effort consumed in dissecting out-of-site SAM problems, not enough
people bother complaining. And without complaints, no one does anything
to deal with that.
Sites sign SLAs, while at the same time they depend on, for example,
network connectivity that is covered by a different SLA (or none).
SAM (or whatever) probes assume that all errors are the site's fault. In
fact, every probe tests not only the site but a bunch of different
infrastructures, starting from the SAM infrastructure itself,
international connectivity links, routers... the list goes on. So the
tester is also tested.
I may be missing the whole picture, but I find the concept of using
such probes even as a mere indication (let alone an accurate calculation)
of *site* availability, without taking into account, say, TCP stack
failures as you mention, severely flawed.
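To make the point concrete, here is a minimal sketch of what a "network
sanity" check next to a probe could look like. It is only an illustration,
not part of SAM: the function names and the idea of independent "control"
endpoints are my assumptions. Only lxdpm104.cern.ch:8446 comes from the
error message quoted further down; the control hosts in the usage comment
are hypothetical.

```python
import socket


def tcp_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


def classify(target_ok, controls_ok):
    """Give a rough hint about where to look when a probe target fails.

    target_ok:   result of tcp_reachable() for the endpoint the probe uses
    controls_ok: list of results for independent control endpoints
                 (an assumption of this sketch, not a SAM concept)
    """
    if target_ok:
        return "target reachable"
    if not controls_ok:
        return "inconclusive"
    if not any(controls_ok):
        # Nothing is reachable: the testing host's own network is suspect.
        return "local network suspect"
    # Some controls answer, so the target or the path to it is suspect.
    return "target or path suspect"


# Example usage (control hosts are hypothetical placeholders):
#   target = tcp_reachable("lxdpm104.cern.ch", 8446)
#   controls = [tcp_reachable(h, 443) for h in ("control1.example.org",
#                                               "control2.example.org")]
#   print(classify(target, controls))
```

Something this simple obviously cannot say which link or router is at
fault, but it would at least separate "the site cannot reach anything"
from "the site cannot reach this one remote endpoint".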
On the other hand, they are a great tool to know that something is not
working all the way from CERN to a site. Finding out what that can be
takes me around 15 days of opening tickets, escalation, and debugging,
and that is if we are lucky enough to find a comprehensive error message
somewhere, or someone heard about a network problem/maintenance
somewhere... 50% of such problems are never actually dissected.
Cheers,
>
>
>
> Best Regards
>
> Luis
>
> Sérgio Afonso wrote:
>> Hi *,
>>
>> Since this morning, from 5 AM onward, we have been failing SAM tests
>> (CE-sft-lcg-rm-rep) with:
>>
>> ------------
>> [SE][Mkdir][] httpg://lxdpm104.cern.ch:8446/srm/managerv2: CGSI-gSOAP
>> running on grid011.up.pt reports could not open connection to
>> lxdpm104.cern.ch
>> ------------
>>
>> I already updated to the latest gLite update and reconfigured the
>> Storage Element, but the problem persists.
>>
>> Can anyone point me to where I should look to solve this issue?
>>
>> Thanks in advance,
>> Sérgio Afonso
>
--
=============================================================================
Dimitris Zilaskos
GridAUTH Operations Centre @ Aristotle University of Thessaloniki , Greece
Tel: +302310998988 Fax: +302310994309
http://www.grid.auth.gr
=============================================================================