Hi John
As I said, it raises the question. While in this case it is just ops
failing and nobody would have been called out, I can imagine other cases
where something is spotted over the weekend and a callout would help to
provide a better service to users. It is not clear to me that we
currently know how to escalate operational problems other than through
this email list.
Conversely, a communication from the T1 (the more likely source, since
the T1 is actively monitored out of hours) that we have a UK-wide
problem, when one is known about, even if it is just with the
monitoring, could be useful to reduce the time each T2 admin spends
investigating their own failures.
Catalin - thanks for responding that the problem is known. For how long
did the previous occurrences last, and what was the response from CERN
when it was questioned last time?
regards,
Jeremy
On 12 Jun 2010, at 23:37, John Gordon wrote:
> Jeremy, I do not believe that UK ops people can submit alarm tickets.
> Technically it just needs DNs adding to a list but whether they should
> or not is another matter. Keeping the experiment data flows going from
> CERN to a T1 is the justification for getting someone at the T1 out of
> bed. Is a UK site failing a SAM test of the same magnitude?
>
> As we have heard from Catalin, the T1 was called out so the T1
> monitoring seems to be working.
>
> Regards,
>
> John
>
> -----Original Message-----
> From: Testbed Support for GridPP member institutes
> [mailto:[log in to unmask]] On Behalf Of J Coles
> Sent: 12 June 2010 17:11
> To: [log in to unmask]
> Subject: Re: lcg-bdii.gridpp.ac.uk problem?
>
> Hi Elena
>
> That supports the suspicion of lcg-bdii.gridpp.ac.uk. I suspect IC and
> RALPP (and probably Bristol) were already set to look at alternate
> BDIIs. The Glasgow BDII has recovered from whatever problem it
> suffered earlier and now the Glasgow site is also passing again. Since
> the LHC VO results are still fine, I only created a GGUS ticket with
> top-priority (https://gus.fzk.de/ws/ticket_info.php?ticket=58990) - it
> raises a question as to whether our ops people can anyway submit alarm
> tickets to the T1 like the experiment ops people. I thought the T1
> triggered a call out after 2 successive ops VO failures anyway and
> since they are affected too... Something to discuss next week.
>
> Cheers,
> Jeremy
>
> On 12 Jun 2010, at 16:27, Elena Korolkova wrote:
>
>> Hi Jeremy
>>
>> I just changed LCG_GFAL_INFOSYS to "bdii.ce-egee.org" and we passed
>> the last SAM test.
>>
>> Elena
>>
>>
>> ____________________________________________________________________
>> Dr Elena Korolkova
>> Email: [log in to unmask]
>> Tel.: +44 (0)114 2223553
>> Fax: +44 (0)114 2223555
>> Department of Physics and Astronomy
>> University of Sheffield
>> Sheffield, S3 7RH, United Kingdom
>>
>> On Sat, 12 Jun 2010, J Coles wrote:
>>
>>> Hi Wahid
>>>
>>> The history here shows problems for the Glasgow BDII but not
>>> lcg-bdii.gridpp.ac.uk:
>>> http://pprc.qmul.ac.uk/~lloyd/gridpp/bdiitest.html
>>>
>>> This view (from gstat2 that everyone at HEPSYSMAN yesterday will
>>> know about):
>>> http://gstat-prod.cern.ch/gstat/service/bdii_top/treeview/lcg-bdii.gridpp.ac.uk/
>>> also shows things to be okay (for now at least).
>>>
>>> There are some sites passing:
>>> http://pprc.qmul.ac.uk/~lloyd/gridpp/samtest.html
>>> (i.e. IC ... RALPP). All others fail with ERROR: CE-sft-lcg-rm-rep
>>> with
>>>
>>> CRITICAL: METRIC FAILED [org.sam.WN-RepRep-/ops/Role=lcgadmin]:
>>> CRITICAL: File was NOT replicated to SE samdpm002.cern.ch. [ErrDB:
>>> [('lcg_util_wn', 'server', 'CRITICAL')]]
>>> org.sam.WN-RepCr-/ops/Role=lcgadmin
>>>
>>> Since other countries do not see the problem I tend to agree that
>>> it suggests a core UK problem, but the monitoring results are not
>>> clear (for me at least). How come ralpp and IC continue to pass the
>>> org.sam.WN-Rep-/ops/Role=lcgadmin service test? Perhaps one of the
>>> on-duty people can comment as I must be missing something.
>>>
>>> Jeremy
>>>
>>>
>>>
>>>
>>>
>>> On 12 Jun 2010, at 09:42, Wahid Bhimji wrote:
>>>
>>>> Hi
>>>> Looks like a number of sites are failing sam tests due to a
>>>> problem with lcg-bdii.gridpp.ac.uk.
>>>> Could someone take a look
>>>> Ta
>>>> Wahid
>>>> --
>>>> The University of Edinburgh is a charitable body, registered in
>>>> Scotland, with registration number SC005336.