Incidentally, Liverpool is now showing an alarm on the dashboard (ARC-CE-srm-ops on hepgrid2.ph.liv.ac.uk):
https://snf-702430.vm.okeanos.grnet.gr/nagios/cgi-bin/status.cgi?host=hepgrid2.ph.liv.ac.uk
It's showing an age of four hours, but certainly wasn't visible this morning when we were discussing this; however, it's reasonably common for alarms to suddenly pop up on the dashboard with ages suggesting they've existed for hours or even days. (Of course, it's possible that the underlying alarm was present but that it wasn't reported on the ROD dashboard, which is really the bit I care about.)
Gordon
-----Original Message-----
From: Testbed Support for GridPP member institutes [mailto:[log in to unmask]] On Behalf Of Stephen Jones
Sent: 13 July 2016 12:11
To: [log in to unmask]
Subject: Re: Arc Outage
Thanks Gordon,
Since the plot shows our ARC Server was PURPLE, and it didn't show up on the dashboard, this indicates a disconnect somewhere. Areas to be suspicious of would be:
a) Is the alarm carried over to the dashboard only at the onset of the condition? (i.e. interrupt)
b) Or periodically thereafter? (i.e. polling)
If (a), then if the start of the condition is missed, it stays "missed"
for ever, or at least until the condition becomes right, then fails again. That's a problem with event-driven systems.
And if (b), then it seems that the connection, data or polling logic is otherwise flaky.
So it looks like something's fishy. Suggest we characterise this system for a bit.
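To make the (a)/(b) distinction concrete, here is a toy sketch in Python. It is only a model under stated assumptions: the function names, the "OK"/"CRITICAL" states and the per-tick `link_up` flag are all made up for illustration and are not taken from the real dashboard or from ARGO. It just shows why a missed onset stays missed in the interrupt case, while polling catches up once the link returns.

```python
# Hypothetical model only: none of these names come from the real
# ROD dashboard, Nagios or ARGO. One list entry = one time tick.

def edge_triggered(states, link_up):
    """Case (a): the dashboard hears about a state only when it changes."""
    shown = False          # is the alarm visible on the dashboard?
    prev = "OK"            # last state seen at the source
    seen = []
    for state, up in zip(states, link_up):
        if state != prev and up:
            # A transition is delivered only if the link is up at that moment.
            shown = (state == "CRITICAL")
        prev = state       # the source moves on whether or not it was delivered
        seen.append(shown)
    return seen

def polled(states, link_up):
    """Case (b): the dashboard re-reads the current state on every tick."""
    shown = False
    seen = []
    for state, up in zip(states, link_up):
        if up:
            shown = (state == "CRITICAL")
        seen.append(shown)
    return seen

# The alarm starts at tick 1, exactly while the link to the dashboard is down.
states  = ["OK", "CRITICAL", "CRITICAL", "CRITICAL", "OK", "CRITICAL"]
link_up = [True, False,      True,       True,       True, True]

edge_view = edge_triggered(states, link_up)
poll_view = polled(states, link_up)
print("edge:", edge_view)
print("poll:", poll_view)
```

In this model the interrupt-style view never shows the first alarm and only shows one after the condition clears and fails again, whereas the polled view shows it from the first tick after the link comes back. Comparing the dashboard against the source over a few such episodes would be one way to characterise which behaviour we actually have.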
Cheers,
Ste
On 07/13/2016 11:54 AM, Gordon Stewart wrote:
> Hi Steve,
>
> I'm on duty this week, and I don't recall seeing any alarms for Liverpool yesterday, and certainly nothing which persisted long enough for me to think about notifying / ticketing. The vast majority of current alarms are related to the GridPP Nagios instances going away.
>
>
> Gordon
>
>
> -----Original Message-----
> From: Testbed Support for GridPP member institutes
> [mailto:[log in to unmask]] On Behalf Of Daniela Bauer
> Sent: 13 July 2016 11:40
> To: [log in to unmask]
> Subject: Re: Arc Outage
>
> Hi Steve,
>
> There has been a problem with alarms not going to the dashboard, and
> even though the dashboard people now claim it's fixed, I get the
> impression that alarms that didn't reach the dashboard while it was
> broken are not picked up now either, only new alarms are. You (plural
> you) might just want to check your site on http://argo.egi.eu to see if there are any residual issues you are not aware of.
>
> Cheers,
> Daniela
>
>
> On 13 July 2016 at 11:28, Stephen Jones <[log in to unmask]> wrote:
>> Hi Kashif,
>>
>> (note to Raj and Alessandra below)
>>
>> I was sitting here wondering why there are so few ATLAS or LHCB jobs
>> coming to our ARC/Condor cluster at Liverpool.
>>
>> So I had a dig about, and restarted the services on the ARC/Condor
>> headnode (for no real reason) on 12th July at 16:30.
>>
>> Today, there are still no jobs from those (although SNOPLUS and NA62
>> are getting plenty of run time!)
>>
>> So I had a look at this new website you mentioned, http://argo.egi.eu
>>
>> I can see from the plot (which is attached) that our ARC Server was
>> PURPLE until 16:30 yesterday, then it went GREEN.
>>
>> This happened at the time I restarted the services, so I assume
>> something must have been stuck.
>>
>> Anyway, was this problem seen on the Dashboard? Could I have been notified?
>>
>> Also, for Raj and Alessandra: are ATLAS and LHCB hooked up to this
>> alarm system? Have they stopped sending jobs here? When will they resume?
>>
>> Cheers for all your help,
>>
>> Ste
>>
>> Liverpool
>>
>>
>>
>>
>>
>>
>> --
>> Steve Jones [log in to unmask]
>> Grid System Administrator office: 220
>> High Energy Physics Division tel (int): 43396
>> Oliver Lodge Laboratory tel (ext): +44 (0)151 794 3396
>> University of Liverpool http://www.liv.ac.uk/physics/hep/
>>
>
>
> --
> Sent from the pit of despair
>
> -----------------------------------------------------------
> [log in to unmask]
> HEP Group/Physics Dep
> Imperial College
> London, SW7 2BW
> Tel: +44-(0)20-75947810
> http://www.hep.ph.ic.ac.uk/~dbauer/
--
Steve Jones [log in to unmask]
Grid System Administrator office: 220
High Energy Physics Division tel (int): 43396
Oliver Lodge Laboratory tel (ext): +44 (0)151 794 3396
University of Liverpool http://www.liv.ac.uk/physics/hep/