Hi Peter,
It turns out that crond on our ARC/Condor Server had crashed.
It's running now. Perhaps it will now run fetch-crl and things will start up again.
That's only a guess, but many auth errors in the gridftp log have dried up.
Let's see what happens now.
Cheers,
Ste
On 07/13/2016 11:59 AM, Love, Peter wrote:
> Steve,
>
> For an insight into ATLAS, please see here and click on the 'fault' item:
> http://apfmon.lancs.ac.uk/q/UKI-NORTHGRID-LIV-HEP_SL6
>
>
> 026 (969175.000.000) 07/13 07:23:09 Detected Down Grid Resource
> GridResource: nordugrid hepgrid2.ph.liv.ac.uk
>
>
> But also for your cream ce:
> 009 (13085502.000.000) 07/13 12:36:23 Job was aborted by the user.
> CREAM error: CREAM_JOB_REGISTER timed out
>
> Network issue?
>
> Cheers,
> Peter
>
>
>> On 13 Jul 2016, at 11:54, Gordon Stewart <[log in to unmask]> wrote:
>>
>> Hi Steve,
>>
>> I'm on duty this week, and I don't recall seeing any alarms for Liverpool yesterday, and certainly nothing which persisted long enough for me to think about notifying / ticketing. The vast majority of current alarms are related to the GridPP Nagios instances going away.
>>
>>
>> Gordon
>>
>>
>> -----Original Message-----
>> From: Testbed Support for GridPP member institutes [mailto:[log in to unmask]] On Behalf Of Daniela Bauer
>> Sent: 13 July 2016 11:40
>> To: [log in to unmask]
>> Subject: Re: Arc Outage
>>
>> Hi Steve,
>>
>> There has been a problem with alarms not going to the dashboard, and even though the dashboard people now claim it's fixed, I get the impression that alarms that didn't reach the dashboard while it was broken are not picked up now either; only new alarms are.
>> You (plural you) might just want to check your site on http://argo.egi.eu to see if there are any residual issues you are not aware of.
>>
>> Cheers,
>> Daniela
>>
>>
>> On 13 July 2016 at 11:28, Stephen Jones <[log in to unmask]> wrote:
>>> Hi Kashif,
>>>
>>> (note to Raj and Alessandra below)
>>>
>>> I was sitting here wondering why there are so few ATLAS or LHCB jobs
>>> coming to our ARC/Condor cluster at Liverpool.
>>>
>>> So I had a dig about, and restarted the services on the ARC/Condor
>>> headnode (for no real reason) at 16:30 on 12th July.
>>>
>>> Today, there are still no jobs from those (although SNOPLUS and NA62
>>> are getting plenty of run time! )
>>>
>>> So I had a look at this new website you mentioned, http://argo.egi.eu
>>>
>>> I can see from the plot (which is attached) that our ARC Server was
>>> PURPLE until 16:30 yesterday, then it went GREEN.
>>>
>>> This happened at the time I restarted the services, so I assume
>>> something must have been stuck.
>>>
>>> Anyway, was this problem seen on the Dashboard? Could I have been notified?
>>>
>>> Also, for Raj and Alessandra: are ATLAS and LHCB hooked up to this
>>> alarm system? Have they stopped sending jobs here? When will they resume?
>>>
>>> Cheers for all your help,
>>>
>>> Ste
>>>
>>> Liverpool
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Steve Jones [log in to unmask]
>>> Grid System Administrator office: 220
>>> High Energy Physics Division tel (int): 43396
>>> Oliver Lodge Laboratory tel (ext): +44 (0)151 794 3396
>>> University of Liverpool http://www.liv.ac.uk/physics/hep/
>>>
>>
>>
>> --
>> Sent from the pit of despair
>>
>> -----------------------------------------------------------
>> [log in to unmask]
>> HEP Group/Physics Dep
>> Imperial College
>> London, SW7 2BW
>> Tel: +44-(0)20-75947810
>> http://www.hep.ph.ic.ac.uk/~dbauer/
--
Steve Jones [log in to unmask]
Grid System Administrator office: 220
High Energy Physics Division tel (int): 43396
Oliver Lodge Laboratory tel (ext): +44 (0)151 794 3396
University of Liverpool http://www.liv.ac.uk/physics/hep/