Hi Andrews (Washbrook, Lahiff, ...), all
Re: ARC Brainstorming Camp Summary
In Andrew W's talk on ARC (linked below), I found this statement:
> Sometimes a-rex or grid-manager locks up. We have to detect when the
> gm-heartbeat file is stale, then restart by hand.
This is also the case at Liverpool.
Until recently, we had ~1000 slots, and it happened (say) every month
or two.
Lately, I added some nodes, which took us up to 1330 slots.
Now it happens every couple of days.
So I'll have to also "detect when the gm-heartbeat file is stale, then
restart".
It's becoming a pest. What do people know about this problem?
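For concreteness, the "detect when the gm-heartbeat file is stale, then restart" step could be sketched roughly as below, in Python. This is only an illustration: the heartbeat path, the staleness threshold and the restart command are all assumptions (they are not given in the talk), and the function names are made up for this sketch.

```python
"""Restart A-REX when gm-heartbeat has gone stale (illustrative sketch only)."""
import os
import subprocess
import time

# All three values below are assumptions; adjust them for the local installation.
HEARTBEAT = "/var/spool/arc/jobstatus/gm-heartbeat"  # assumed control directory
MAX_AGE = 600  # seconds without an update before we call it stale (assumed)
RESTART_CMD = ["service", "a-rex", "restart"]  # assumed service name


def heartbeat_age(path, now=None):
    """Return seconds since the file was last modified, or None if it is missing."""
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return None
    if now is None:
        now = time.time()
    return now - mtime


def is_stale(path, max_age, now=None):
    """True if the heartbeat is older than max_age; a missing file counts as stale."""
    age = heartbeat_age(path, now)
    return age is None or age > max_age


def restart_if_stale(path=HEARTBEAT, max_age=MAX_AGE, cmd=RESTART_CMD):
    """Run the restart command when the heartbeat is stale; return its exit code."""
    if is_stale(path, max_age):
        return subprocess.call(cmd)
    return 0
```

Calling restart_if_stale() from a cron job every few minutes would automate the by-hand restart. It treats the symptom only, and counting a missing file as stale is a design choice that would also restart a service that had been stopped on purpose.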
Cheers,
Ste
That talk:
https://indico.cern.ch/event/594508/attachments/1387782/2112742/ajw-ARC-131216.pdf
--
Steve Jones
Grid System Administrator       office: 220
High Energy Physics Division    tel (int): 43396
Oliver Lodge Laboratory         tel (ext): +44 (0)151 794 3396
University of Liverpool         http://www.liv.ac.uk/physics/hep/