This thread makes me think we should have a central tracking of this
kind of service intervention, so that we can gather statistics on where
the problems are. We could use the problem tracking system via a
specific category where any admin on any site that has to intervene
(stop, restart, kill, reboot, ...) any service would log the action.
Ian
> -----Original Message-----
> From: Markus Schulz
> Sent: 03 December 2003 13:40
> To: [log in to unmask]
> Subject: Re: [LCG-ROLLOUT] Three monitored RBs all down
>
> Hi,
> I didn't restart the RB.
> Maybe it was Jose for his tests.
> markus
> On Wednesday, Dec 3, 2003, at 11:14 Europe/Zurich, Emanuele LEONARDI
> wrote:
>
> > Hi Trevor.
> >
> > The NetworkServer on CERN RB was restarted this morning at 9:10. I did
> > not do it, so I guess this was Markus from home.
> >
> > Emanuele
> >
> > Daniels, T (Trevor) wrote:
> >> Thanks Gergo, Martin and Emanuele? for your prompt attention - all
> >> three RBs are now functioning again.
> >>
> >> Trevor
> >>
> >> Dr Trevor Daniels
> >> c/o CCLRC eSC Department            Phone: (+44)|(0) 1235 778093
> >> Rutherford Appleton Laboratory      Fax:   (+44)|(0) 1235 446626
> >> Chilton, DIDCOT, Oxon, OX11 0QX, UK Email: [log in to unmask]
> >> The contents of this email are sent in confidence for the use of the
> >> intended recipient only. If you are not one of the intended recipients do
> >> not take action on it or show it to anyone else, but return this email to
> >> the sender and delete your copy of it.
> >>
> >>
> >>
> >>> -----Original Message-----
> >>> From: Debreczeni Gergely [mailto:[log in to unmask]]
> >>> Sent: Wednesday, December 03, 2003 9:44 AM
> >>> To: [log in to unmask]
> >>> Subject: Re: [LCG-ROLLOUT] Three monitored RBs all down
> >>>
> >>>
> >>> Hi,
> >>>
> >>> I couldn't figure out why our RB has died.
> >>> Restart/reboot didn't help, I deleted and reconfigured
> >>> lbserver20, now it seems to work... ?
> >>> Gergo
> >>>
> >>>
> >>>
> >>> On Wed, 3 Dec 2003, Bly, MJ (Martin) wrote:
> >>>
> >>>
> >>>> It isn't a simple problem - restarting our RB doesn't make it perform
> >>>> properly. Nor do the various tricks that seemed to cure its various
> >>>> ailments and corruptions a few weeks back. Steve Traylen is currently
> >>>> looking at it now with malice aforethought...
> >>>>
> >>>> I notice that the CERN CA is failing to respond to the edg-fetch-crl-cron
> >>>> jobs so generating a CA certificate verify failure.
> >>>>
> >>>> Martin.
> >>>> --
> >>>> -------------------------------------------------------
> >>>> Martin Bly | +44 1235 446981 | [log in to unmask]
> >>>> Systems Admin, Tier 1/A Service, RAL PPD CSG
> >>>> -------------------------------------------------------
> >>>>
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Ian Bird [mailto:[log in to unmask]]
> >>>>> Sent: Wednesday, December 03, 2003 9:13 AM
> >>>>> To: [log in to unmask]
> >>>>> Subject: Re: [LCG-ROLLOUT] Three monitored RBs all down
> >>>>>
> >>>>>
> >>>>> Well if that's the limit of what an RB can do then we'll be better off
> >>>>> without them! We should really try and get an idea of what is going on
> >>>>> with them, but is it not this continuing issue that the RB just crashes
> >>>>> and has to be restarted?
> >>>>>
> >>>>> Ian
> >>>>>
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Daniels, T (Trevor) [mailto:[log in to unmask]]
> >>>>>> Sent: 03 December 2003 10:01
> >>>>>> To: [log in to unmask]
> >>>>>> Subject: [LCG-ROLLOUT] Three monitored RBs all down
> >>>>>>
> >>>>>> The GppMon monitor currently uses three RBs to submit jobs every hour to all
> >>>>>> CEs. At the moment all three RBs are failing to process jobs.
> >>>>>>
> >>>>>> The RAL RB has not processed jobs since 15:00 on 26 Nov (but I know they
> >>>>>> currently have more urgent problems to attend to)
> >>>>>> The CERN RB failed around 18:00 UTC on 2 Dec
> >>>>>> The Budapest RB failed around 20:00 UTC on 2 Dec.
> >>>>>>
> >>>>>> I wonder to what extent the GppMon load, which is now around 28 jobs an hour
> >>>>>> through each RB, is causing these failures?
> >>>>>>
> >>>>>> Trevor
> >>>>>>
> >>>>>
> >
> >
> > --
> > /------------------- Emanuele Leonardi -------------------\
> > | eMail: [log in to unmask] - Tel.: +41-22-7674066 |
> > | IT division - Bat.31 2-012 - CERN - CH-1211 Geneva 23 |
> > \---------------------------------------------------------/
> >
> >
>
> *******************************************************************************
> Markus Schulz
> CERN IT
>