I think this is a very important point. I like the idea of shutting down
incrementally over a period of time: first the information system, then
the data transfer doors once it is safe, then the SRM, so people can
always "srm put done". I may need some D-Cache developer approval and help
getting the state of transfers through the doors, but I have little doubt
they would be happy to help me.
If I were to write a script that sets the information system to "reboot
prepare" and then shuts down D-Cache in safe stages, would people want it
enough to get the FTS guys and the lcg-cr maintainers to check the
information system?
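To make the idea concrete, here is a rough sketch of the staged order
described above (information system, then doors, then SRM). It is only an
outline under assumptions: the Stage class, the run_staged_shutdown and
default_stages names, the wait times and the no-op actions are all
hypothetical placeholders, not real D-Cache commands.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    """One step of the shutdown. All fields are illustrative placeholders."""
    name: str
    action: Callable[[], None]  # hypothetical hook; real D-Cache calls are not known here
    wait_seconds: int           # how long to let clients notice before the next stage


def _placeholder() -> None:
    """Stand-in for a site-specific command; intentionally does nothing."""


def default_stages() -> List[Stage]:
    # Order taken from the message: information system, then doors, then SRM.
    # The wait times are made-up values for illustration only.
    return [
        Stage("information system", _placeholder, 3600),
        Stage("data transfer doors", _placeholder, 600),
        Stage("SRM", _placeholder, 0),
    ]


def run_staged_shutdown(
    stages: List[Stage],
    sleep: Callable[[float], None] = time.sleep,
    log: Callable[[str], None] = print,
) -> List[str]:
    """Run each stage in order, pausing after each one; return the names run."""
    completed: List[str] = []
    for stage in stages:
        log(f"shutting down: {stage.name}")
        stage.action()
        sleep(stage.wait_seconds)
        completed.append(stage.name)
    return completed
```

Calling run_staged_shutdown(default_stages()) would walk the stages with the
real waits; passing sleep=lambda s: None gives a dry run. The SRM goes last
so that transfers already in flight can still be completed.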
I will try to kick off some discussion and see whether everyone agrees
this is very important in D-Cache land.
Regards
Owen
On Thu, 30 Mar 2006 11:02:41 +0100
"Brew, CAJ (Chris)" <[log in to unmask]> wrote:
> Hi,
>
> I cannot find anything about it now but I seem to recall someone had
> some procedure whereby they used iptables to block new connections to
> some port (2811 possibly) for a while before the reboot so no new
> transfers could be started. Of course that doesn't help the jobs that
> die because they could not access the files.
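A minimal sketch of what that iptables approach might look like, assuming
the port really is 2811 (the usual GridFTP control port) and that the rule
belongs in the INPUT chain. The function names are hypothetical; the code
only builds the command lines and does not run them.

```python
from typing import List

# Assumption: 2811 is the port meant in the message ("2811 possibly").
GRIDFTP_CONTROL_PORT = 2811


def _rule_spec(port: int) -> List[str]:
    # Match only NEW TCP connections to the port, so connections that are
    # already established are left alone.
    return ["INPUT", "-p", "tcp", "--dport", str(port),
            "-m", "state", "--state", "NEW", "-j", "REJECT"]


def block_new_connections_cmd(port: int = GRIDFTP_CONTROL_PORT) -> List[str]:
    """Command that inserts the blocking rule at the top of INPUT."""
    return ["iptables", "-I"] + _rule_spec(port)


def unblock_cmd(port: int = GRIDFTP_CONTROL_PORT) -> List[str]:
    """Command that deletes the same rule again after the reboot."""
    return ["iptables", "-D"] + _rule_spec(port)
```

The lists can be handed to subprocess.run(..., check=True) as root. As noted
above, this only stops new transfers from starting; it does nothing for jobs
that fail because they cannot reach their files.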
>
> Yours,
> Chris.
>
> > -----Original Message-----
> > From: Testbed Support for GridPP member institutes
> > [mailto:[log in to unmask]] On Behalf Of Burke, S (Stephen)
> > Sent: 29 March 2006 14:55
> > To: [log in to unmask]
> > Subject: Re: procedure for rebooting SE?
> >
> > Testbed Support for GridPP member institutes
> > > [mailto:[log in to unmask]] On Behalf Of Graeme Stewart
> > > said: CMS are modifying phedex to use FTS as well.
> >
> > However, that doesn't mean that everything uses FTS; there
> > will still be a lot of file access via lcg-cr and friends, or
> > indeed directly with gridftp. I don't think there is any
> > robust way to drain an SE, although like many things it's
> > been on the wish list for a long time (the EDG bugzilla is
> > now closed so I can't point people at my entries there any
> > more!). Probably the best you can do is take it out of the
> > info system some time before you turn it off, and perhaps
> > point the close SE somewhere else. In theory the new
> > GlueServiceStatus table/object has a flag to say whether a
> > service is working or not, but I doubt that anything looks at it.
> >
> > > On the whole though, it would be far more disruptive to take the
> > > whole site down just for a 5 min reboot. With current job success
> > > rates on the grid, a rare transient job failure from an SE reboot
> > > is not terribly significant.
> >
> > Possibly true, but we should at least be *trying* to improve!
> >
> > Stephen
> >