Hi,
I missed this mail the first time around - my desktop box died and
I now have to read much of my mail via a web interface :-(
Wasn't David talking about a classic SE?
For an SRM the Right Thing(tm) is to prevent further puts and gets
from being handled, not to prevent new GridFTP connections. And
not by blocking them but by returning an error.
AFAIK, our SRMs aren't smart enough to do that, despite the fact
that the use case was known since EDG, as Stephen mentioned.
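To make the "return an error rather than block" idea concrete, here is a
minimal sketch. It is not how any existing SRM is implemented: RequestGate
and DrainingError are hypothetical names invented for illustration. The
only point is that new puts and gets are refused with an explicit error
while outstanding requests are still allowed to finish.

```python
class DrainingError(Exception):
    """Tells a client that the SE is not accepting new requests."""


class RequestGate:
    """Hypothetical front-end check; not part of any real SRM."""

    def __init__(self):
        self.draining = False
        self.active = set()

    def start(self, request_id):
        # While draining, refuse new puts/gets with an explicit error
        # instead of silently dropping the connection.
        if self.draining:
            raise DrainingError("SE is draining; use another SE or retry later")
        self.active.add(request_id)

    def finish(self, request_id):
        # Completing an outstanding request is always allowed.
        self.active.discard(request_id)

    def safe_to_shut_down(self):
        # Safe only once draining is on and nothing is outstanding.
        return self.draining and not self.active
```

An operator would set draining to True some time before the reboot and
wait for safe_to_shut_down() to become true.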
As Owen mentioned, it would be better to take it out of the
information system. Since most tools query the information system,
that would prevent new transfers. Clients that have outstanding stuff
going on will hopefully remember the endpoint long enough to complete
the transaction. OTOH, that should be sufficient, so there shouldn't
be a need to muck around with doors and stuff.
Anyone tested this?
Can I recommend that people add to the wiki:
https://wiki.gridpp.ac.uk/wiki/SE_Shutdown
I see no changes since the first quick stab I had at writing the page.
Cheers,
--jens
-----Original Message-----
From: Testbed Support for GridPP member institutes on behalf of Owen Synge
Sent: Thu 30/03/2006 11:15 AM
To: [log in to unmask]
Subject: Re: procedure for rebooting SE?
I think this is a very important point. I like the idea of incrementally
shutting down over a period of time, starting with the information
system, then (safely) the data transfer doors, then the SRM, so people can
always "srm put done". I may need some D-Cache developer approval and help
getting the state of transfers through doors, but I have little doubt
they would be happy to help me.
If I were to write a script that sets the information system to "preparing
for reboot" and then shuts down D-Cache in safe stages, would people want it
enough to get the FTS guys and the lcg-cr maintainers to check the
information system?
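As a rough sketch of what such a staged script might look like (all names
here are assumptions: staged_shutdown, the stage callables and is_idle are
hypothetical placeholders, not D-Cache or information system APIs), each
stage runs in order and the script waits for transfers to drain before
moving to the next one:

```python
import time


def staged_shutdown(stages, is_idle, poll_seconds=30,
                    timeout_seconds=3600, sleep=time.sleep):
    """Run (name, action) stages in order, draining between them.

    stages  -- e.g. unpublish from the information system, close the
               data transfer doors, stop the SRM (site-specific callables)
    is_idle -- callable returning True when no transfers are outstanding
    """
    completed = []
    for name, action in stages:
        action()
        completed.append(name)
        waited = 0
        # Give outstanding transfers a chance to finish, but do not
        # wait forever: move on once the timeout is reached.
        while not is_idle() and waited < timeout_seconds:
            sleep(poll_seconds)
            waited += poll_seconds
    return completed
```

How is_idle would find out the state of transfers through the doors is
exactly the part that needs input from the D-Cache developers.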
I will try and kick off some discussion and see if everyone agrees
this is very important in D-Cache land.
Regards
Owen
On Thu, 30 Mar 2006 11:02:41 +0100
"Brew, CAJ (Chris)" <[log in to unmask]> wrote:
> Hi,
>
> I cannot find anything about it now but I seem to recall someone had
> some procedure whereby they used iptables to block new connections to
> some port (2811 possibly) for a while before the reboot so no new
> transfers could be started. Of course that doesn't help the jobs that
> die because they could not access the files.
>
> Yours,
> Chris.
>
> > -----Original Message-----
> > From: Testbed Support for GridPP member institutes
> > [mailto:[log in to unmask]] On Behalf Of Burke, S (Stephen)
> > Sent: 29 March 2006 14:55
> > To: [log in to unmask]
> > Subject: Re: procedure for rebooting SE?
> >
> > Testbed Support for GridPP member institutes
> > > [mailto:[log in to unmask]] On Behalf Of Graeme Stewart
> > > said: CMS are modifying phedex to use FTS as well.
> >
> > However, that doesn't mean that everything uses FTS, there
> > will still be a lot of file access via lcg-cr and friends, or
> > indeed directly with gridftp. I don't think there is any
> > robust way to drain an SE, although like many things it's
> > been on the wish list for a long time (the EDG bugzilla is
> > now closed so I can't point people at my entries there any
> > more!). Probably the best you can do is take it out of the
> > info system some time before you turn it off, and perhaps
> > point the close SE somewhere else. In theory the new
> > GlueServiceStatus table/object has a flag to say whether a
> > service is working or not, but I doubt that anything looks at it.
> >
> > > On the whole though, it would be far more disruptive to take the whole
> > > site down just for a 5 min reboot. With current job success rates on
> > > the grid, a rare transient job failure from an SE reboot is not
> > > terribly significant.
> >
> > Possibly true, but we should at least be *trying* to improve!
> >
> > Stephen
> >