I guess I am stuck on the fence about this - which I acknowledge isn't a
useful, helpful or comfortable place to be.
Basically I am uneasy about glexec for the reasons outlined, but
pragmatically might be prepared to run it if it were essential to do so.
However, despite reading all the bumf I am none the wiser really about
how comfortable we will be with this WRT the operational issues,
like the ability to kill jobs, trace users, etc.
Really, without testing at RAL I would not be prepared to buy a pig in a
poke. We had no effort available some months back to do this, but could
do so now via a variety of routes - for example the PPS.
Andrew
> -----Original Message-----
> From: Testbed Support for GridPP member institutes
> [mailto:[log in to unmask]] On Behalf Of Alessandra Forti
> Sent: 04 July 2007 08:50
> To: [log in to unmask]
> Subject: Re: UK input to tomorrow's WLCG GDB
>
>
> Hi John,
>
> indeed. The wiki is not complete, and it is there to be completed.
> Developers were asked by the TCG to insert their information, but
> haven't done it so far. And I have already asked the dteam twice to
> put in theirs while we were discussing this, but nobody has done it
> so far.
>
> cheers
> alessandra
>
> Gordon, JC (John) wrote:
> > Thanks Graeme, I knew this had been discussed at length, but when
> > speaking in a meeting one can't just say "follow this thread". I
> > checked the wiki and it doesn't go into this detail. Jeremy needs
> > the good summary you give.
> >
> > John
> >
> > -----Original Message-----
> > From: Testbed Support for GridPP member institutes
> > [mailto:[log in to unmask]] On Behalf Of Graeme Stewart
> > Sent: 03 July 2007 17:27
> > To: [log in to unmask]
> > Subject: Re: UK input to tomorrow's WLCG GDB
> >
> > On 3 Jul 2007, at 16:29, Coles, J (Jeremy) wrote:
> >
> >> Dear All
> >>
> >> Tomorrow there is a GDB (happens monthly as I hope you know!) at
> >> CERN with the following agenda:
> >> http://indico.cern.ch/conferenceDisplay.py?confId=8485
> >>
> >> If you have any important issues that you would like raised/
> >> discussed in relation to any of these items (or others) please let
> >> me know. Current items to be taken up from the UK include:
> >>
> >> 1) Confirmation of experiment readiness to move to SL4
> >>
> >> 2) Confirmation that a well defined list of rpms required by the
> >> experiments but not in the standard SL4 installation is available
> >> (either as a list in the VO ID card for the experiment or as an
> >> experiment meta-package).
> >
> > If ATLAS and LHCb say that they are ready to move on this then
> > Glasgow are prepared to go early on this - perhaps at the end of
> > this month.
> >
> > However, this will almost certainly be a big bang switch, not a
> > gradual migration of worker nodes.
> >
> >> 3) To re-state that UK sites are generally opposed to running
> >> glexec on
> >> worker nodes (see this for background
> >> http://www.sysadmin.hep.ac.uk/wiki/Glexec). I have requested more
> >> information about specific objections via the T2 coordinators.
> >
> > Comments from an earlier email, with some clarifications (our
> > position hasn't altered):
> >
> > Begin forwarded message:
> >> We had a chat about glexec in our ScotGrid technical meeting
> >> yesterday.
> >>
> >> Summary: it's unacceptable for glexec to be deployed with suid
> >> privileges on our batch workers.
> >>
> >> The arguments have been made already on this thread, mainly by
> >> Kostas so there's little point in running over them in great
> >> detail again. However, briefly:
> >>
> >> 1. Edinburgh are integrating into a central university resource.
> >> glexec would not be acceptable to the system team.
> >
> > So here we _cannot_ run glexec. It's not our choice...
> >
> >> 2. Glasgow do control their resource, but all suid binaries on the
> >> batch workers are going to be turned off (sorry, no ping :-). We
> >> don't have confidence in glexec.
> >
> > It's just a foolish thing to do, in our opinion. SUID binaries are a
> > serious security risk. You just have to look at examples spread over
> > the years (sudo, suidperl) to see that code which has been available
> > for years can suddenly be discovered to be vulnerable. In addition,
> > even if the code is audited now, what guarantee do we have that
> > changes in the future won't open up attack vectors?
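[Editorial aside: the setuid exposure being discussed can be audited with a short script. Below is a minimal, illustrative sketch - the scanned path is an example only, and a site would review rather than blindly strip whatever it finds.]

```python
# Minimal sketch: list regular files with the setuid bit under a tree,
# so a site can review its exposure. The scanned path is illustrative.
import os
import stat

def find_setuid(root):
    """Return paths of regular files under `root` with the setuid bit set."""
    hits = []
    for dirpath, _subdirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: don't follow symlinks
            except OSError:
                continue  # unreadable or vanished mid-scan; skip
            if stat.S_ISREG(st.st_mode) and st.st_mode & stat.S_ISUID:
                hits.append(path)
    return hits

if __name__ == "__main__":
    for path in find_setuid("/usr/bin"):
        print(path)
```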
> >
> > Our opinion is that this is a problem of the VO's making (see 4).
> >
> >> 3. ...
> >
> > No longer an issue. glexec on the CE is different, because it's the
> > gatekeeper code which is being executed (to get the job into the
> > batch system), not the job payload. (A necessary evil here, we
> > believe...)
> >
> >> 4. What we want from pilot jobs is _traceability_, i.e., a record
> >> of whose payload was actually executed. Having glexec do suid
> >> twiddles is a baroque and dangerous way of achieving this. We'd be
> >> much happier with a query mechanism into the VO's job queue which
> >> allowed us to look at who delivered the payload. Far simpler and
> >> less dangerous, thanks. (Note, if the VOs insist on sending pilot
> >> jobs and getting themselves into a traceability pickle, then asking
> >> sites to sort out this mess by installing a suid binary for them is
> >> laughable. We hold them responsible for their, collective, actions.
> >> They have made their bed, let them lie in it - see the JSPG
> >> recommendations:
> >> http://www.sysadmin.hep.ac.uk/wiki/Pilot_Jobs#JSPG_.28Joint_Security_Policy_Group.29_Raccomandation)
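[Editorial aside: the query mechanism argued for above can be made concrete with a hypothetical sketch. Everything below - the job-queue mapping, the identifiers and the DN - is invented for illustration; no real VO interface is being described.]

```python
# Hypothetical sketch of the traceability query sites would prefer:
# given a pilot job's identifier, ask the VO's job queue who supplied
# the payload. The mapping, identifiers and DN below are all invented
# for illustration - this is not a description of any real VO service.
VO_JOB_QUEUE = {
    "pilot-20070703-0042": "/C=UK/O=eScience/OU=Example/CN=Some User",
}

def payload_owner(pilot_job_id):
    """Return the certificate DN of the payload submitter, or None."""
    return VO_JOB_QUEUE.get(pilot_job_id)
```

Incident response at a site then becomes a single lookup against the VO, rather than a setuid binary on every worker node.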
> >
> > We will continue to run pilot jobs, e.g., from LHCb. We just won't
> > let them suid themselves to other pool accounts.
> >
> > We echo Kostas' comments on how glexec interacts with the batch
> > system:
> >
> >
> > Begin forwarded message:
> >> How are they going to use the scratch area that the batch system
> >> allotted to the job, since it is running under another uid?
> >> How can the batch system kill the job if it exceeds the cpu limit?
> >> How can the batch system kill runaway process sessions at the end
> >> of the job?
> >> How can I keep accurate accounting for cpu/memory/io if the jobs
> >> aren't running under the control of the batch system?
> >> How can I prevent a pilot job from running N jobs instead of 1,
> >> stealing cpu cycles from the other jobs in the system, if they are
> >> not under the control of the batch system?
> >
> > Is that clear enough?
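[Editorial aside: the control problem behind those questions can be illustrated with a sketch. A batch system or cleanup script that finds a job's processes by uid will miss anything that has been re-owned to a different pool account. The /proc scan below is illustrative and Linux-only; it is not how any particular batch system actually tracks jobs.]

```python
# Sketch of the control problem: a sweep that finds a job's processes
# by uid cannot see processes re-owned to another account. The /proc
# scan is illustrative and Linux-only; it is not any particular batch
# system's actual tracking mechanism.
import os

def pids_owned_by(uid):
    """Return pids whose /proc entry is owned by `uid` (Linux only)."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            owner = os.stat("/proc/" + entry).st_uid
        except OSError:
            continue  # process exited while we were scanning
        if owner == uid:
            pids.append(int(entry))
    return pids

# A kill or accounting sweep over pids_owned_by(pilot_uid) silently
# misses any payload that switched itself to a different pool account.
```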
> >
> >> 4) Clarification on how vulnerabilities in experiment/VO code
> >> should be
> >> handled.
> >
> > Examples? It's up to the VOs to protect the resources we give them.
> > We'll bill them for everything ;-)
> >
> > Hope that helps
> >
> > Graeme
> >
> > --
> > Dr Graeme Stewart - http://wiki.gridpp.ac.uk/wiki/User:Graeme_stewart
> > ScotGrid - http://www.scotgrid.ac.uk/ http://scotgrid.blogspot.com/
> >
>
> --
> Alessandra Forti
> NorthGrid Technical Coordinator
> University of Manchester
>