I guess I am stuck on the fence about this - which I acknowledge isn't a
useful, helpful or comfortable place to be.

Basically I am uneasy about glexec for the reasons outlined, but
pragmatically I may be prepared to run it if it were essential to do so.
However, despite reading all the bumf I am none the wiser really about
how comfortable we will be with this WRT the operational issues,
like the ability to kill jobs, trace users etc.

Really, without testing at RAL I would not be prepared to buy a pig in a
poke. We had no effort available some months back to do this, but could
do so now via a variety of routes - for example the PPS.

Andrew

> -----Original Message-----
> From: Testbed Support for GridPP member institutes 
> [mailto:[log in to unmask]] On Behalf Of Alessandra Forti
> Sent: 04 July 2007 08:50
> To: [log in to unmask]
> Subject: Re: UK input to tomorrow's WLCG GDB
> 
> 
> Hi John,
> 
> indeed. The wiki is not complete, and it is there to be completed.
> Developers were asked by the TCG to insert their information,
> but haven't done it so far. And I have already asked the dteam twice
> to put in their own while we were discussing this, but nobody has
> done it so far.
> 
> cheers
> alessandra
> 
> Gordon, JC (John) wrote:
> > Thanks Graeme, I knew this had been discussed at length, but when
> > speaking in a meeting one can't just say "follow this thread". I
> > checked the wiki and it doesn't go into this detail. Jeremy needs
> > the good summary you give.
> > 
> > John
> > 
> > -----Original Message-----
> > From: Testbed Support for GridPP member institutes 
> > [mailto:[log in to unmask]] On Behalf Of Graeme Stewart
> > Sent: 03 July 2007 17:27
> > To: [log in to unmask]
> > Subject: Re: UK input to tomorrow's WLCG GDB
> > 
> > On 3 Jul 2007, at 16:29, Coles, J (Jeremy) wrote:
> > 
> >> Dear All
> >>
> >> Tomorrow there is a GDB (happens monthly as I hope you
> >> know!) at CERN
> >> with the following agenda: 
> >> http://indico.cern.ch/conferenceDisplay.py?confId=8485
> >>
> >> If you have any important issues that you would like raised/
> >> discussed in relation to any of these items (or others), please
> >> let me know. Current items to be taken up from the UK include:
> >>
> >> 1) Confirmation of experiment readiness to move to SL4
> >>
> >> 2) Confirmation that a well defined list of rpms required by the 
> >> experiments but not in the standard SL4 installation is available 
> >> (either as a list in the VO ID card for the experiment or as an 
> >> experiment meta-package).
> > 
> > If ATLAS and LHCb say that they are ready to move on this then
> > Glasgow are prepared to go early - perhaps at the end of this
> > month.
> > 
> > However, this will almost certainly be a big bang switch, not a
> > gradual migration of worker nodes.
> > 
> >> 3) To re-state that UK sites are generally opposed to running
> >> glexec on
> >> worker nodes (see this for background
> >> http://www.sysadmin.hep.ac.uk/wiki/Glexec). I have requested more
> >> information about specific objections via the T2 coordinators.
> > 
> > Comments from an earlier email, with some clarifications  (our
> > position hasn't altered):
> > 
> > Begin forwarded message:
> >> We had a chat about glexec in our ScotGrid technical meeting
> >> yesterday.
> >>
> >> Summary: it's unacceptable for glexec to be deployed with suid
> >> privileges on our batch workers.
> >>
> >> The arguments have already been made on this thread, mainly by
> >> Kostas, so there's little point in running over them in great
> >> detail again. However, briefly:
> >>
> >> 1. Edinburgh are integrating into a central university resource.
> >> glexec would not be acceptable to the system team.
> > 
> > So here we _cannot_ run glexec. It's not our choice...
> > 
> >> 2. Glasgow do control their resource, but all suid binaries on the
> >> batch workers are going to be turned off (sorry, no ping :-). We  
> >> don't have confidence in glexec.
> > 
> > It's just a foolish thing to do, in our opinion. SUID binaries are a
> > serious security risk. You just have to look at examples spread over
> > the years (sudo, suidperl) to see that code which has been available
> > for years can suddenly be discovered to be vulnerable. In addition,
> > even if the code is audited now, what guarantee do we have that
> > changes in the future won't open up attack vectors?
> > 
> > Our opinion is that this is a problem of the VO's making (see 4).
> > 
> >> 3. ...
> > 
> > No longer an issue. glexec on the CE is different, because it's the
> > gatekeeper code which is being executed (to get the job into the  
> > batch system), not the job payload. (A necessary evil here, we  
> > believe...)
> > 
> >> 4. What we want from pilot jobs is _traceability_, i.e., a record
> >> of whose payload was actually executed. Having glexec do suid
> >> twiddles is a baroque and dangerous way of achieving this. We'd be
> >> much happier with a query mechanism into the VO's job queue which
> >> allowed us to look at who delivered the payload. Far simpler and
> >> less dangerous, thanks. (Note, if the VOs insist on sending pilot
> >> jobs and getting themselves into a traceability pickle, then asking
> >> sites to sort this mess out by installing a suid binary for them is
> >> laughable. We hold them responsible for their collective actions.
> >> They have made their bed, let them lie in it - see the JSPG
> >> recommendations: http://www.sysadmin.hep.ac.uk/wiki/
> >> Pilot_Jobs#JSPG_.28Joint_Security_Policy_Group.29_Raccomandation)
> > 
> > We will continue to run pilot jobs, e.g., from LHCb. We just won't
> > let them suid themselves to other pool accounts.
> > 
> > We echo Kostas' comments on how glexec interacts with the batch
> > system:
> > 
> > 
> > Begin forwarded message:
> >> How are they going to use the scratch area that the batch system
> >> allotted to the job, since it is running under another uid?
> >> How can the batch system kill the job if it exceeded the cpu limit?
> >> How can the batch system kill runaway process sessions at the end
> >> of the job?
> >> How can I keep accurate accounting for cpu/memory/io if the jobs  
> >> aren't
> >> running under the control of the batch system?
> >> How can I prevent the pilot job running N jobs instead of 1,
> >> stealing cpu cycles from the other jobs in the system, if they are
> >> not under the control of the batch system?
> > 
> > Is that clear enough?
> > 
> >> 4) Clarification on how vulnerabilities in experiment/VO code
> >> should be
> >> handled.
> > 
> > Examples? It's up to the VOs to protect the resources we give them.
> > We'll bill them for everything ;-)
> > 
> > Hope that helps
> > 
> > Graeme
> > 
> > --
> > Dr Graeme Stewart - http://wiki.gridpp.ac.uk/wiki/User:Graeme_stewart
> > ScotGrid - http://www.scotgrid.ac.uk/ http://scotgrid.blogspot.com/
> > 
> 
> -- 
> Alessandra Forti
> NorthGrid Technical Coordinator
> University of Manchester
>