I guess I am stuck on the fence about this - which I acknowledge isn't a
useful, helpful or comfortable place to be. Basically I am uneasy about
glexec for the reasons outlined, but pragmatically may be prepared to run
it if it were essential to do so. However, despite reading all the bumf, I
am none the wiser really about how comfortable we will be with this WRT
the operational issues like the ability to kill jobs, trace users etc.
Really, without testing at RAL I would not be prepared to buy a pig in a
poke. We had no effort available some months back to do this, but could
do so now via a variety of routes - for example the PPS.

Andrew

> -----Original Message-----
> From: Testbed Support for GridPP member institutes
> [mailto:[log in to unmask]] On Behalf Of Alessandra Forti
> Sent: 04 July 2007 08:50
> To: [log in to unmask]
> Subject: Re: UK input to tomorrow's WLCG GDB
>
> Hi John,
>
> indeed. The wiki is not complete, and it is there to be completed.
> Developers were asked by the TCG to insert their information, but
> haven't done it so far. And I have already asked the dteam twice to
> put in their own while we were discussing this, but nobody has done
> it so far.
>
> cheers
> alessandra
>
> Gordon, JC (John) wrote:
> > Thanks Graeme, I knew this had been discussed at length, but when
> > speaking in a meeting one can't say "just follow this thread". I
> > checked the wiki and it doesn't go into this detail. Jeremy needs
> > the good summary you give.
> >
> > John
> >
> > -----Original Message-----
> > From: Testbed Support for GridPP member institutes
> > [mailto:[log in to unmask]] On Behalf Of Graeme Stewart
> > Sent: 03 July 2007 17:27
> > To: [log in to unmask]
> > Subject: Re: UK input to tomorrow's WLCG GDB
> >
> > On 3 Jul 2007, at 16:29, Coles, J (Jeremy) wrote:
> >
> >> Dear All
> >>
> >> Tomorrow there is a GDB (happens monthly, as I hope you know!) at
> >> CERN with the following agenda:
> >> http://indico.cern.ch/conferenceDisplay.py?confId=8485
> >>
> >> If you have any important issues that you would like
> >> raised/discussed in relation to any of these items (or others)
> >> please let me know. Current items to be taken up from the UK
> >> include:
> >>
> >> 1) Confirmation of experiment readiness to move to SL4
> >>
> >> 2) Confirmation that a well-defined list of rpms required by the
> >> experiments but not in the standard SL4 installation is available
> >> (either as a list in the VO ID card for the experiment or as an
> >> experiment meta-package).
> >
> > If ATLAS and LHCb say that they are ready to move on this, then
> > Glasgow are prepared to go early - perhaps at the end of this
> > month.
> >
> > However, this will almost certainly be a big-bang switch, not a
> > gradual migration of worker nodes.
> >
> >> 3) To re-state that UK sites are generally opposed to running
> >> glexec on worker nodes (see this for background:
> >> http://www.sysadmin.hep.ac.uk/wiki/Glexec). I have requested more
> >> information about specific objections via the T2 coordinators.
> >
> > Comments from an earlier email, with some clarifications (our
> > position hasn't altered):
> >
> > Begin forwarded message:
> >> We had a chat about glexec in our ScotGrid technical meeting
> >> yesterday.
> >>
> >> Summary: it's unacceptable for glexec to be deployed with suid
> >> privileges on our batch workers.
> >>
> >> The arguments have been made already on this thread, mainly by
> >> Kostas, so there's little point in running over them in great
> >> detail again. However, briefly:
> >>
> >> 1. Edinburgh are integrating into a central university resource.
> >> glexec would not be acceptable to the system team.
> >
> > So here we _cannot_ run glexec. It's not our choice...
> >
> >> 2. Glasgow do control their resource, but all suid binaries on the
> >> batch workers are going to be turned off (sorry, no ping :-). We
> >> don't have confidence in glexec.
> >
> > It's just a foolish thing to do, in our opinion. SUID binaries are
> > a serious security risk. You just have to look at examples spread
> > over the years (sudo, suidperl) to see that code which has been
> > available for years can suddenly be discovered to be vulnerable.
> > In addition, even if the code is audited now, what guarantee do we
> > have that changes in the future won't open up attack vectors?
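> >
> > Finding the binaries to strip is the easy part. As a minimal sketch
> > (illustrative only - the pruned paths are assumptions, and this
> > isn't our production cleanup script), an audit of a worker node for
> > setuid binaries looks something like:
> >
> >   #!/usr/bin/env python
> >   # Sketch: walk the filesystem and report regular files with the
> >   # setuid bit set. A pre/post-cleanup audit, not a hardened tool.
> >   import os
> >   import stat
> >
> >   SKIP = ('/proc', '/sys', '/dev')  # pseudo-filesystems: skip them
> >
> >   def suid_files(top='/'):
> >       for root, dirs, files in os.walk(top):
> >           if root in SKIP:
> >               dirs[:] = []  # prune: don't descend into these trees
> >               continue
> >           for name in files:
> >               path = os.path.join(root, name)
> >               try:
> >                   st = os.lstat(path)  # don't follow symlinks
> >               except OSError:
> >                   continue  # unreadable entries are just skipped
> >               mode = st.st_mode
> >               if stat.S_ISREG(mode) and (mode & stat.S_ISUID):
> >                   yield path, oct(stat.S_IMODE(mode))
> >
> >   if __name__ == '__main__':
> >       for path, mode in suid_files():
> >           print('%s %s' % (path, mode))
> >
> > Anything that turns up either loses its suid bit or has to justify
> > its existence to us.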
> >
> > Our opinion is that this is a problem of the VOs' making (see 4).
> >
> >> 3. ...
> >
> > No longer an issue. glexec on the CE is different, because it's the
> > gatekeeper code which is being executed (to get the job into the
> > batch system), not the job payload. (A necessary evil here, we
> > believe...)
> >
> >> 4. What we want from pilot jobs is _traceability_, i.e., a record
> >> of whose payload was actually executed. Having glexec do suid
> >> twiddles is a baroque and dangerous way of achieving this. We'd
> >> be much happier with a query mechanism into the VO's job queue
> >> which allowed us to look at who delivered the payload. Far
> >> simpler and less dangerous, thanks. (Note: if the VOs insist on
> >> sending pilot jobs and getting themselves into a traceability
> >> pickle, then asking sites to sort out this mess by installing a
> >> suid binary for them is laughable. We hold them responsible for
> >> their collective actions. They have made their bed, let them lie
> >> in it - see the JSPG recommendations:
> >> http://www.sysadmin.hep.ac.uk/wiki/Pilot_Jobs#JSPG_.28Joint_Security_Policy_Group.29_Raccomandation)
> >
> > We will continue to run pilot jobs, e.g., from LHCb. We just won't
> > let them suid themselves to other pool accounts.
> >
> > We echo Kostas' comments on how glexec interacts with the batch
> > system:
> >
> > Begin forwarded message:
> >> How are they going to use the scratch area that the batch system
> >> allotted to the job, since it is running under another uid?
> >> How can the batch system kill the job if it exceeds the cpu limit?
> >> How can the batch system kill runaway process sessions at the end
> >> of the job?
> >> How can I keep accurate accounting for cpu/memory/io if the jobs
> >> aren't running under the control of the batch system?
> >> How can I prevent the pilot job running N jobs instead of 1,
> >> stealing cpu cycles from the other jobs in the system, if they
> >> are not under the control of the batch system?
> >
> > Is that clear enough?
> >
> >> 4) Clarification on how vulnerabilities in experiment/VO code
> >> should be handled.
> >
> > Examples? It's up to the VOs to protect the resources we give
> > them. We'll bill them for everything ;-)
> >
> > Hope that helps
> >
> > Graeme
> >
> > --
> > Dr Graeme Stewart - http://wiki.gridpp.ac.uk/wiki/User:Graeme_stewart
> > ScotGrid - http://www.scotgrid.ac.uk/ http://scotgrid.blogspot.com/
>
> --
> Alessandra Forti
> NorthGrid Technical Coordinator
> University of Manchester