On 3 Jul 2007, at 16:29, Coles, J (Jeremy) wrote:
> Dear All
>
> Tomorrow there is a GDB (happens monthly as I hope you know!) at CERN
> with the following agenda:
> http://indico.cern.ch/conferenceDisplay.py?confId=8485
>
> If you have any important issues that you would like raised/discussed
> in relation to any of these items (or others) please let me know.
> Current items to be taken up from the UK include:
>
> 1) Confirmation of experiment readiness to move to SL4
>
> 2) Confirmation that a well-defined list of rpms required by the
> experiments but not in the standard SL4 installation is available
> (either as a list in the VO ID card for the experiment or as an
> experiment meta-package).
If ATLAS and LHCb say that they are ready to move on this then
Glasgow are prepared to go early - perhaps at the end of this month.
However, this will almost certainly be a big-bang switch, not a
gradual migration of worker nodes.
>
> 3) To re-state that UK sites are generally opposed to running glexec
> on worker nodes (see this for background:
> http://www.sysadmin.hep.ac.uk/wiki/Glexec). I have requested more
> information about specific objections via the T2 coordinators.
Comments from an earlier email, with some clarifications (our
position hasn't altered):
Begin forwarded message:
> We had a chat about glexec in our ScotGrid technical meeting
> yesterday.
>
> Summary: it's unacceptable for glexec to be deployed with suid
> privileges on our batch workers.
>
> The arguments have already been made on this thread, mainly by
> Kostas, so there's little point in running over them in great detail
> again. However, briefly:
>
> 1. Edinburgh are integrating into a central university resource.
> glexec would not be acceptable to the system team.
So here we _cannot_ run glexec. It's not our choice...
>
> 2. Glasgow do control their resource, but all suid binaries on the
> batch workers are going to be turned off (sorry, no ping :-). We
> don't have confidence in glexec.
It's just a foolish thing to do, in our opinion. SUID binaries are a
serious security risk. You just have to look at examples spread over
the years (sudo, suidperl) to see that code which has been available
for years can suddenly be discovered to be vulnerable. In addition,
even if the code is audited now, what guarantee do we have that
changes in the future won't open up attack vectors?
Our opinion is that this is a problem of the VOs' own making (see
point 4 below).
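(As an aside, for anyone who wants to check their own workers: below
is a rough Python sketch of how a site might enumerate setuid binaries
on a node. The skip list and the choice of what to report are
illustrative assumptions, not our actual audit tooling.)

    import os
    import stat

    # Walk the filesystem and report regular files with the setuid bit
    # set. Illustrative only: a real audit would also check setgid,
    # and the exclusions below are just a guess at sensible defaults.
    SKIP = ("/proc", "/sys", "/dev")

    def suid_binaries(root="/"):
        for dirpath, dirnames, filenames in os.walk(root):
            if dirpath.startswith(SKIP):
                dirnames[:] = []  # don't descend into pseudo-filesystems
                continue
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    st = os.lstat(path)
                except OSError:
                    continue  # unreadable or vanished; skip it
                if stat.S_ISREG(st.st_mode) and st.st_mode & stat.S_ISUID:
                    yield path

    for path in suid_binaries():
        print(path)

Every binary on that list is a potential privilege escalation route,
which is why we want it as short as possible - ideally empty.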
> 3. ...
No longer an issue. glexec on the CE is different, because it's the
gatekeeper code which is being executed (to get the job into the
batch system), not the job payload. (A necessary evil here, we
believe...)
>
> 4. What we want from pilot jobs is _traceability_, i.e., a record of
> whose payload was actually executed. Having glexec do suid twiddles
> is a baroque and dangerous way of achieving this. We'd be much
> happier with a query mechanism into the VO's job queue which allowed
> us to look at who delivered the payload. Far simpler and less
> dangerous, thanks. (Note, if the VOs insist on sending pilot jobs
> and getting themselves into a traceability pickle then asking sites
> to sort out this mess by installing a suid binary for them is
> laughable. We hold them responsible for their, collective, actions.
> They have made their bed, let them lie in it - see the JSPG
> recommendations:
> http://www.sysadmin.hep.ac.uk/wiki/Pilot_Jobs#JSPG_.28Joint_Security_Policy_Group.29_Raccomandation)
We will continue to run pilot jobs, e.g., from LHCb. We just won't
let them suid themselves to other pool accounts.
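To make that concrete, here is the sort of interface we have in mind,
sketched in Python. Everything in it is invented for illustration (the
endpoint URL, the response format); as far as we know no such service
exists today.

    from urllib.parse import urlencode
    from urllib.request import urlopen

    # Hypothetical query into a VO's job queue: given the grid job ID
    # of a pilot that ran at our site, return the DN of the user whose
    # payload it executed. The URL and plain-text response are
    # assumptions made up for this sketch.
    VO_QUEUE = "https://jobqueue.example-vo.org/payload-owner"

    def payload_owner(pilot_job_id):
        query = urlencode({"pilot": pilot_job_id})
        with urlopen("%s?%s" % (VO_QUEUE, query)) as response:
            # e.g. "/DC=org/DC=vo/CN=Some User"
            return response.read().decode().strip()

The site already logs which pilot ran where; the VO answers who the
payload belonged to. Traceability, with no suid binary anywhere near
our worker nodes.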
We echo Kostas' comments on how glexec interacts with the batch system:
Begin forwarded message:
> How are they going to use the scratch area that the batch system
> allotted to the job since it is running under another uid?
> How can the batch system kill the job if it exceeds the cpu limit?
> How can the batch system kill runaway process sessions at the end of
> the job?
> How can I keep accurate accounting for cpu/memory/io if the jobs
> aren't running under the control of the batch system?
> How can I prevent a pilot job running N jobs instead of 1, stealing
> cpu cycles from the other jobs in the system, if they are not under
> the control of the batch system?
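If an illustration helps, here is a toy Python sketch of the first
problem (it must run as root, and the uids are invented): the batch
system hands the pilot's pool account a scratch directory, the payload
is switched to a different uid as glexec would do, and the payload can
no longer write there.

    import os

    PILOT_UID = 40001    # pool account the batch system allocated
    PAYLOAD_UID = 40002  # account the payload gets switched to

    scratch = "/tmp/job-scratch"
    os.mkdir(scratch, 0o700)          # scratch area for the job...
    os.chown(scratch, PILOT_UID, -1)  # ...owned by the pilot's account

    pid = os.fork()
    if pid == 0:
        os.setuid(PAYLOAD_UID)        # in effect, what glexec does
        try:
            open(os.path.join(scratch, "out.dat"), "w").close()
        except OSError as err:
            # EACCES: the payload is locked out of its own scratch area
            print("payload cannot write to scratch:", err)
        os._exit(0)
    os.waitpid(pid, 0)

The kill and accounting questions are the same story: the batch system
tracks processes by the uid it assigned, and the payload is no longer
running under it.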
Is that clear enough?
>
> 4) Clarification on how vulnerabilities in experiment/VO code should
> be handled.
Examples? It's up to the VOs to protect the resources we give them.
We'll bill them for everything ;-)
Hope that helps
Graeme
--
Dr Graeme Stewart - http://wiki.gridpp.ac.uk/wiki/User:Graeme_stewart
ScotGrid - http://www.scotgrid.ac.uk/ http://scotgrid.blogspot.com/