Without any intention to terminate a valuable discussion, but feeling I do not have current, technical understanding to add to it (I can't read fast enough to keep up here!), I can offer to try to co-ordinate the production of a (hopefully) reasonably concise summary of the issues raised for some future consensus. Does that sound like a good idea?
I'd also be interested to know if there are sites with hard site-policy objections to running glexec in setuid/non-logging mode?
-- Ian Neilson
-- GridPP Security Officer
-- [log in to unmask]
-- +44 7554 107 132
> -----Original Message-----
> From: Testbed Support for GridPP member institutes [mailto:TB-
> [log in to unmask]] On Behalf Of Gareth Roy
> Sent: 07 August 2015 09:50
> To: [log in to unmask]
> Subject: Re: glexec for small VOs
> Hi Ewan,
> > On 6 Aug 2015, at 16:57, Ewan MacMahon
> <[log in to unmask]> wrote:
> >> -----Original Message-----
> >> From: Testbed Support for GridPP member institutes [mailto:TB-
> >> [log in to unmask]] On Behalf Of Gareth Roy
> >> For a start there is a significant difference between a multiuser pilot
> >> and a multi-VO pilot. In a multiuser pilot, a privilege escalation due to
> >> credential theft can only raise the ability of that user within that VO
> >> (i.e. gaining a production role for instance) while in a multi-VO pilot a
> >> privilege escalation could give a role (of any type) within a VO that user
> >> is not part of. Now you might say that’s not a huge difference but I feel
> >> the second is worse than the first as at least the VO in question has some
> >> oversight.
> > I think that's something worth thinking about, taking into account that aspect
> of it along with how hard it would be to have separate pilot DNs per VO, but it
> doesn't affect the action being asked of sites - a site needs to have a mechanism
> for isolating user payloads, glExec or otherwise, in either case.
> I agree, however sites already have a mechanism for isolating payloads with the
> pool account model where one DN is mapped to a known pool account (usually
> via ARGUS). With pilots that breaks down and there is no longer a 1:1 mapping,
> hence glExec which I understood to be more about auditing and traceability than
> isolation since pilots used to run without accessing glExec (and still do at sites
> without it configured i.e. tarball sites). My issue here is we are moving one step
> further in removing that mapping as we now don’t even have a 1:1 mapping of
> VO to pilot, making traceability even more difficult without some intervention
> from a site. What is actually being asked is for a site to modify how they isolate
> user payloads and to accept some responsibility for auditing job submission and
> Take a hypothetical example: a pilot lands, the payload steals the pilot proxy
> (before glExec due to a vulnerability in the pilot) and uses it to download
> another user’s payload… then in fact uses glExec to pivot to that pool account
> and start running as that user. How do we as a site audit that? If the pilot is
> restricted to one VO we know who the pilot came from and potentially can gain
> help from that VO, but from a multi-VO pilot how do we even tell which VO
> instigated the initial launch? Does DIRAC have the records? If it does, any security
> breach now requires the Imperial admins to be in the loop and to be ready to
> help as well as VO managers. Do the Imperial guys want to have that level of
> responsibility? Whereas with WMS it was useful to have admins in the loop, now
> it is essential!
> I know this is unlikely but having looked at the DIRAC source code it looks like
> the pilot is partially dynamic and has chunks created on the fly at submission
> time? Have I understood that correctly, or am I totally in left field? It’s been a
> long time since I’ve actively used Python. I assume that, as part of LHCb, DIRAC
> as a whole has had a security audit, but bugs are bugs.
> >> Once credentials have been shipped by DIRAC to a site it’s
> >> similar to the issues with DRM, everything that’s needed to run the code
> >> is in place and the submission system or even the VO doesn’t have control
> >> anymore.
> > That's true, but I think the DRM comparison is useful - whoever controls the
> hardware controls the security, that's why it's the site's responsibility to provide
> this isolation. Site admins and the systems we run are unavoidably trusted in this
> system, we can already steal credentials that wind up on our nodes, and we
> can misconfigure our nodes in a wide variety of ways. We just have to not.
> Besides which, glExec is pretty easy to configure - you just YAIM it.
> Unless you’re trying to remove YAIM because it’s likely to be unsupported, or
> you’d like better control of the components than YAIM affords, for instance
> adding post auditing hooks to glExec calls (which as an aside looks very cool and
> definitely something we’re going to look at if we run multi-VO pilots). As a site
> admin I don’t trust a user’s payload… but I have to make allowances to actually
> run jobs. In this case as a site admin I’m being asked to have an additional pilot
> framework run at the site that has the capability to run a setuid binary. I just
> want to make sure I understand the risks and can appropriately protect myself
> before that happens. As you said in DRM it’s the hardware’s responsibility and to
> be honest the best solution they’ve come up with is to have half the data live on
> a remote server and be streamed based on arcane access controls :D
> >> I know, in general, all these issues are the same as running “big” VO jobs
> >> but in my limited experience working with the Grid we’ve had little
> >> problem with large VOs doing something naughty (due to other constraints
> >> and security mechanisms) and a lot of problems with small VO users doing
> >> things they shouldn’t with little oversight by VO managers.
> > I think that's fair, and that's partly why we probably don't want to fudge this
> and not bother doing it right any more; it is more important to
> > box in the minor VOs than the big ones.
> Oh I agree entirely, this is very much something I don’t want fudged… I would
> like to really understand the risks and the mitigation so I know where to point
> my monitoring tools and audits so I can catch something when (not if) it goes