Hi David,
I cannot comment on the policy as I haven't seen it yet. I don't think
it has been broadcast to the site security lists for comments yet.
I don't want to ban pilot jobs; I just don't want them to make glexec
compulsory. You might have your own reasons to want it; I have my reasons
not to want it (I don't give sudo privileges to users).
As for the paper, I'll be glad to see it when it comes out. I was
expecting you to have at least started writing on the wiki in the meantime.
cheers
alessandra
David Groep wrote:
> Hi Alessandra,
>
> It's good to see that at least there is a substantial discussion
> about glexec, the code and code review aspects, and the requirements
> and deployment scenarios. There are two or three threads of discussion in
> your email that I think it's good to separate out:
>
> - code review and quality
> What you are looking at now is code that is in preview stage as
> far as gLite is concerned: it's installed on selected preview testbed
> sites, and is being considered for inclusion in subsequent releases.
> I feel, and I think you also express that feeling, that before it goes
> to production, such a security-critical piece of code should be
> reviewed by external people. I would like to see and encourage such
> a review, as is being done now on the VOMS code, before it goes
> to the sites.
>
> The fact that FNAL deployed this preview version on their systems
> out of urgent necessity and for regulatory reasons, does not need
> to imply that any EGEE site should do likewise.
>
> - effects of glexec in operational environments
> In this respect, glexec behaves /almost/ as a standard sudo, in
> that it keeps the process tree (and thus the accounting), and
> if a job is killed by root, this killing will affect the entire
> process tree, including any processes with a different uid/gid.
> Of course, glexec does not give you additional protection
> against processes that daemonize, but it also does not lower the
> level of protection.
> Note that this is *independent* of any epilogues, batch system
> types, or even the fact that you are running any batch system. It
> also holds for jobs that are directly forked.
>
> The difference between glexec and sudo is that glexec will keep
> open any file descriptors across the uid change, so that you
> can communicate with your child (i.e. send it instructions or
> kill it externally).
>
> This all follows standard unix semantics, and is not specific to
> PBS, LSF, Condor, or any other system. And certainly it is not
> linked to the existence of any specific NIKHEF installation or
> epilogue scripts -- our SW development and operational activities
> are quite separated to prevent "leakage" of assumptions either way.
>
> - policy issues with respect to pilot jobs and glexec-on-WN
> scenarios
>
> These issues should not be raised (again) with the developers, but
> be brought to the attention of the policy bodies (GDB, ROC Managers,
> etc.). The JSPG is drafting a policy on this issue, as you are no doubt
> aware, that was already presented by Dave Kelsey to the LCG GDB.
> It will in due course also be presented to EGEE management &c.
> It leaves the sites the choice of several models (no glexec,
> non-privileged glexec, suid-glexec).
> There are several deployment scenarios, and each of them is
> suited to a specific operational, regulatory and legal environment
> that may exist at a particular site.
> Rightfully, IMHO, the draft policy outlines several options, and how
> the VO should in each of these conditions fully comply with the site
> requirements.
>
> This should also address the quality of (VO developed) software, and
> how that deals with their own security issues, much like the
> WMS/RB does today in other models. Also on this issue, there is a
> draft policy being circulated AFAIK.
>
> Then, whether or not you like or want to accept pilot
> jobs is a site choice. Some VOs, for some reason or another, seem
> to be quite fond of them, whilst indeed the majority of the VOs
> are satisfied with the regular submission model, or indeed could
> not do anything else since a pilot job model would impair too
> much functionality.
>
> What glexec gives you, is the possibility for policy-compliant VOs
> to add your site specific authorisation requirements, and a way to
> enforce your own policy on top of what the VO sends you. Via glexec,
> you gain (even in a non-setuid model) the possibility to inform the
> VO pilot job of your site's policy, so that a cooperating VO will then
> not start such a job.*
> By adding setuid capabilities, you gain the possibility to trace
> individual processes at the unix level on shared multi-user systems,
> such as batch nodes that are used by more than one job concurrently.
> If you have only one job per node (as is common in many low-latency
> HPC environments), the setuid capability is superfluous anyway, and
> in those cases I would personally recommend against setuid (but keep
> glexec in non-setuid mode to enforce my own authZ decisions and
> site-bin-lists).
>
> * if you find that a VO violates this policy, you can always ban the VO,
> and with a good reason...
>
> I hope you, and many others, will appreciate that a more in-depth
> paper on glexec will be forthcoming over the next two months, that should
> explain a bit more about the rationale and deployment models that led
> us to the development of this component.
>
> Cheers,
> DavidG.
>
>
> Alessandra Forti wrote:
>> Hi Oscar,
>>
>> since the fact that glexec is derived from suexec in one of the . Tell
>> me what you read in the first 100 or so lines
>>
>> http://httpd.apache.org/docs/2.0/suexec.html
>>
>> glexec code has never been extensively tested.
>>
>> Kostas has already found at least half a dozen problems and two or three
>> bugs just glancing at the code.
>>
>> In the past month we have had 3 cases of improperly set permissions
>> that allowed files to be deleted. I don't even want to think about this
>> happening when sites as big as Liverpool and Manchester deploy this stuff.
>>
>> Not all the sysadmins are acquainted with suexec configuration and
>> glexec configuration might be similar but surely it must have an extra
>> layer of complication since it is connected to lcas/lcmaps.
>>
>> The glexec executable should be called by user code. Have they
>> mastered delegation code? Or are they still planning to use gridftp to
>> download the proxies from the server? In any case I do not trust the
>> users to be able to write any secure code by default. It is simply not
>> ingrained in their mentality.
>>
>> Other problems are certainly the way VOs are trying to optimise job
>> submission on shared resources introducing extraneous software like
>> glexec on the worker nodes. Not all the clusters are dedicated, not
>> all the clusters use PBS on which you are basing most of your
>> deployment trials.
>>
>> In a previous email you stated that glexec doesn't interfere with the
>> normal batch system operation. Changing UID won't affect accounting,
>> automatic creation of directories, killing of daemonised and runaway
>> processes because anyway that's a problem that can be solved in the
>> epilogue script as it is done now because the sid tree is preserved.
>> Which means you are considering strictly speaking only the epilogue
>> NIKHEF is using. True, we have given it a big push to be deployed
>> elsewhere but this is not compulsory. And that is without counting that
>> different sites might use different batch systems.
>>
>> There is also the question that the pilot job in this scheme is not
>> run with a user proxy but with a service proxy. Tell me, what do you call
>> a job that can run for up to 72/92h (default cputime/walltime set up by
>> YAIM), contact services, pull other people's jobs and use other people's
>> proxies, all this while changing UID and without even an owner because
>> suddenly this is a service? To me it seems a VOBOX on the WN if the
>> word permanent is replaced by days.
>>
>> And if we reduce the queues to ~11 hours so that they can run only
>> one job? Where do they get the advantage in using this model, and why
>> should I introduce something as potentially dangerous as glexec on
>> hundreds of nodes? As a matter of fact I can't think about debugging a
>> problem with changes of IDs and file ownership in the log files.
>>
>> Certainly we also question the way users optimise their job submission
>> as we are all working on a project that established that a push model
>> was optimal while some users decided that they wanted a pull model on
>> top. So yes there are other problems, but the main one for me is a
>> setuid program on the WN. It beats me that people can't see it.
>>
>> cheers
>> alessandra
>>
>> Oscar Koeroo wrote:
>>
>>> Hi,
>>>
>>> If these questions are raised, then I think that sites should ban in-
>>> and outbound network connectivity from the WNs or use a network
>>> arbitrator.
>>>
>>> Technical software issues in glexec don't seem to be the core of
>>> the problem.
>>>
>>> Users (and their VOs) have their reasons for working around the
>>> regular queues to send work to a WN in a more optimal way, in the
>>> user perspective, for execution. We simply provide a tool that can
>>> perform needed authorization checks and user switching where it
>>> wasn't before.
>>>
>>> I don't understand how this tool would be going against site
>>> policies. It also serves both the site and the VOs themselves
>>> by giving them more control over what is executed by whom. Without
>>> it, you wouldn't know who has executed which part of the real user
>>> jobs' payload from within a pilot job.
>>>
>>> As I would understand it, the glexec tool would aid the security
>>> infrastructure by being able to tell more about the pilot job. This
>>> should be consistent with the VO's pilot job infrastructure.
>>>
>>>
>>> cheers,
>>>
>>> Oscar
>>>
>>>
>>>
>>>
>>>
>>> Cornwall, LA (Linda) wrote:
>>>
>>>> Dear UK TB support, GSVG RAT, Kostas, Alessandra, Oscar, SCG,
>>>>
>>>> It looks like multiple threads have developed concerning glexec, and in
>>>> summary the problems seem to be:--
>>>>
>>>> Pilot jobs turn the push model into a pull model, is this acceptable at
>>>> all?
>>>>
>>>> Does the Glexec/pilot job design in principle contradict security
>>>> requirements? They have not been updated for a while, but for
>>>> example I quote from the
>>>> EGEE(I) requirements
>>>> (https://edms.cern.ch/file/485295/1/EGEE-JRA3-TEC-485295-UserReq-v1-0.pdf)
>>>> In the Auditing requirements
>>>> "It must be possible to trace the distinguished name (DN) of the
>>>> certificate used for the original job submission."
>>>>
>>>> Does the Glexec/pilot job design in principle introduce vulnerabilities
>>>> that are inherent in the design, rather than being bugs that can be
>>>> fixed? If so, we have a serious vulnerability issue that needs careful
>>>> consideration with SCG, TCG and others, and a redesign/rewrite is
>>>> needed.
>>>>
>>>> Does the Glexec/pilot job design in principle contradict the agreed
>>>> policy?
>>>>
>>>> Does the way Glexec is being used by VOs contradict the agreed policy?
>>>> Is there something else wrong with glexec that is obvious to sites?
>>>> I can't help thinking if Kostas and Alessandra are not happy something
>>>> isn't right.
>>>>
>>>> Glexec has some implementation flaws, which can be fixed as
>>>> straightforward vulnerability bugs.
>>>>
>>>> It seems to me that something may have gone wrong between satisfying
>>>> security requirements, ensuring design flaws that cause vulnerabilities
>>>> are not present, ensuring design flaws that contradict policy needs
>>>> are not introduced... This is not just a UK TB matter, or just an
>>>> operational matter, but something that needs investigating to find
>>>> whether or not there is a serious problem.
>>>>
>>>> Linda
>>>>
>>>>
>>>>
>>
>
>
--
Alessandra Forti
NorthGrid Technical Coordinator
University of Manchester