On 8 December 2014 at 13:26, Andrew McNab <[log in to unmask]> wrote:
> On 8 Dec 2014, at 10:55, Sam Skipsey <[log in to unmask]> wrote:
>> On 5 December 2014 at 19:25, Andrew McNab <[log in to unmask]> wrote:
>>> On 5 Dec 2014, at 16:30, Sam Skipsey <[log in to unmask]> wrote:
>>>> As requested in the meeting, here's a presentation-format discussion
>>>> of non-Grid-Proxy mechanisms as an idea. (I think Andy and I were
>>>> talking across purposes for some of our disagreement, so hopefully
>>>> this clarifies the position.)
>>>
>>> The central problem is how a third party service like a storage server can trust that a particular job is acting on behalf of a user. This always requires either giving a secret to the job (e.g. a password, or a proxy in a sandbox) or doing something equivalent to proper proxy delegation in which the private keys never go over the network.
>>>
>>> As I said during the meeting, the problem with your model is that the signature of the job becomes a replayable secret like a password.
>>
>> No, it doesn't (or rather, "replayable secrets" are pretty much a
>> solved problem - you just embed a sequence number into the secret to
>> prevent it being replayed, and then you just have to save a sequence
>> number, which is less effort than your strawman below).
>
> Trying to patch the inherent replay-ability of signed jobs with things like sequence numbers means you have to go further and further in the direction of deciding what service instances you’re going to be talking to before submitting the job. That makes it harder and harder to build resilient systems that can cope with changing patterns of load at different places, and operations that fail and need retrying. With another pre-created signature and sequence number somehow? That the storage at some other site has to keep track of too somehow? How do you differentiate between a failover to another storage site and a malicious replay? Some protocol for storage sites to talk to each other about this in the background? In big distributed systems like ours this gets very complicated very quickly.
>
But also, replay is not a serious problem for read requests for WLCG
VOs (and write requests, if made idempotent, reduce at worst to a race
condition and can't usefully be replayed).
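To make the sequence number point concrete, here is a minimal sketch of
the check I mean on the storage side. HMAC with a shared key stands in
for whatever real signature scheme would actually be verified, and all
the names are illustrative rather than any real SE interface:

import hashlib
import hmac

SECRET = b"stand-in key"  # in reality: verify against the user's public key
last_seq = {}             # identity -> highest sequence number accepted so far

def sign(identity, seq, payload):
    return hmac.new(SECRET, identity + str(seq).encode() + payload,
                    hashlib.sha256).hexdigest()

def accept(identity, seq, payload, signature):
    # Accept only if the signature checks out AND the sequence number is
    # strictly greater than anything already seen for this identity.
    if not hmac.compare_digest(sign(identity, seq, payload), signature):
        return False       # forged or corrupted request
    if seq <= last_seq.get(identity, -1):
        return False       # replayed (or stale) request
    last_seq[identity] = seq   # the only per-user state the SE keeps
    return True

# The first presentation succeeds; an identical replay is refused:
s = sign(b"CN=some user", 1, b"read /lhcb/some/file")
assert accept(b"CN=some user", 1, b"read /lhcb/some/file", s)
assert not accept(b"CN=some user", 1, b"read /lhcb/some/file", s)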
>> Note that I do address this later in the slides - and, in any case,
>> you can only do "malicious" versions of the writes that you explicitly
>> allow. [A "malicious" read, for WLCG at least, is not anywhere near as
>> much of a problem.]
>
> You can try to patch it by locking things down with more and more restrictions written ahead of time about what will be done where. But this makes a system which is more and more brittle.
>
Constraints do unavoidably increase the brittleness of a system, yes.
However, the degree of brittleness, once again, has to be balanced
against the security benefits.
>> I also strongly question your assertion that proxies are based on a
>> "secret which never needs to be disclosed" - it's very clearly the
>> case that in almost all implementations of proxy delegation systems,
>> the proxy itself is sent over the wire to authorise everything.
>
> This isn’t the case with DIRAC for instance. So the pilot job proxy is properly delegated via the CREAM CE on the way in, and then that proxy is used to get a proxy for the user from the DIRAC ProxyManager. The proxy manager client generates a key and certificate request on the worker node, sends the request only to the ProxyManager, which signs it to make the additional proxy certificate and sends the now lengthened proxy chain back to the client on the WN. The WN now has a proxy chain going all the way back to the user, and a private key that matches it that has never gone over the network. It’s the same process in the VMs for DIRAC too, except using an original credential owned by the resource provider.
>
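(For concreteness, the local step you describe - as I understand it -
is roughly the following. This is a sketch using the generic Python
'cryptography' package rather than DIRAC's actual ProxyManager client,
so the names and subject are purely illustrative:)

from cryptography import x509
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import rsa
from cryptography.x509.oid import NameOID

# 1. On the worker node: generate a fresh key pair and a signing request.
wn_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
csr = (x509.CertificateSigningRequestBuilder()
       .subject_name(x509.Name([x509.NameAttribute(NameOID.COMMON_NAME,
                                                    u"limited proxy")]))
       .sign(wn_key, hashes.SHA256()))

# 2. Only the request (public material) is sent to the ProxyManager...
request_pem = csr.public_bytes(serialization.Encoding.PEM)

# 3. ...which signs it with the user's stored credential and returns the
#    lengthened proxy chain. wn_key itself never goes over the network.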
And this isn't an extremely complicated long-chain solution that
introduces multiple points of potential 'brittleness'? [It also relies
on a central authority, the DIRAC ProxyManager, which apparently needs
to be able to delegate potentially any user's proxy to any pilot proxy
that talks to it. That makes pilot proxies potentially even more
disastrous to go missing than normal user proxies, as apparently you
can then get proxies for any user that the DIRAC ProxyManager knows
about?] It also doesn't even (except potentially on VMs) solve the
problem that the presentation is mostly concerned with: throwing
unlimited user proxies around on systems you don't control - in fact,
in this case, you've also given the system a pilot proxy that can be
used to gain credentials for a huge number of other users. That looks
like you've significantly *widened* the attack surface for proxies,
for an attacker on a WN. [This was the original issue which led to the
whole glExec push, as I recall.]
(I'd also like to know how this is handled in the DIRAC interfaces for
non-CREAM CE systems - the CondorCE and ARC CE both do proxy
delegation slightly differently to CREAM.)
>>> Furthermore, I don’t really buy your answer to minimise the scope of replay attacks on slide 19 that asserts “well written code will always know which files it will need before execution, and which files it expects to produce when complete. We can therefore map these to a series of Transactions: Source -> Destination”. Pilot jobs are so successful because they allow late binding, and moving to event servers pushes that binding even later. We want jobs to have more flexibility not less about what they do when they land on a worker node in response to the available cores, memory, network conditions, and what has already been processed.
>>
>> Certainly, I accept that there are people who want that. You will have
>> to accept that I disagree as to if this is actually necessary - and
>> certainly as to if it is necessary to use a pilot job mechanism to do it.
>
> And experiments have clearly voted with their feet on late binding over the years for good operational reasons. It’s a required feature not just one alternative. Brittle, “ideal world” systems aren’t any use to us. We went through all that with the original EDG push model.
>
I agree that the EDG push model was poorly designed. Its faults,
however, mostly lay in the fact that it tried to be a centralised
"complete-knowledge" system in a distributed environment where the
amount of knowledge was large and unmanageable. That is rather
different to the issue you have with the transactional/signing
proposal here.
The security approaches involved in signed transactions are no more
centralised than the DIRAC approach appears to be (both of them
admittedly require some central trusted authority to arbitrate some
security delegation - in the case of the transactional approach, some
trusted file access redirector needs to be available to point at SEs).
The constraint application here is not even explicitly opposed to late
binding - you could move much of the transactional approach to signing
'workpackets', rather than 'jobs' as you're thinking of them, and gain
flexibility. [At some point, there must be *some* level of the
experiment workflow which is actually semi-deterministic code,
otherwise your management problems must be truly fantastic.]
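To sketch what I mean by a signed 'workpacket' (the field names, LFNs
and the JSON-plus-Ed25519 choice below are purely illustrative, not a
concrete proposal):

import json
from cryptography.hazmat.primitives.asymmetric import ed25519

user_key = ed25519.Ed25519PrivateKey.generate()

workpacket = {
    "inputs":   ["lfn:/lhcb/data/run1234/raw.dst"],        # known up front
    "outputs":  ["lfn:/lhcb/user/s/someuser/run1234.out"],  # preallocated name
    "sequence": 42,                                         # anti-replay, as above
    "not_after": "2014-12-09T13:26:00Z",                    # bounded validity
}
payload = json.dumps(workpacket, sort_keys=True).encode()
signature = user_key.sign(payload)

# The SE (or the trusted redirector) verifies against the user's known
# public key and only permits the listed source -> destination transfers;
# verify() raises InvalidSignature if the packet has been tampered with.
user_key.public_key().verify(signature, payload)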
(Even when performing elastic Monte Carlo production, which I know LHCb
is interested in, I don't see that providing a preallocated filename
for the output would necessarily prevent the job itself from
dynamically adjusting its runtime to the time remaining to it.)
Sam
> Cheers
>
> Andrew
> --
> Dr Andrew McNab, High Energy Physics,
> University of Manchester, UK, M13 9PL.
> Skype and Google Talk: andrew.mcnab.uk
> Web: www.hep.manchester.ac.uk/u/mcnab