Hi Ewan,
This contrasts with my experience a few weeks ago, when a user's jobs
had a memory leak and were knocking over worker nodes.
Once I realised that the jobs were pilot jobs running under gLexec (seeing
loads of processes for that pool account on our worker nodes while qstat -u
came up empty gave me a few worried moments before I thought "Pilot!"),
I could then ban that DN in Argus to stop any more of their jobs starting
(though it would have helped if I'd remembered to flush the cache so the
ban took effect immediately).
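For the record, the ban-and-flush sequence looks roughly like this (commands are printed rather than executed here; the DN is a placeholder and the exact tool names/paths depend on your Argus install):

```shell
#!/bin/sh
# Sketch of banning a DN in Argus and flushing the caches so the ban
# takes effect immediately. The DN below is illustrative only.
DN="/DC=ch/DC=cern/OU=Users/CN=someuser"

# Build the command sequence as text (dry run) instead of running it.
STEPS="pap-admin ban dn \"$DN\"
pdpctl.sh reloadPolicy
pepdctl.sh clearResponseCache"

# First line adds the ban policy; the next two make the PDP reload it
# and clear the PEP daemon's cached authorisation decisions.
printf '%s\n' "$STEPS"
```

Without the reload and cache clear, previously authorised pilots can keep starting jobs until the cached decisions expire, which is exactly the delay I ran into.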
Then I wanted to try to save as many of the remaining worker nodes as
possible (a lot of our older workers don't have IPMI configured and I didn't
want to have to go over the road in the rain).
I couldn't just do a "qdel `qselect -u <user>`" because, as far as the batch
system knew, the user had no jobs; they seemed to be coming in through a
number of pilot accounts that were also running other users' jobs.
Eventually I resorted to running "pkill -u <user>" on all WNs (through ssh
and cfengine shell commands).
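The per-node step amounts to something like the following (a dry-run sketch: the node names and pool account are placeholders, and the commands are printed rather than run):

```shell
#!/bin/sh
# Dry-run sketch of killing one pool account's processes on every
# worker node over ssh. POOL_ACCOUNT and NODES are illustrative.
POOL_ACCOUNT="pilotatlas01"
NODES="wn001 wn002 wn003"

# Accumulate the commands as text instead of executing them; drop the
# accumulation and run ssh directly to do it for real.
CMDS=""
for wn in $NODES; do
    # pkill -u matches all processes owned by the pool account
    CMDS="${CMDS}ssh -o BatchMode=yes root@${wn} 'pkill -u ${POOL_ACCOUNT}'
"
done
printf '%s' "$CMDS"
```

Note pkill sends SIGTERM by default; a stubborn leaker may need a follow-up pass with `pkill -9`.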
So thanks to gLexec I was both able to prevent the problem getting any worse
once I had found it and minimise the disruption to other users.
I'm no great fan of gLexec but here it really helped.
Yours,
Chris.
> -----Original Message-----
> From: Testbed Support for GridPP member institutes [mailto:TB-
> [log in to unmask]] On Behalf Of Ewan MacMahon
> Sent: 06 September 2011 11:40
> To: [log in to unmask]
> Subject: Re: GridPP operations meeting at 11am today
>
> > -----Original Message-----
> > From: Testbed Support for GridPP member institutes [mailto:TB-
> > [log in to unmask]] On Behalf Of Daniela Bauer
> > Sent: 06 September 2011 10:30
> > To: [log in to unmask]
> > Subject: Re: GridPP operations meeting at 11am today
> >
> >
> > I've attached the file. It basically contains a statement that Atlas
> does
> > not want to use gLexec plus their reasoning.
> >
>
> Well, we do want them to use it, and here's my reasoning:
>
> - Contrary to what ATLAS claim, the panda mechanism does not
> provide useful traceability, as we recently found in an incident
> at Oxford in which a set of badly written user analysis jobs
> filled a worker node filesystem. The resulting files were owned
> by the generic pilot account, with no straightforward means to
> tie them to a particular user. Furthermore, even when we've
> managed to trace trouble to a specific batch job, PANDA has
> given us mappings to several different users' analysis payloads.
> To find out who's responsible for any misbehavior requires an
> essentially statistical approach of querying a lot of suspect
> jobs and seeing whose name comes up most often.
>
> - Panda does not provide useful user banning. In the above incident
> I could have recovered the site by banning the badly behaved user;
> given that site-admins don't have access to panda's user banning
> feature, and given that ATLAS have proven unwilling to use it on
> a site's behalf in the past, we have no way to deal with such
> problems. The ATLAS paper says:
> "Any Grid site should ban the ATLAS pilot DN or if necessary
> the entire ATLAS VO in case there is a suspicion of compromised
> credentials or illegal usage of resources at the site."
> However, we have in fact done this in the past under similar
> circumstances and the response from ATLAS was not positive, to say
> the least. Besides which, we don't want to do this - we're trying
> to run an ATLAS service here, after all. In practice this proposal
> is simply unrealistic - banning all analysis isn't going to happen
> for anything but the most dramatic security incidents, and ATLAS
> know this.
>
> ATLAS' inability to make their software run with glExec is their
> problem, and they need to fix it, or provide a fully equivalent
> mechanism. Not bothering because it's too much like hard work is
> not an acceptable option.
>
> Ewan