All,
The big problem we have encountered here is that for whatever reason, we
have orphaned processes still acquiring 'condor' work after the
Torque/Maui system thinks the 'pilot' job has gone away, and thus
Torque/Maui starts another job in its place. We end up with
'overloaded' CPUs (that is, more than one job per execution core) and
our alarm system notes this and alerts us to the 'vanished' work.
I suspect I am going to have to use David's cleanup script to try and trap
this.
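
For illustration only, here is a minimal sketch of the kind of check that could flag such orphans on a WN. This is not David's script. The names Proc, parse_ps and find_orphans, the "condor_" prefix test, and the idea of passing in the set of users that Torque still has jobs for on the node are all assumptions made for the example. The input text is assumed to come from something like `ps -eo pid=,ppid=,user=,comm=`.

```python
from typing import Iterable, List, NamedTuple, Set


class Proc(NamedTuple):
    """One row of process-table output (hypothetical helper type)."""
    pid: int
    ppid: int
    user: str
    cmd: str


def parse_ps(text: str) -> List[Proc]:
    """Parse lines of 'pid ppid user command' into Proc records.

    Lines that do not have four fields or whose pid/ppid are not
    integers are skipped.
    """
    procs = []
    for line in text.splitlines():
        parts = line.split(None, 3)
        if len(parts) != 4:
            continue
        try:
            procs.append(Proc(int(parts[0]), int(parts[1]), parts[2], parts[3]))
        except ValueError:
            continue
    return procs


def find_orphans(procs: Iterable[Proc], active_users: Set[str]) -> List[Proc]:
    """Return condor_* processes whose owner has no active batch job here.

    active_users is the set of accounts the batch system still believes
    are running a job on this node (assumed to be supplied by the caller).
    """
    return [
        p for p in procs
        if p.cmd.startswith("condor_") and p.user not in active_users
    ]
```

Note the limitation: if the same account still has another legitimate job on the node, this test will not catch the leftover daemons from the earlier one.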
Martin.
> -----Original Message-----
> From: LHC Computer Grid - Rollout
> [mailto:[log in to unmask]] On Behalf Of Jeff Templon
> Sent: 26 October 2006 12:25
> To: [log in to unmask]
> Subject: Re: [LCG-ROLLOUT] Problems with 'rogue' Condor processes on WNs
>
> Hi Maxim,
>
> Maxim Kovgan wrote:
>
> > This is why gLite/LCG WN should not run condor_schedd, only
> > condor_master and condor_startd: only CE shall be able to actually
> > submit into the condor pool, and so you have control where it is
> > submitted from, accounted etc.
>
> > Because CE submits to condor AFTER the user has been given UID:GID
> > acc. to the security context got from gt authentication/authorization
> > mechanisms...
> >
> > And CE should run only condor_master and condor_schedd
> >
> > Is there any flaw in my line of thought ?
>
> I am not sure I completely understand it :-) Realize though that what is
> happening is that a user is building a virtual condor pool by acquiring
> WNs through standard grid techniques. From a site point of view, as
> long as the user is doing this his/her self, it's fine -- I mean that it
> is a single user who submits all the grid jobs and also runs the jobs in
> the virtual condor pool, and does all this with her own Grid X.509 cert.
>
> Our native system is torque/maui and this is not being circumvented; the
> condor daemons are only started after the job is scheduled by maui and
> executed by torque.
>
> Does this make it more clear? We don't have condor at all as a batch
> system, aside from the standard condor-G stuff in the lcg-CE.
>
> JT
>