Some more information on this.
We have monitored one of the WNs with a 'vanished' job, running its
condor job. The job completed its run and started again with a new
job. Looking at our torque logs, the original job arrived 20/10, was
started, ran for a day, was requeued (!) for some reason on 22/10 and
completed with exit status 0 later the same day having used its 72:00 hr
wall time. Thus any work thereafter is due to the job shutdown failing.
Now I don't propose to accuse anyone of cheating here; however, it is
clear that the job termination mechanism of these particular jobs should
be looked at.
Meanwhile, I propose to terminate any jobs running from 'vanished'
legitimate jobids.
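
A minimal sketch of the check I have in mind, not a tested cleanup tool.
It assumes we can already obtain (a) the list of job IDs Torque still
reports as running and (b) a mapping from each process on the WN to the
job ID that started it. How that mapping is obtained is left open; the
function name and the job IDs in the example are hypothetical.

```python
def find_orphans(active_jobids, processes):
    """Return the PIDs whose originating job ID is no longer active.

    active_jobids: job IDs the batch system still reports as running.
    processes: (pid, jobid) pairs observed on the worker node.
    Both inputs are assumed to be gathered elsewhere.
    """
    active = set(active_jobids)
    # A process is an orphan if the job that started it has 'vanished'
    # from the batch system's point of view.
    return sorted(pid for pid, jobid in processes if jobid not in active)


if __name__ == "__main__":
    # Illustrative values only, not taken from our logs.
    running = ["1002.ce", "1003.ce"]
    seen = [(4211, "1002.ce"), (4388, "0997.ce"), (4390, "0997.ce")]
    for pid in find_orphans(running, seen):
        print("candidate for termination:", pid)
```

The function only reports candidates; the actual kill is kept separate so
the list can be checked by hand first.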
Martin.
> -----Original Message-----
> From: LHC Computer Grid - Rollout
> [mailto:[log in to unmask]] On Behalf Of Bly, MJ (Martin)
> Sent: 26 October 2006 13:32
> To: [log in to unmask]
> Subject: Re: [LCG-ROLLOUT] Problems with 'rogue' Condor
> processes on WNs
>
> All,
>
> The big problem we have encountered here is that for whatever reason,
> we have orphaned processes still acquiring 'condor' work after the
> Torque/Maui system thinks the 'pilot' job has gone away, and thus
> Torque/Maui starts another job in its place. We end up with
> 'overloaded' CPUs (that is, more than one job per execution core) and
> our alarm system notes this and alerts us to the 'vanished' work.
>
> I suspect I am going to have to use David's cleanup script to try and
> trap this.
>
> Martin.
>
> > -----Original Message-----
> > From: LHC Computer Grid - Rollout
> > [mailto:[log in to unmask]] On Behalf Of Jeff Templon
> > Sent: 26 October 2006 12:25
> > To: [log in to unmask]
> > Subject: Re: [LCG-ROLLOUT] Problems with 'rogue' Condor
> > processes on WNs
> >
> > Hi Maxim,
> >
> > Maxim Kovgan wrote:
> >
> > > This is why gLite/LCG WN should not run condor_schedd, only
> > > condor_master and condor_startd: only CE shall be able to actually
> > > submit into the condor pool, and so you have control where it is
> > > submitted from, accounted etc.
> >
> > > Because CE submits to condor AFTER the user has been given UID:GID
> > > acc. to the security context got from gt authentication/authorization
> > > mechanisms...
> > >
> > > And CE should run only condor_master and condor_schedd
> > >
> > > Is there any flaw in my line of thought ?
> >
> > I am not sure I completely understand it :-) Realize though that what
> > is happening is that a user is building a virtual condor pool by
> > acquiring WNs through standard grid techniques. From a site point of
> > view, as long as the user is doing this his/her self, it's fine -- I
> > mean that it is a single user who submits all the grid jobs and also
> > runs the jobs in the virtual condor pool, and does all this with her
> > own Grid X.509 cert.
> >
> > Our native system is torque/maui and this is not being circumvented;
> > the condor daemons are only started after the job is scheduled by
> > maui and executed by torque.
> >
> > Does this make it more clear? We don't have condor at all as a batch
> > system, aside from the standard condor-G stuff in the lcg-CE.
> >
> > JT
> >
>