I've sent Sanjay some data off-list.
> -----Original Message-----
> From: Sanjay Padhi [mailto:[log in to unmask]]
> Sent: 26 October 2006 17:21
> To: Bly, MJ (Martin)
> Subject: Re: [LCG-ROLLOUT] Problems with 'rogue' Condor
> processes on WNs
>
>
> Hi Martin,
>
> I did some investigation and found that a few of my jobs at:
>
> Name = "[log in to unmask]"
> Name = "[log in to unmask]"
> Name = "[log in to unmask]"
> Name = "[log in to unmask]"
> Name = "[log in to unmask]"
> Name = "[log in to unmask]"
> Name = "[log in to unmask]"
> Name = "[log in to unmask]"
> Name = "[log in to unmask]"
>
> seem to lose their parents. Is the email below directed towards one
> of those nodes? Traditionally, if the parent dies the child should go away.
> These are 9 cases, out of several thousand jobs, where it did not.
>
> It happened that, even though the parent died, the process ID owned by the
> parent was taken by someone else (anyone), and so the child still thinks
> that the parent is alive, so it lives on. I really need to investigate
> further.
>
> Regards,
>
> Sanjay
>
>
> On Thu, 26 Oct 2006, Bly, MJ (Martin) wrote:
>
> > Some more information on this.
> >
> > We have monitored one of the WNs with a 'vanished' job, running its
> > condor job. The job completed its job run and started again with a new
> > job. Looking at our torque logs, the original job arrived 20/10, was
> > started, ran for a day, was requeued (!) for some reason on 22/10 and
> > completed with exit status 0 later the same day having used its 72:00 hr
> > wall time. Thus any work thereafter is due to the job shutdown failing.
> >
> > Now I don't propose to accuse anyone of cheating here; however, it is
> > clear that the job termination mechanism of these particular jobs should
> > be looked at.
> >
> > Meanwhile, I propose to terminate any jobs running from 'vanished'
> > legitimate jobids.
> >
> > Martin.
> >
> > > -----Original Message-----
> > > From: LHC Computer Grid - Rollout
> > > [mailto:[log in to unmask]] On Behalf Of Bly,
> > > MJ (Martin)
> > > Sent: 26 October 2006 13:32
> > > To: [log in to unmask]
> > > Subject: Re: [LCG-ROLLOUT] Problems with 'rogue' Condor
> > > processes on WNs
> > >
> > > All,
> > >
> > > The big problem we have encountered here is that for whatever reason,
> > > we have orphaned processes still acquiring 'condor' work after the
> > > Torque/Maui system thinks the 'pilot' job has gone away, and thus
> > > Torque/Maui starts another job in its place. We end up with
> > > 'overloaded' CPUs (that is, more than one job per execution core) and
> > > our alarm system notes this and alerts us to the 'vanished' work.
> > >
> > > I suspect I am going to have to use David's cleanup script to try and
> > > trap this.
> > >
> > > Martin.
> > >
> > > > -----Original Message-----
> > > > From: LHC Computer Grid - Rollout
> > > > [mailto:[log in to unmask]] On Behalf Of Jeff Templon
> > > > Sent: 26 October 2006 12:25
> > > > To: [log in to unmask]
> > > > Subject: Re: [LCG-ROLLOUT] Problems with 'rogue' Condor
> > > > processes on WNs
> > > >
> > > > Hi Maxim,
> > > >
> > > > Maxim Kovgan wrote:
> > > >
> > > > > This is why gLite/LCG WN should not run condor_schedd, only
> > > > > condor_master and condor_startd: only CE shall be able to actually
> > > > > submit into the condor pool, and so you have control where it is
> > > > > submitted from, accounted etc.
> > > >
> > > > > Because CE submits to condor AFTER the user has been given UID:GID
> > > > > acc. to the security context got from gt
> > > > > authentication/authorization mechanisms...
> > > > >
> > > > > And CE should run only condor_master and condor_schedd
> > > > >
> > > > > Is there any flaw in my line of thought ?
> > > >
> > > > I am not sure I completely understand it :-) Realize though that what is
> > > > happening is that a user is building a virtual condor pool by acquiring
> > > > WNs through standard grid techniques. From a site point of view, as
> > > > long as the user is doing this him/herself, it's fine -- I mean that it
> > > > is a single user who submits all the grid jobs and also runs the jobs in
> > > > the virtual condor pool, and does all this with her own Grid X.509 cert.
> > > >
> > > > Our native system is torque/maui and this is not being circumvented; the
> > > > condor daemons are only started after the job is scheduled by maui and
> > > > executed by torque.
> > > >
> > > > Does this make it more clear? We don't have condor at all as a batch
> > > > system, aside from the standard condor-G stuff in the lcg-CE.
> > > >
> > > > JT
> > > >
> > >
> >
>
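
Editorial sketch of the PID-reuse effect Sanjay describes in the quoted message above. How the Condor processes in question actually test for their parent is not stated in this thread, so the code below is only an assumption-laden illustration for a POSIX system, and both function names are hypothetical. It contrasts a fragile check (does any process hold the remembered parent PID?) with a sturdier one (is my current parent still the one I remembered at startup?).

```python
import os


def parent_alive_by_pid_probe(recorded_ppid: int) -> bool:
    """Fragile check (hypothetical helper): asks only whether *some*
    process currently holds this PID. If the parent died and the PID was
    reused by an unrelated process, this still returns True."""
    try:
        os.kill(recorded_ppid, 0)  # signal 0 sends nothing, only tests existence
    except ProcessLookupError:
        return False  # no process has this PID
    except PermissionError:
        return True   # a process exists, but it belongs to another user
    return True


def parent_alive_by_getppid(recorded_ppid: int) -> bool:
    """Sturdier check (hypothetical helper): on POSIX an orphaned child is
    re-parented, so os.getppid() no longer equals the PID recorded at
    startup, regardless of who holds that PID now."""
    return os.getppid() == recorded_ppid


if __name__ == "__main__":
    ppid_at_startup = os.getppid()  # record once, as early as possible
    print("pid probe   :", parent_alive_by_pid_probe(ppid_at_startup))
    print("getppid test:", parent_alive_by_getppid(ppid_at_startup))
```

A child that polls with the second check would notice the loss of its parent even when the old PID has been handed to another process. It still has to record the parent PID before the parent can exit.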