On Mon, 4 Dec 2006, Jeremy Cook wrote:
> Daniele, I've seen other clues that might indicate that this is indeed my
> problem too. I'm not sure how to use this condor_rm command, is it something
> to execute on the CE or on the WMS?
>
> Jeremy
On the WMS, so only its admin can do that. This is a known bug that
is on the high-priority list.
> On 04/12/06, Daniele Cesini <[log in to unmask]> wrote:
> >
> > Hi Jeremy, I had the same problem. I do not know if you are in the
> > same situation, but the solution that worked for my pre-production CE was
> > to remove (through condor_rm) the condor launcher for my resource on the
> > WMS. When the launcher was re-created during the next submission, the job
> > went fine.
> > As far as I understand, the condor launcher on the WMS can become stuck
> > for various reasons; one of these is manually killing the jobmanager on
> > the gLite CE (a reboot of the CE with the job-managers running can
> > have the same effect).
> > Cheers,
> > Daniele.
> >
> > Jeremy Cook wrote:
> > > Hi all,
> > >
> > > I've been struggling since the middle of last week to understand why
> > > our gLite CE node does not work consistently anymore, and it's driving
> > > me nuts!
> > >
> > > So far it seems to boil down to this: if I use the Northern ROC for my
> > > UI, as in:
> > >
> > > RB_HOST=g03n03.pdc.kth.se
> > > WMS_HOST=g03n06.pdc.kth.se
> > >
> > > in the site-info.def for my UI, then submitted jobs reach the CE with no
> > > auth errors and the job-manager starts running, but nothing is
> > > submitted to the WNs, which are running on a separate cluster.
> > >
> > > If I switch to:
> > >
> > > WMS_HOST=rb103.cern.ch
> > > RB_HOST=glite-rb.scai.fraunhofer.de
> > >
> > > in the UI config and rerun the config site script then submitted jobs
> > > reach the glite CE *and* get submitted to the WN. You would think this
> > > points to some sort of config error in the WMS at PDC, however there
> > > doesn't seem to be any sort of consistent pattern.
> > >
> > > I see a similar pattern from the log files for incoming atlas and bio
> > > jobs. Some are executed, others reach the CE but not the WN, and
> > > seemingly dependent on the "dispatching" WMS host (though not entirely
> > > consistently).
> > >
> > > Also looking through the gCE SAM results I see one or two sites with
> > > similar errors to us, but not in any way that I would say makes a
> > > pattern.
> > > This error seems to be significant:
> > >
> > > - reason = Got a job held event, reason: "The job
> > > attribute PeriodicHold expression 'Matched =!= TRUE && CurrentTime >
> > > QDate + 900' evaluated to TRUE"
> > >
> > > But it seems that there may be many different reasons for getting such
> > > an error.
> > >
> > > Anyone any clue as to what is going on here and where the problem
> > > might lie?
> > >
> > > Jeremy
> > >
> > > --
> > > [log in to unmask] tlf: +47 55 58 40 65
> > > Parallab Bergen Centre for Computational Science
> >
>
>
>
>
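
For readers puzzling over the PeriodicHold expression quoted in the original report ('Matched =!= TRUE && CurrentTime > QDate + 900'), here is a minimal Python sketch of what it appears to say: the job is put on hold if it has not been matched and more than 900 seconds have passed since it was queued. The function name, the use of None for an undefined attribute, and the plain integer timestamps are assumptions made for illustration; this is not Condor code, only a restatement of the logic.

```python
from typing import Optional

# The 900-second limit is taken from the quoted expression.
HOLD_AFTER_SECONDS = 900


def periodic_hold(matched: Optional[bool], current_time: int, qdate: int) -> bool:
    """Restate 'Matched =!= TRUE && CurrentTime > QDate + 900'.

    matched is None when the attribute is undefined (an assumption of this
    sketch). The ClassAd '=!=' operator means "is not identical to", so an
    undefined or FALSE Matched attribute both count as "not TRUE".
    """
    not_matched = matched is not True
    waited_too_long = current_time > qdate + HOLD_AFTER_SECONDS
    return not_matched and waited_too_long


# A job queued at t=0 that is still unmatched at t=1000 would be held;
# the same job, once matched, would not.
print(periodic_hold(None, 1000, 0))
print(periodic_hold(True, 1000, 0))
```

Read this way, the held event only says that the job sat unmatched for more than 15 minutes. It does not say why, which fits the observation that many different problems can produce the same message.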