My problem is now solved. The gocwiki did not include this particular
cause of the PeriodicHold problem. The correct answer was to remove
hanging condor jobs from the queue on the WMS, as Daniele had
indicated. I suspect that several WMSs still have hanging condor jobs
queued for Bergen; I will try to work out which ones and get in touch.

This problem started when our disk partition filled up and jobs
started hanging in the WN queue.
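
For anyone finding this thread later, here is my reading of the hold
expression quoted in the error below, written as a small Python sketch.
It is only an illustration of the logic, not anything taken from the WMS
or Condor code: the function name is made up, and None stands in for an
undefined Matched attribute. As far as I understand ClassAd semantics,
"=!=" means "is not identical to", so an undefined Matched also counts
as not matched.

```python
from typing import Optional

# Taken from the expression itself: QDate + 900 seconds (15 minutes).
HOLD_TIMEOUT_SECONDS = 900


def periodic_hold(matched: Optional[bool], current_time: int, q_date: int) -> bool:
    """Mimic 'Matched =!= TRUE && CurrentTime > QDate + 900'.

    matched=None is a stand-in (an assumption of this sketch) for an
    undefined ClassAd attribute.
    """
    # '=!=' is true whenever the value is anything other than TRUE,
    # including undefined.
    not_matched = matched is not True
    # The job has sat in the queue for more than 900 seconds.
    waited_too_long = current_time > q_date + HOLD_TIMEOUT_SECONDS
    return not_matched and waited_too_long
```

In other words, a job that has not been matched within 15 minutes of
being queued is put on hold, which fits what we saw once jobs stopped
moving through to the WN.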

Jeremy

On 05/12/06, Alessandro Paolini <[log in to unmask]> wrote:
>
>  Hi Jeremy,
>  maybe also this faq
> http://goc.grid.sinica.edu.tw/gocwiki/The_PeriodicHold_expression_'Matched_%3d%21%3d_TRUE_%26%26_CurrentTime_%3e_QDate_+_900'
>  can help you
>
>  Alessandro
>
>  Jeremy Cook wrote:
> Hi all,
>
>  I've been struggling since the middle of last week to understand why our
> gLite CE node does not work consistently anymore, and it's driving me nuts!
>
>  So far it seems to boil down to this, if I use the Northern ROC for my UI,
> as in:
>
>  RB_HOST=g03n03.pdc.kth.se
>  WMS_HOST=g03n06.pdc.kth.se
>
>  in the site-info.def for my UI then submitted jobs reach the CE, no auth
> errors, and the job-manager starts running, however nothing is submitted to
> the WNs, which are running on a separate cluster.
>
>  If I switch to:
>
>  WMS_HOST=rb103.cern.ch
>  RB_HOST=glite-rb.scai.fraunhofer.de
>
>  in the UI config and rerun the config site script then submitted jobs reach
> the glite CE *and* get submitted to the WN. You would think this points to
> some sort of config error in the WMS at PDC, however there doesn't seem to
> be any sort of consistent pattern.
>
>  I see a similar pattern from the log files for incoming atlas and bio jobs.
> Some are executed, others reach the CE but not the WN, and seemingly
> dependent on the "dispatching" WMS host (though not entirely consistently).
>
>  Also looking through the gCE SAM results I see one or two sites with
> similar errors to us, but not in any way that I would say makes a pattern.
>  This error seems to be significant:
>
>  - reason                  =    Got a job held event, reason: "The job
> attribute PeriodicHold expression 'Matched =!= TRUE && CurrentTime > QDate +
> 900' evaluated to TRUE"
>
>  But it seems that there may be many different reasons for getting such an
> error.
>
>  Anyone any clue as to what is going on here and where the problem might
> lie?
>
>  Jeremy
>
>  --
>  [log in to unmask]                        tlf: +47 55 58 40 65
>  Parallab                  Bergen Centre for Computational
> Science
>
>
>  --
> Dr. Alessandro Paolini
> INFN - CNAF
> Viale Berti Pichat 6/2
> 40127 Bologna
> Italy
> tel: +39 051 6092723
> fax: +39 051 6092746
> ICQ: 192172027
> **********************
> "I believe in the power of laughter and tears"
>  "as an antidote to hatred and terror"
>  "a day without a smile"
>  "is a day lost" >>> Charlie Chaplin
>


-- 
[log in to unmask]                        tlf: +47 55 58 40 65
Parallab                  Bergen Centre for Computational Science