


Been away, hence the delay in seeing this...

The RTM is limited to getting its information from events in the Logging and Bookkeeping system.  If no event arrives to change a job's state, then the job remains in that state as far as the RTM is concerned...


I know there are various problems with this.  For example, a crashed/rebooted CE will never report completion of the jobs that were Running on it, and there are various other things that can go wrong, as pointed out in this thread.

Therefore I set a maximum time for a job to be in a given state in the RTM before it is thrown out.  For the Scheduled and Running states (which are potentially very long) this is a (possibly overkill) time of 2 weeks.  So any large discrepancies due to malfunctions could remain "reported" in the RTM for that time.  Jobs in Submitted/Ready/Waiting have a maximum time in the RTM of one day.
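
As a rough illustration of the timeout rule described above (this is not the actual RTM code), the pruning could look something like the sketch below.  The names MAX_AGE and prune, and the layout of the jobs dictionary, are made up for the example.  Keeping jobs whose state has no listed limit is also an assumption, since only the five states above are mentioned.

```python
from datetime import datetime, timedelta

# Maximum time a job may sit in a state before it is dropped.
# The limits are the ones quoted above; the table itself is hypothetical.
MAX_AGE = {
    "Submitted": timedelta(days=1),
    "Ready": timedelta(days=1),
    "Waiting": timedelta(days=1),
    "Scheduled": timedelta(weeks=2),
    "Running": timedelta(weeks=2),
}


def prune(jobs, now):
    """Return only the jobs that have not outstayed their current state.

    jobs maps a job id to (state, time of the last state-change event).
    A job with no new event keeps its old state, so only its age can
    remove it.  States with no entry in MAX_AGE are kept (an assumption).
    """
    kept = {}
    for job_id, (state, since) in jobs.items():
        limit = MAX_AGE.get(state)
        if limit is None or now - since <= limit:
            kept[job_id] = (state, since)
    return kept
```

With limits like these, a job orphaned in Running by a crashed CE would still be counted for up to two weeks after its last event, which is the kind of discrepancy being discussed here.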



On Mon, 24 Sep 2007, Condurache, C (Catalin) wrote:

> Hi,
> We (and some users complained as well) are experiencing at RAL a problem
> with one of the RBs ( It seems that jobs are
> left indefinitely in the Running status, even if the SandboxDir gets
> eventually populated with the expected output (and error) information.
> The RBmonitoring framework does not report anything unusual
> (, but trying to
> compare with data from
> I
> found big difference between numbers of Running jobs for lcgrb01 (194 on
> RB framework vs 2859 on gridportal) and no differences between the
> numbers on the other two RBs we have at RAL (lcgrb02 263 vs 256, lcgrb03
> 107 vs 115)
> I have restarted the lcgrb01 machine but with no effect.
> Has anyone an idea here?
> Many thanks,
> Catalin Condurache
> Tier1 Grid Services Team RAL


Dr. Gidon Moont
High Energy Physics Group, The Blackett Laboratory
Imperial College London, South Kensington Campus
Prince Consort Road, LONDON SW7 2BW
Tel: +44 (0)207 594 7810