Hi,
I'm following up on the RAL RB lcgrb01 problem.
Yesterday I noticed with 'top' that edg-wl-renewd was using about 50%
CPU. More worryingly, the /var/edgwl/workload_manager/input.fl file had
not been modified for about 90 minutes. All edg-wl-* services were
reported to be running.
I restarted the edg-wl-proxyrenewal service, and immediately the
input.fl file started being refreshed again and the CPU use of
edg-wl-renewd dropped. The backlog of a few hundred jobs then cleared
within several hours.
Is this a known issue, or did the problem lie elsewhere?
Regards,
Catalin
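
P.S. In case it helps anyone watching for the same symptom, here is a
minimal sketch of the timestamp check I did by hand. The path is the one
mentioned above; the 30-minute threshold and the function names are my
own assumptions for illustration, not anything taken from the edg-wl
tools.

```python
import os
import time

# Path taken from the message above.
INPUT_FL = "/var/edgwl/workload_manager/input.fl"
# Assumption: anything not updated for more than 30 minutes is suspicious.
MAX_AGE_SECONDS = 30 * 60


def file_age_seconds(path, now=None):
    """Return the number of seconds since `path` was last modified."""
    if now is None:
        now = time.time()
    return now - os.stat(path).st_mtime


def is_stale(path, max_age=MAX_AGE_SECONDS, now=None):
    """Return True if `path` has not been modified within `max_age` seconds."""
    return file_age_seconds(path, now) > max_age


if __name__ == "__main__":
    if os.path.exists(INPUT_FL):
        age_min = file_age_seconds(INPUT_FL) / 60
        state = "STALE" if is_stale(INPUT_FL) else "ok"
        print(f"{INPUT_FL}: last modified {age_min:.0f} min ago ({state})")
    else:
        print(f"{INPUT_FL} not found")
```

Run from cron, something like this would have flagged the ~90 minute
gap well before the backlog built up, although a sensible threshold
depends on how busy the RB normally is.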
> -----Original Message-----
> From: LHC Computer Grid - Rollout [mailto:LCG-
> [log in to unmask]] On Behalf Of Maarten Litmaath
> Sent: 24 September 2007 17:25
> To: [log in to unmask]
> Subject: Re: [LCG-ROLLOUT] Jobs on Running status forever on lcg-RB
>
> Takahashi, Maiko wrote:
>
> > Hi Catalin,
> >
> > I am running CMS MC production jobs via RAL RB, and I have been seeing
> > strange behaviour in the last one week or so. I could be contributing to
> > those jobs left in running status for long, and was wondering if I could
> > get some help.
> >
> > It seems that quite a large fraction of jobs hang at "Ready" state for a
> > long time. An example of the grid status is as follows.
> >
> > -bash-3.00$ edg-job-status
> > https://lcgrb01.gridpp.rl.ac.uk:9000/y1UXIL-IGpXTdIbT5Hk-XA
> > *************************************************************
> > BOOKKEEPING INFORMATION:
> > Status info for the Job :
> > https://lcgrb01.gridpp.rl.ac.uk:9000/y1UXIL-IGpXTdIbT5Hk-XA
> > Current Status: Ready
> > Status Reason: unavailable
> > Destination: ce00.hep.ph.ic.ac.uk:2119/jobmanager-sge-72hr
> > reached on: Fri Sep 21 09:16:10 2007
> > *************************************************************
>
> Possible explanations:
>
> 1. A big backlog. Catalin, is the LogMonitor running OK?
> Check that it is not crashing all the time.
>
> 2. Some MySQL table reached a maximum size or became corrupted,
> or the /var/lib/mysql file system became full.
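
The file-system part of point 2 can be checked with something like the
sketch below. The /var/lib/mysql path comes from the reply above; the
95% threshold and the function names are assumptions chosen for
illustration. It does not look at MySQL table sizes or corruption, which
need to be checked from within MySQL itself.

```python
import shutil

# Path taken from the quoted reply.
MYSQL_DIR = "/var/lib/mysql"
# Assumption: treat 95% used or more as "nearly full".
FULL_THRESHOLD = 0.95


def used_fraction(path):
    """Return the used fraction (0.0 to 1.0) of the file system holding `path`."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.used / usage.total


def is_nearly_full(path, threshold=FULL_THRESHOLD):
    """Return True if the file system holding `path` is at or above `threshold`."""
    return used_fraction(path) >= threshold


if __name__ == "__main__":
    try:
        frac = used_fraction(MYSQL_DIR)
        state = "NEARLY FULL" if frac >= FULL_THRESHOLD else "ok"
        print(f"{MYSQL_DIR}: {frac:.0%} used ({state})")
    except FileNotFoundError:
        print(f"{MYSQL_DIR} not found")
```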