On Fri, Mar 04, 2005 at 08:55:43PM +0100 or thereabouts, Maarten Litmaath wrote:
> Steve Traylen wrote:
>
> >Hi,
> >
> > I've not seen this for a while: all jobs submitted to the RB
> > remain in the "Waiting" state forever.
Hi Maarten,
This has happened again.
> >
> > It had apparently gone away with recent versions of the Resource
> > Broker code.
> >
> > I've restarted every service there, making sure each one really was
> > dead, but have had no joy.
>
> Jobs are waiting when the Workload Manager has not got to them yet,
> meaning they still sit in /var/edgwl/workload_manager/input.fl.
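Because waiting jobs are still sitting in input.fl, one rough check on whether the Workload Manager is reading it at all is to sample the file's size twice. This is a minimal sketch that assumes (it is not confirmed in this thread) that the file stops growing, or shrinks, once the WM consumes entries. `queue_size` and `check_queue` are hypothetical helper names.

```shell
# queue_size FILE: print the size of FILE in bytes.
queue_size() {
  wc -c < "$1" | tr -d ' '
}

# check_queue FILE [SECONDS]: sample FILE twice, SECONDS apart (default 30).
# Returns 1 if it grew, 0 if not, 2 if FILE is unreadable.
# Assumption: a steadily growing input.fl means nothing is consuming it.
check_queue() {
  queue=$1
  interval=${2:-30}
  [ -r "$queue" ] || { echo "cannot read $queue" >&2; return 2; }
  before=$(queue_size "$queue")
  sleep "$interval"
  after=$(queue_size "$queue")
  echo "$queue: $before -> $after bytes in ${interval}s"
  if [ "$after" -gt "$before" ]; then
    echo "queue is growing: the Workload Manager may not be consuming it"
    return 1
  fi
  return 0
}
```

For example, `check_queue /var/edgwl/workload_manager/input.fl 60` while new jobs are being submitted.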
>
> In the past this could happen when there was a deadlock on the file,
> which is also used by the Network Server; check with:
>
> cat /proc/locks
1: POSIX ADVISORY WRITE 32199 03:02:1292115 0 EOF f1b57be0 c03a9448 c3775e24 00000000 f1b57bec
2: POSIX ADVISORY WRITE 2966 03:02:2093748 0 EOF c3775e20 f1b57be4 c3775fa4 00000000 c3775e2c
3: FLOCK ADVISORY WRITE 2768 03:02:2093745 0 EOF c3775fa0 c3775e24 c03a9448 00000000 c3775fac
and
# lslk
SRC             PID DEV    INUM SZ TY M ST WH END LEN NAME
(unknown)      2768 3,2 2093745     w 0  0  0   0   0 / (rootfs)
atd            2966 3,2 2093748  5  w 0  0  0   0   0 /var/run/atd.pid
condor_master 32199 3,2 1292115  0  w 0  0  0   0   0 /opt/condor/var/condor/log/InstanceLock
Steve
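In the `lslk` output above, the FLOCK held by PID 2768 on inode 2093745 could not be mapped to a path. One way to settle whether any lock in /proc/locks sits on input.fl is to compare inodes directly. This is a minimal sketch; `locks_on_file` is a hypothetical helper. It compares only the inode, not the device, so a hit should be confirmed with lslk or lsof.

```shell
# locks_on_file FILE [LOCKS]: print entries in LOCKS (default /proc/locks)
# whose inode equals FILE's inode, as "held|waiting pid=PID CLASS MODE".
# Returns 1 if nothing matches, 2 if FILE does not exist.
# Caveat: the device part of MAJ:MIN:INODE is ignored.
locks_on_file() {
  target=$1
  locks=${2:-/proc/locks}
  [ -e "$target" ] || { echo "no such file: $target" >&2; return 2; }
  inode=$(ls -di "$target" | awk '{print $1}')
  awk -v ino="$inode" '
    {
      # Blocked waiters carry an extra "->" field, so locate the
      # MAJ:MIN:INODE field by its shape rather than a fixed column.
      for (i = 5; i <= NF; i++) {
        if ($i ~ /^[0-9a-fA-F]+:[0-9a-fA-F]+:[0-9]+$/) {
          split($i, part, ":")
          if (part[3] == ino) {
            state = ($2 == "->") ? "waiting" : "held"
            printf "%s pid=%s %s %s\n", state, $(i - 1), $(i - 4), $(i - 2)
            found = 1
          }
          break
        }
      }
    }
    END { exit (found ? 0 : 1) }
  ' "$locks"
}
```

For example, `locks_on_file /var/edgwl/workload_manager/input.fl`; no output (exit status 1) means no lock on that inode was listed.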
>
> Recently we have seen that the matchmaking can become very slow,
> due to the BDII having a slapd cache size that is too small
> (fixed in LCG-2_3_1): can you check that setting?
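The thread gives neither the value used in LCG-2_3_1 nor the location of the BDII's slapd.conf, so this sketch only reports what is configured. It assumes the BDII runs OpenLDAP with a `cachesize` directive in slapd.conf (the number of entries kept in the back-end's memory cache); `slapd_cachesize` is a hypothetical helper.

```shell
# slapd_cachesize CONF: print every "cachesize" value set in an OpenLDAP
# slapd.conf (one per database section), or "unset" if none is present,
# in which case the back-end's built-in default applies.
slapd_cachesize() {
  conf=$1
  [ -r "$conf" ] || { echo "cannot read $conf" >&2; return 2; }
  awk '
    # Commented-out lines start with "#", so their $1 never matches.
    tolower($1) == "cachesize" { print $2; seen = 1 }
    END { if (!seen) print "unset" }
  ' "$conf"
}
```

Pass it the path of the slapd.conf the BDII actually starts with.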
>
> Does /var/edgwl/workload_manager/log/events.log show activity?
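A quick yes/no answer to that question is whether events.log has been written to recently. This minimal sketch uses find's `-mmin`, which GNU and BSD find support but POSIX does not require; `log_active` is a hypothetical helper, and a fresh timestamp only shows that something was logged, not what.

```shell
# log_active FILE [MINUTES]: succeed if FILE was modified within the last
# MINUTES minutes (default 5); return 2 if FILE does not exist.
log_active() {
  log=$1
  minutes=${2:-5}
  [ -e "$log" ] || { echo "no such file: $log" >&2; return 2; }
  # find prints the path only when the -mmin test matches.
  [ -n "$(find "$log" -mmin "-$minutes" 2>/dev/null)" ]
}
```

For example, `log_active /var/edgwl/workload_manager/log/events.log 10 || echo "no events for 10 minutes"`; `tail -f` on the same file shows the activity live.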
--
Steve Traylen
http://www.gridpp.ac.uk/