Steve Traylen wrote:
> On Fri, Mar 04, 2005 at 08:55:43PM +0100 or thereabouts, Maarten Litmaath wrote:
>
>>Steve Traylen wrote:
>>
>>
>>>Hi,
>>>
>>>I've not seen this for a while where all jobs submitted to RB
>>>remain in the "Waiting" state for ever.
>>
>
> Hi Maarten,
>
> This has happened again.
>
>>>It had apparently gone away with recent versions of the resource broker
>>>code.
>>>
>>>I've restarted every service there making sure they really are dead but
>>>have had no joy.
>>
>>Jobs are waiting when the Workload Manager has not got to them yet,
>>meaning they still sit in /var/edgwl/workload_manager/input.fl.
>>
>>In the past this could happen when there was a deadlock on the file,
>>which is also used by the Network Server; check with:
>>
>> cat /proc/locks
>
>
> 1: POSIX ADVISORY WRITE 32199 03:02:1292115 0 EOF f1b57be0 c03a9448 c3775e24 00000000 f1b57bec
> 2: POSIX ADVISORY WRITE 2966 03:02:2093748 0 EOF c3775e20 f1b57be4 c3775fa4 00000000 c3775e2c
> 3: FLOCK ADVISORY WRITE 2768 03:02:2093745 0 EOF c3775fa0 c3775e24 c03a9448 00000000 c3775fac
>
> and
>
> # lslk
> SRC PID DEV INUM SZ TY M ST WH END LEN NAME
> (unknown) 2768 3,2 2093745 w 0 0 0 0 0 / (rootfs)
> atd 2966 3,2 2093748 5 w 0 0 0 0 0 /var/run/atd.pid
> condor_master 32199 3,2 1292115 0 w 0 0 0 0 0 /opt/condor/var/condor/log/InstanceLock
>
> Steve
>
>
>>Recently we have seen that the matchmaking can become very slow,
>>due to the BDII having a slapd cache size that is too small
>>(fixed in LCG-2_3_1): can you check that setting?
>>
>>Does /var/edgwl/workload_manager/log/events.log show activity?
So, what does /var/edgwl/workload_manager/log/events.log say?
I have attached a script "chk-wl.sh" that shows the state of affairs
on a single page.
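
For reference, the /proc/locks check quoted above can be narrowed to a
single file by comparing inode numbers. What follows is only a minimal
sketch, assuming the MAJOR:MINOR:INODE field format seen in the output
above. The function name locks_for_inode is made up for illustration,
and this is not the attached chk-wl.sh.

```shell
#!/bin/sh
# Sketch: print the entries of a /proc/locks-style file that refer to a
# given inode. "locks_for_inode" is a hypothetical helper name.
# Usage: locks_for_inode INODE [LOCKS_FILE]
locks_for_inode() {
    inode=$1
    locks_file=${2:-/proc/locks}
    # Look for the MAJOR:MINOR:INODE field in any position, because some
    # kernels prefix blocked entries with "->", which shifts the columns.
    awk -v ino="$inode" '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^[0-9a-f]+:[0-9a-f]+:[0-9]+$/) {
                n = split($i, part, ":")
                if (part[n] == ino) { print; next }
            }
        }
    }' "$locks_file"
}

# Path taken from the thread; the check is skipped if the file is absent.
f=/var/edgwl/workload_manager/input.fl
if [ -e "$f" ]; then
    locks_for_inode "$(ls -i "$f" | awk '{ print $1 }')"
fi
```

If this prints nothing on the RB, no process holds a lock on input.fl,
and the deadlock explanation can probably be ruled out.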