Maarten Litmaath wrote:
> Steve Traylen wrote:
>
>> Hi,
>>
>> I've not seen this for a while: all jobs submitted to the RB
>> remain in the "Waiting" state forever.
>>
>> It had apparently gone away with recent versions of the resource
>> broker code.
>>
>> I've restarted every service there making sure they really are dead but
>> have had no joy.
>
>
> Jobs are waiting when the Workload Manager has not got to them yet,
> meaning they still sit in /var/edgwl/workload_manager/input.fl.
>
> In the past this could happen when there was a deadlock on the file,
> which is also used by the Network Server; check with:
>
> cat /proc/locks
>
> Recently we have seen that the matchmaking can become very slow,
> due to the BDII having a slapd cache size that is too small
> (fixed in LCG-2_3_1): can you check that setting?
>
> Does /var/edgwl/workload_manager/log/events.log show activity?

It may also be that the WM is continually restarting due to the
input.fl getting corrupted. That very thing just happened today
on the testzone RB lxn1188.cern.ch: from 10:03 until about now
the WM did not do *any* work... The bad input.fl got deleted
from the directory, but the *NS* was *not* restarted, so it merrily
continued writing new jobs into the deleted file... :-(
In any case we again have a bug in the handling of the input.fl...
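
Both checks mentioned in this thread (locks held on the input.fl, and a
process still writing into an input.fl that has been deleted) can be
sketched in shell. This is only a rough sketch, assuming a Linux /proc
filesystem and a readlink command; the function names locks_on_file and
deleted_open_files are made up for illustration and are not part of the
RB software.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the RB software).

# Print the lines of /proc/locks that refer to FILE's inode.
# Usage: locks_on_file FILE [LOCKS_FILE]
locks_on_file() {
    file=$1
    locks=${2:-/proc/locks}
    [ -e "$file" ] || return 1
    inode=$(ls -i "$file" | awk '{print $1}')
    # Each lock entry carries a MAJOR:MINOR:INODE field; its position
    # shifts for blocked entries ("->"), so scan every field.
    awk -v ino="$inode" '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^[0-9a-f]+:[0-9a-f]+:[0-9]+$/) {
                split($i, part, ":")
                if (part[3] == ino) print
            }
    }' "$locks"
}

# Print every open file descriptor whose target has been deleted.
# Usage: deleted_open_files [PROC_ROOT]
deleted_open_files() {
    proc_root=${1:-/proc}
    for fd in "$proc_root"/[0-9]*/fd/*; do
        # Unreadable or vanished entries are skipped silently.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            *" (deleted)") printf '%s -> %s\n' "$fd" "$target" ;;
        esac
    done
}

# Example (run as root to see the descriptors of other users):
#   locks_on_file /var/edgwl/workload_manager/input.fl
#   deleted_open_files | grep input.fl
```

The fourth field of a matching /proc/locks line is the PID holding the
lock. A hit from the second function on input.fl would mean that some
process still has the old, deleted file open, which is the situation
described above for the NS.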