On Thu, 3 Jan 2008, Condurache, C (Catalin) wrote:
> Happy New Year! first of all, then back to our usual problems :(
>
> On 31 Dec some multiple cancelations occurred on
> lcgrb01.gridpp.rl.ac.uk, and about 22000+ jobs piled up on this machine
> without being dispatched. Yesterday I had to ban new jobs coming on
> lcgrb01, waiting for the rest to be processed.
> The rate of processing is ~10k jobs / 24hrs so I have to wait for
> another day or so.
Indeed, a cancellation request involves a non-trivial amount of work,
comparable to that for a job submission...
> [...]
>
> My question: is there something that can be done to correct things, or to
> speed up the clearing process?
If the WM input.fl only contains cancellation requests, or if you do not
mind zapping the (re)submissions that are present, you could follow steps
1 through 5 detailed here:
http://goc.grid.sinica.edu.tw/gocwiki/Jobs_sent_to_my_RB_stay_in_Waiting_state_forever
If there is a cancellation pile-up in the JC queue.fl, the corresponding
recipe would be this:
----------------------------------------------------------------------
1. Comment out the /etc/cron.d/edg-wl-check-daemons cron job.

2. Stop all daemons that deal with the queue.fl:

       /etc/init.d/edg-wl-jc stop
       /etc/init.d/edg-wl-lm stop
       /etc/init.d/edg-wl-wm stop

3. Move the bad file out of the way:

       mv /var/edgwl/jobcontrol/queue.fl \
          /var/edgwl/jobcontrol/queue.fl.BAD

4. Restart the stopped daemons:

       /etc/init.d/edg-wl-jc start
       /etc/init.d/edg-wl-lm start
       /etc/init.d/edg-wl-wm start

5. Uncomment the /etc/cron.d/edg-wl-check-daemons cron job.
----------------------------------------------------------------------