Hi John
This is a long-standing issue with CREAM/PBS. Stephen opened a detailed ticket (https://ggus.eu/tech/ticket_show.php?ticket=72506) but it is not fixed yet. At Oxford we regularly kill jobs which are either in W state, or in Q state but already assigned to a WN.
for job in $(qstat | grep " Q " | cut -d. -f1); do if qstat -f "${job}" | grep -q exec; then qdel -p "${job}"; fi; done
It will kill any job which is in Q state but assigned to a WN.
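If you would rather see what it is going to remove before it does anything, the same loop can be split into small functions with a dry-run switch. This is only a sketch: the function names and the DRY_RUN variable are made up for illustration, and it assumes the default qstat listing, where the job state is the fifth column.

```shell
#!/bin/sh
# Sketch only: function names and DRY_RUN are illustrative, not part of Torque.

# Read default-format "qstat" output on stdin and print the numeric ID
# of every job whose state column (assumed to be field 5) is Q.
queued_job_ids() {
    awk '$5 == "Q" { split($1, id, "."); print id[1] }'
}

# Read "qstat -f" output on stdin; succeed if it contains an exec_host
# attribute, i.e. the job has already been assigned to a worker node.
has_exec_host() {
    grep -q 'exec_host'
}

# For every queued job that has an exec_host, either print the qdel
# command (DRY_RUN=1, the default) or actually purge the job (DRY_RUN=0).
clean_stuck_jobs() {
    for job in $(qstat | queued_job_ids); do
        if qstat -f "$job" | has_exec_host; then
            if [ "${DRY_RUN:-1}" = "1" ]; then
                echo "would run: qdel -p $job"
            else
                qdel -p "$job"
            fi
        fi
    done
}
```

Source the file and run clean_stuck_jobs to get the list; run it again with DRY_RUN=0 once the list looks right. Note that qdel -p purges the job from the server without contacting the WN, so it is worth checking the list first.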
One of the issues we have noticed is that jobs from lower-priority VOs/users sometimes have to wait in the queue long enough for their proxy to expire, and CREAM doesn't handle this situation properly.
Cheers
Kashif
-----Original Message-----
From: Testbed Support for GridPP member institutes [mailto:[log in to unmask]] On Behalf Of John Hill
Sent: 13 June 2012 16:37
To: [log in to unmask]
Subject: Cleaning up the PBS/Torque queues
While investigating the recent supposed CVMFS and analysis job issues at
Cambridge, I came across PBS errors in /var/log/messages on the WNs
which reported copy errors when getting files from the CREAM Sandbox
area. Further digging has identified these as old pilot jobs (some from
August last year!) which are still lurking in the PBS queue and are
being periodically restarted. "showq" indicates that we have about 3500
of these relic jobs. I was wondering whether there was a
recommended way to tidy up the queue?
John