Hello guys,
I'm trying to delete an old job from OpenPBS. It has been running
for a long time and seems to be stalled. Thousands of jobs are queued
but not starting, apparently because OpenPBS is trying to drain the system.
When I try to delete the job, qdel reports something like:
[root@ce root]# qdel 14239
qdel: Server could not connect to MOM 14239
[root@ce root]#
All other jobs show a status of 'Q', with a message like the one below:
[root@ce root]# qstat -nsR
ce.prd.hp.com:
                                          Req'd  Req'd   Elap
Job ID          Username Queue    NDS TSK Memory Time  S Time  BIG   FAST  PFS
--------------- -------- -------- --- --- ------ ----- - ----- ----- ----- -----
14239.ce.prd.hp dteam001 short      1  --     -- 00:15 R    --    --    --    --
   bh-wn27
   Job started on Thu Nov 25 at 00:40
20123.ce.prd.hp dteam001 short      1  --     -- 00:15 Q    --    --    --    --
   bl-wn17
   Not Running: Draining system to allow starving job to run
I left job 14239 in the queue because it still seemed to be running, but now
I'm not so sure.
I have thousands of jobs queued and waiting to be executed, and most of my
nodes are marked as "free" or "busy".
Any hints, please?