Dear All,
For some reason maui (or is it torque?) on one CE, just before 7am today,
put about 100 jobs into "W" state: assigned to a WN but not running. Maui
(or torque?) is also shifting some W-state jobs around between WNs, still
in "W" state. (Our other torque/maui CE and its WNs are just fine.)
It then changes them back to Q state, still assigned to a WN, and then
back to W state again.
Some WNs have >40 jobs on them: 2 or 3 running, the rest in W state.
Does anyone know how to fix this and force maui/torque to put the jobs back
into a proper queued state so that they start running on the WNs?
Simply stopping and restarting torque and maui on the CE didn't fix anything.
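In case it helps, here is a rough sketch of how the per-state job counts
can be tallied on the CE. It assumes the default qstat column layout, with
the one-letter job state in the fifth field; count_states is only an
illustrative helper name, not a torque command.

```shell
# Tally jobs by state from default "qstat" output read on stdin.
# Assumption: two header lines, then one job per line with the state
# letter (Q, W, R, H, ...) in the fifth whitespace-separated field.
count_states() {
    awk 'NR > 2 && NF >= 6 { n[$5]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# Usage on the CE:
#   qstat | count_states
```

Each output line is a state letter followed by the number of jobs in that
state.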
Versions, on the CE:
root@lcgce03> egrep -i "torque|maui" /var/log/rpmpkgs
glite-TORQUE_server-3.1.7-0.i386.rpm
glite-TORQUE_utils-3.1.10-0.i386.rpm
glite-yaim-torque-server-4.0.1-5.noarch.rpm
glite-yaim-torque-utils-4.0.2-2.noarch.rpm
maui-3.2.6p20-snap.1182974819.8.slc4.i386.rpm
maui-client-3.2.6p20-snap.1182974819.8.slc4.i386.rpm
maui-server-3.2.6p20-snap.1182974819.8.slc4.i386.rpm
torque-2.3.0-snap.200801151629.2cri.slc4.i386.rpm
torque-client-2.3.0-snap.200801151629.2cri.slc4.i386.rpm
torque-server-2.3.0-snap.200801151629.2cri.slc4.i386.rpm
On some WNs:
root@bse03> egrep -i "torque|maui" /var/log/rpmpkgs
glite-TORQUE_client-3.2.1-0.x86_64.rpm
glite-yaim-torque-client-4.0.1-1.noarch.rpm
torque-2.3.0-snap.200801151629.2cri.sl5.x86_64.rpm
torque-client-2.3.0-snap.200801151629.2cri.sl5.x86_64.rpm
torque-mom-2.3.0-snap.200801151629.2cri.sl5.x86_64.rpm
On other WNs (built later):
root@bse04> egrep -i "torque|maui" /var/log/rpmpkgs
glite-TORQUE_client-3.2.1-0.x86_64.rpm
glite-yaim-torque-client-4.0.1-1.noarch.rpm
torque-2.3.6-2cri.el5.x86_64.rpm
torque-client-2.3.6-2cri.el5.x86_64.rpm
torque-mom-2.3.6-2cri.el5.x86_64.rpm
Everything was fine until ca. 7am today, so I doubt the versions are the
problem.
Grepping maui.log for one job gives:
06/16 06:57:48 MJobPReserve(102994,DEFAULT,ResCount,ResCountRej)
06/16 06:57:56 MRMJobStart(102994,Msg,SC)
06/16 06:57:56 MPBSJobStart(102994,base,Msg,SC)
06/16 06:57:56 MPBSJobModify(102994,Resource_List,Resource,bse05.phy.bris.ac.uk)
06/16 06:57:56 MPBSJobModify(102994,Resource_List,Resource,1)
06/16 06:57:56 WARNING: cannot set job '102994.lcgce03.phy.bris.ac.uk' attr
'Resource_List:neednodes' to '1' (rc: 15001 'Unknown Job Id')
06/16 06:57:56 INFO: job '102994' successfully started
06/16 06:58:07 INFO: job '102994' changed states from Running to Hold
06/16 07:35:58 INFO: job '102994' changed states from Hold to Idle
06/16 07:35:58 MJobPReserve(102994,DEFAULT,ResCount,ResCountRej)
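An excerpt like the one above can be pulled out with something along these
lines. This is only a sketch: the default log path is an assumption (it
depends on how maui was installed), and job_history is an illustrative
helper name.

```shell
# Print every log line that mentions one job ID.
# Usage: job_history JOBID [LOGFILE]
job_history() {
    jobid="$1"
    # Assumed default location of maui.log; adjust for your installation.
    logfile="${2:-/var/spool/maui/log/maui.log}"
    # -w matches the ID as a whole word, so 102994 does not also match 1029940.
    grep -w -- "$jobid" "$logfile"
}

# Usage:
#   job_history 102994
```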
I'm scrambling for pbs/maui commands to try to force the jobs back into
queued state, if that is possible.
Grateful for any advice!