Hi,

Here,

***********************************************************
[root@ce pbs]# checkjob -v 8

checking job 8 (RM job '8.ce.prd.hp.com')

State: Idle
Creds:  user:dteam001  group:dteam  class:dteam  qos:DEFAULT
WallTime: 00:00:00 of 3:00:00:00
SubmitTime: Fri Feb 18 03:20:40
  (Time Queued  Total: 3:09:34:28  Eligible: 00:00:00)

StartDate: -3:09:08:01  Fri Feb 18 03:47:07
Total Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
Exec:  ''  ExecSize: 0  ImageSize: 0
Dedicated Resources Per Task: PROCS: 1
NodeAccess: SHARED
NodeCount: 0

IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 25
PartitionMask: [ALL]
SystemQueueTime: Fri Feb 18 03:48:12

Holds:    Batch  (hold reason:  RMFailure)
Messages:  cannot start job - RM failure, rc: 15070, msg: 'Server could not connect to MOM'
PE:  1.00  StartPriority:  4866
cannot select job 8 for partition DEFAULT (job hold active)

[root@ce pbs]#
********************************************************************

I also tried restarting the MOM service (pbs_mom) but did not work :-(.

Thanks again,

./MS

-----Original Message-----
From: LHC Computer Grid - Rollout on behalf of Steve Traylen
Sent: Mon 2/21/2005 11:48 AM
To: [log in to unmask]
Subject: Re: [LCG-ROLLOUT] Torque stalled jobs

On Mon, Feb 21, 2005 at 04:43:58PM -0000 or thereabouts, Burke, S (Stephen) wrote:
> LHC Computer Grid - Rollout
> > [mailto:[log in to unmask]] On Behalf Of Sotomayor, Maniel
> said:
> > I've been testing our torque installation. However,
> > submitted jobs are kept in the 'Q' state for too much time.
> > Meanwhile, ALL, working nodes report to be on the "free" state.
>
> What do you get from qstat -f one one of the waiting jobs?

I would try

  checkjob 12345

and

  checkjob -v 12345

on the job number as well.

   Steve

>
> Stephen

--
Steve Traylen
[log in to unmask]
http://www.gridpp.ac.uk/
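
For anyone reading the checkjob -v output in this thread, the lines that matter are State, Holds and Messages. Below is a minimal Python sketch that pulls those three fields out of saved checkjob text. The function name parse_checkjob and the SAMPLE string are hypothetical, made up for illustration; the sketch assumes one field per line, as checkjob normally prints it, and it is not part of Maui or Torque.

```python
import re

# Fields of interest in `checkjob -v` output (assumed layout: "Name: value",
# one field per line, as in the output quoted in this thread).
FIELD_PATTERN = re.compile(r"^(State|Holds|Messages):\s*(.*)$")


def parse_checkjob(text):
    """Return a dict with the State, Holds and Messages fields found in text."""
    fields = {}
    for raw_line in text.splitlines():
        match = FIELD_PATTERN.match(raw_line.strip())
        if match:
            # Keep the value as printed, minus surrounding whitespace.
            fields[match.group(1)] = match.group(2).strip()
    return fields


# Hypothetical sample: three lines copied from the output in this thread.
SAMPLE = """\
State: Idle
Holds:    Batch  (hold reason:  RMFailure)
Messages:  cannot start job - RM failure, rc: 15070, msg: 'Server could not connect to MOM'
"""

if __name__ == "__main__":
    for name, value in parse_checkjob(SAMPLE).items():
        print(f"{name}: {value}")
```

On the output above, this surfaces the Batch hold with reason RMFailure and the 'Server could not connect to MOM' message, which is why the job stays Idle even though the nodes report free.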