Hi,
Here is the checkjob -v output for one of the stuck jobs:
***********************************************************
[root@ce pbs]# checkjob -v 8
checking job 8 (RM job '8.ce.prd.hp.com')
State: Idle
Creds: user:dteam001 group:dteam class:dteam qos:DEFAULT
WallTime: 00:00:00 of 3:00:00:00
SubmitTime: Fri Feb 18 03:20:40
(Time Queued Total: 3:09:34:28 Eligible: 00:00:00)
StartDate: -3:09:08:01 Fri Feb 18 03:47:07
Total Tasks: 1
Req[0] TaskCount: 1 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
Exec: '' ExecSize: 0 ImageSize: 0
Dedicated Resources Per Task: PROCS: 1
NodeAccess: SHARED
NodeCount: 0
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 25
PartitionMask: [ALL]
SystemQueueTime: Fri Feb 18 03:48:12
Holds: Batch (hold reason: RMFailure)
Messages: cannot start job - RM failure, rc: 15070, msg: 'Server could not connect to MOM'
PE: 1.00 StartPriority: 4866
cannot select job 8 for partition DEFAULT (job hold active)
[root@ce pbs]#
********************************************************************
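In case it helps anyone reading the output above, the lines that matter seem to be State, StartCount, Holds and Messages. Below is a rough Python sketch that pulls those fields out of checkjob -v text. The helper name summarize_checkjob and the regular expressions are my own assumptions based only on this one sample, not on any documented output format, so treat it as a starting point.

```python
import re

# Sample trimmed from the checkjob -v output above (only the relevant lines).
SAMPLE = """\
checking job 8 (RM job '8.ce.prd.hp.com')
State: Idle
Bypass: 0 StartCount: 25
Holds: Batch (hold reason: RMFailure)
Messages: cannot start job - RM failure, rc: 15070, msg: 'Server could not connect to MOM'
"""


def summarize_checkjob(text):
    """Extract a few fields from checkjob -v text (hypothetical helper).

    The patterns are guesses fitted to the sample above; a field that is
    not found is reported as None rather than raising an error.
    """
    patterns = {
        "state": r"^State:\s*(\S+)",
        "start_count": r"StartCount:\s*(\d+)",
        "hold_reason": r"hold reason:\s*(\w+)",
        "rc": r"\brc:\s*(\d+)",
        "msg": r"msg:\s*'([^']*)'",
    }
    summary = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, text, re.MULTILINE)
        summary[key] = match.group(1) if match else None
    return summary


if __name__ == "__main__":
    for key, value in summarize_checkjob(SAMPLE).items():
        print(f"{key}: {value}")
```

For the job above this reports an Idle job with a batch hold whose reason is RMFailure, after 25 start attempts, which matches the "Server could not connect to MOM" message.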
I also tried restarting the MOM service (pbs_mom), but that did not work :-(.
Thanks again,
./MS
-----Original Message-----
From: LHC Computer Grid - Rollout on behalf of Steve Traylen
Sent: Mon 2/21/2005 11:48 AM
To: [log in to unmask]
Subject: Re: [LCG-ROLLOUT] Torque stalled jobs
On Mon, Feb 21, 2005 at 04:43:58PM -0000 or thereabouts, Burke, S (Stephen) wrote:
> LHC Computer Grid - Rollout
> > [mailto:[log in to unmask]] On Behalf Of Sotomayor, Maniel
> said:
> > I've been testing our torque installation. However,
> > submitted jobs stay in the 'Q' state for too long.
> > Meanwhile, ALL working nodes report being in the "free" state.
>
> What do you get from qstat -f on one of the waiting jobs?
I would try
checkjob 12345
and
checkjob -v 12345
on the job number as well.
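If you want to collect all three outputs in one go, a small wrapper along these lines might do. It is only a sketch: the function names diagnostic_commands and run_diagnostics are made up for illustration, and it assumes qstat and checkjob are on the PATH of the machine where you run it. It uses nothing beyond the three commands already mentioned in this thread.

```python
import shutil
import subprocess


def diagnostic_commands(job_id):
    """Return the three commands suggested in this thread for one job."""
    job = str(job_id)
    return [
        ["qstat", "-f", job],
        ["checkjob", job],
        ["checkjob", "-v", job],
    ]


def run_diagnostics(job_id):
    """Run each command and print its output (hypothetical helper)."""
    for cmd in diagnostic_commands(job_id):
        line = " ".join(cmd)
        # Skip commands that are not installed instead of failing outright.
        if shutil.which(cmd[0]) is None:
            print(f"skipping '{line}': {cmd[0]} not found on PATH")
            continue
        result = subprocess.run(cmd, capture_output=True, text=True)
        print(f"$ {line}")
        print(result.stdout or result.stderr)


if __name__ == "__main__":
    run_diagnostics(12345)  # placeholder job number, as in the text above
```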
Steve
>
> Stephen
--
Steve Traylen
[log in to unmask]
http://www.gridpp.ac.uk/