Hi,

Here is the output of checkjob -v for one of the queued jobs:
***********************************************************
[root@ce pbs]# checkjob -v 8

checking job 8 (RM job '8.ce.prd.hp.com')

State: Idle
Creds:  user:dteam001  group:dteam  class:dteam  qos:DEFAULT
WallTime: 00:00:00 of 3:00:00:00
SubmitTime: Fri Feb 18 03:20:40
  (Time Queued  Total: 3:09:34:28  Eligible: 00:00:00)

StartDate: -3:09:08:01  Fri Feb 18 03:47:07
Total Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
Exec:  ''  ExecSize: 0  ImageSize: 0
Dedicated Resources Per Task: PROCS: 1
NodeAccess: SHARED
NodeCount: 0


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 25
PartitionMask: [ALL]
SystemQueueTime: Fri Feb 18 03:48:12

Holds:    Batch  (hold reason:  RMFailure)
Messages:  cannot start job - RM failure, rc: 15070, msg: 'Server could not connect to MOM'
PE:  1.00  StartPriority:  4866
cannot select job 8 for partition DEFAULT (job hold active)

[root@ce pbs]#                                    
********************************************************************

I also tried restarting the MOM service (pbs_mom), but that did not help :-(.
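
The lines that matter in the output above are State, Holds and Messages. As a minimal sketch, assuming the checkjob -v text layout shown above, the following Python pulls those fields out of saved output. The function name summarize_checkjob and the SAMPLE text are illustrative only; they are not part of Torque or Maui.

```python
import re

# A few lines copied from the checkjob -v output above (illustrative sample).
SAMPLE = """\
checking job 8 (RM job '8.ce.prd.hp.com')

State: Idle
Holds:    Batch  (hold reason:  RMFailure)
Messages:  cannot start job - RM failure, rc: 15070, msg: 'Server could not connect to MOM'
"""


def summarize_checkjob(text):
    """Return the state, hold type, hold reason and messages found in the text.

    Hypothetical helper: it only assumes the 'Key: value' layout seen above.
    """
    summary = {"state": None, "hold": None, "hold_reason": None, "messages": []}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("State:"):
            summary["state"] = line.split(":", 1)[1].strip()
        elif line.startswith("Holds:"):
            rest = line.split(":", 1)[1].strip()
            # e.g. "Batch  (hold reason:  RMFailure)"
            match = re.match(r"(\S+)\s*(?:\(hold reason:\s*(.*?)\))?$", rest)
            if match:
                summary["hold"] = match.group(1)
                summary["hold_reason"] = match.group(2)
        elif line.startswith("Messages:"):
            summary["messages"].append(line.split(":", 1)[1].strip())
    return summary


result = summarize_checkjob(SAMPLE)
print(result)
```

For the output above this reports an Idle job with a Batch hold whose reason is RMFailure, which matches the "Server could not connect to MOM" message.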

Thanks again,
./MS

-----Original Message-----
From: LHC Computer Grid - Rollout on behalf of Steve Traylen
Sent: Mon 2/21/2005 11:48 AM
To: [log in to unmask]
Subject:      Re: [LCG-ROLLOUT] Torque stalled jobs
 
On Mon, Feb 21, 2005 at 04:43:58PM -0000 or thereabouts, Burke, S (Stephen) wrote:
> LHC Computer Grid - Rollout
> > [mailto:[log in to unmask]] On Behalf Of Sotomayor, Maniel
> said:
> >   I've been testing our Torque installation. However,
> > submitted jobs stay in the 'Q' state for a long time,
> > while all the worker nodes report the "free" state.
>
> What do you get from qstat -f on one of the waiting jobs?

I would try

checkjob 12345
and
checkjob -v 12345

on the job number as well.

 Steve
>
> Stephen

--
Steve Traylen
[log in to unmask]
http://www.gridpp.ac.uk/