On Thu, 19 Jan 2006, Steve Thorn wrote:

> Checkjob gives similar output for all blocked jobs I tried:
>
> # checkjob 35589
> checking job 35589
>
> State: Idle
> Creds:  user:lhcb003  group:lhcb  class:lhcb  qos:DEFAULT
> WallTime: 00:00:00 of 3:00:00:00
> SubmitTime: Mon Jan 16 13:28:45
>  (Time Queued  Total: 2:22:18:48  Eligible: 00:00:00)
>
> StartDate: -00:03:03  Thu Jan 19 11:44:30
> Total Tasks: 1
>
> Req[0]  TaskCount: 1  Partition: ALL
> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
> Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
>
>
> IWD: [NONE]  Executable:  [NONE]
> Bypass: 0  StartCount: 0
> PartitionMask: [ALL]
> Holds:    Defer
> Messages:  exceeds available partition procs
> PE:  1.00  StartPriority:  628
> cannot select job 35589 for partition DEFAULT (job hold active)

Clearly something is amiss here - I have sometimes been able to shed some 
light on similar situations with "qstat -f 35589". I have seen cases where 
jobs have been allocated to nodes that are down, offline or awaiting repair.
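
As a rough sketch of that check (not a tested recipe), the two helper 
functions below pull the exec_host value out of "qstat -f" and compare it 
against the nodes that "pbsnodes -l" lists as down, offline or unknown. The 
function names are made up for illustration, and the sketch assumes the 
Torque/OpenPBS client tools are on the PATH and that exec_host fits on a 
single line of the qstat output.

```shell
#!/bin/sh
# Sketch only: assumes qstat and pbsnodes (Torque/OpenPBS) are on PATH.
# The function names are hypothetical helpers, not part of PBS or Maui.

# Print the exec_host value from "qstat -f" for a job, if one is set.
# Assumes the value is not wrapped onto continuation lines.
job_exec_host() {
    qstat -f "$1" | awk -F' = ' '/exec_host/ { print $2 }'
}

# Print every node in the job's exec_host that "pbsnodes -l" reports
# (that listing covers nodes marked down, offline or unknown).
job_bad_nodes() {
    # exec_host looks like node012/0+node013/1: split on "+", drop "/cpu".
    hosts=$(job_exec_host "$1" | tr '+' '\n' | sed 's|/.*||' | sort -u)
    [ -n "$hosts" ] || return 0
    # Match whole node names only, one pattern per line.
    pbsnodes -l | awk '{ print $1 }' | grep -F -x "$hosts"
}
```

Running "job_bad_nodes 35589" prints nothing if the job has no exec_host or 
if none of its nodes are listed by "pbsnodes -l".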

In extremis, the two job files will have to be deleted from 
/var/spool/pbs/server_priv/jobs and the main PBS server restarted.

-- 
 				David Martin

Kelvin Building,
University of Glasgow,
Glasgow, G12 8QQ,
United Kingdom

tel:	(0)141 330 4197		 fax:	(0)141 330 5881
email:	[log in to unmask]