On Thu, 19 Jan 2006, Steve Thorn wrote:
> Checkjob gives similar output for all blocked jobs I tried:
>
> # checkjob 35589
> checking job 35589
>
> State: Idle
> Creds: user:lhcb003 group:lhcb class:lhcb qos:DEFAULT
> WallTime: 00:00:00 of 3:00:00:00
> SubmitTime: Mon Jan 16 13:28:45
> (Time Queued Total: 2:22:18:48 Eligible: 00:00:00)
>
> StartDate: -00:03:03 Thu Jan 19 11:44:30
> Total Tasks: 1
>
> Req[0] TaskCount: 1 Partition: ALL
> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
> Opsys: [NONE] Arch: [NONE] Features: [NONE]
>
>
> IWD: [NONE] Executable: [NONE]
> Bypass: 0 StartCount: 0
> PartitionMask: [ALL]
> Holds: Defer
> Messages: exceeds available partition procs
> PE: 1.00 StartPriority: 628
> cannot select job 35589 for partition DEFAULT (job hold active)
Clearly something is amiss here. I have sometimes been able to shed some
light on similar situations with "qstat -f 35589": I have seen cases where
jobs had been allocated to nodes that were down, offline or awaiting repair.
In extremis, the two job files will have to be deleted from
/var/spool/pbs/server_priv/jobs and the main PBS server restarted.
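
As a rough sketch of the checks above, something along these lines may help.
The two function names are hypothetical helpers of my own, not Maui or PBS
commands. The first pulls the hold and message lines out of checkjob output,
and the second only lists a job's spool files so they can be inspected before
anything is removed. The default directory is the one mentioned above, and I
am assuming the spool file names begin with the job id followed by a dot.

```shell
#!/bin/sh
# Sketch only: job_block_reason and job_spool_files are hypothetical helpers.

# Print the "Holds:" and "Messages:" lines from checkjob output on stdin,
# stripping a leading "> " quote prefix if the text was pasted from mail.
job_block_reason() {
    sed -e 's/^> *//' | grep -E '^(Holds|Messages):'
}

# List the spool files for a job id in a jobs directory (listing only,
# nothing is deleted). Assumes file names start with "<jobid>.".
job_spool_files() {
    jobid=$1
    dir=${2:-/var/spool/pbs/server_priv/jobs}
    for f in "$dir/$jobid".*; do
        if [ -e "$f" ]; then
            echo "$f"
        fi
    done
}

# Possible use on the server (assumes checkjob and qstat are on the PATH):
#   checkjob 35589 | job_block_reason
#   qstat -f 35589 | grep -E 'job_state|exec_host|comment'
#   job_spool_files 35589
```

If exec_host in the qstat output names a node that is down or offline, that
would match the cases described above. Deleting the listed files and
restarting the server remains the last resort.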
--
David Martin
Kelvin Building,
University of Glasgow,
Glasgow, G12 8QQ,
United Kingdom
tel: (0)141 330 4197 fax: (0)141 330 5881