Hi Eygene,
From another job in the same situation:
[root@axon-g01 ~]# diagnose -j 5927
Name State Par Proc QOS WCLimit R Min User
Group Account QueuedTime Network Opsys Arch Mem Disk Procs
Class Features
5927 Idle ALL 1 DEF 3:00:00:00 0 1 opssgm
ops - 00:52:11 [NONE] [NONE] [NONE] >=0 >=0 NC0
[ops:1] [NONE]
WARNING: job '5927' has failed to start 26 times
[root@axon-g01 ~]# diagnose -j 5927
Name State Par Proc QOS WCLimit R Min User
Group Account QueuedTime Network Opsys Arch Mem Disk Procs
Class Features
5927 Idle ALL 1 DEF 3:00:00:00 0 1 opssgm
ops - 00:52:33 [NONE] [NONE] [NONE] >=0 >=0 NC0
[ops:1] [NONE]
WARNING: job '5927' has failed to start 26 times
[root@axon-g01 ~]# checkjob 5927
checking job 5927
State: Idle
Creds: user:opssgm group:ops class:ops qos:DEFAULT
WallTime: 00:00:11 of 3:00:00:00
SubmitTime: Tue Apr 22 20:16:33
(Time Queued Total: 1:20:32 Eligible: 00:00:00)
StartDate: -00:53:55 Tue Apr 22 20:43:10
Total Tasks: 1
Req[0] TaskCount: 1 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
NodeCount: 1
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 26
PartitionMask: [ALL]
Holds: Batch (hold reason: RMFailure)
Messages: cannot start job - RM failure, rc: 15057, msg: 'Cannot
execute at specified host because of checkpoint or stagein files'
PE: 1.00 StartPriority: 52
cannot select job 5927 for partition DEFAULT (job hold active)
[root@axon-g01 ~]# tracejob 5927
/var/spool/pbs/mom_logs/20080422: No such file or directory
/var/spool/pbs/sched_logs/20080422: No such file or directory
Job: 5927.axon-g01.ieeta.pt
04/22/2008 20:16:33 S enqueuing into ops, state 1 hop 1
04/22/2008 20:16:33 S Job Queued at request of
[log in to unmask], owner =
[log in to unmask], job name = STDIN,
queue = ops
04/22/2008 20:16:33 A queue=ops
04/22/2008 20:16:34 S Job Modified at request of
[log in to unmask]
04/22/2008 20:16:34 S Job Run at request of [log in to unmask]
04/22/2008 20:16:34 S MOM rejected modify request, error: 15001
04/22/2008 20:16:46 S unable to run job, MOM rejected/rc=2
I have a reservation for ops jobs in maui.cfg:
SRCFG[monitoring] CLASSLIST=ops
SRCFG[monitoring] PERIOD=INFINITY
SRCFG[monitoring] TASKCOUNT=1
SRCFG[monitoring] HOSTLIST=axon-g07.ieeta.pt
SRCFG[monitoring] RESOURCES=PROCS:1
SRCFG[monitoring] STARTTIME=0:00:00 ENDTIME=24:00:00
I understand that the jobs are being rejected by the WN, but I can't
find out why.
From the WN side (/var/spool/pbs/mom_logs/20080422):
04/22/2008 20:16:34;0080; pbs_mom;Req;req_reject;Reject reply
code=15001(Unknown Job Id REJHOST=axon-g07.ieeta.pt MSG=modify job
failed, unknown job 5927.axon-g01.ieeta.pt), aux=0, type=ModifyJob, from
[log in to unmask]
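
To collect these rejections across several log days, a minimal Python sketch could look like the following. It assumes the record layout of the single pbs_mom line quoted above (reject code after `code=`, then `REJHOST=` and `MSG=` inside the parentheses); other TORQUE versions may format the record differently. The function name `parse_reject` is hypothetical.

```python
import re

# Pattern assumed from the one pbs_mom record quoted above; it is not
# taken from TORQUE documentation and may not match other versions.
REJECT_RE = re.compile(
    r"code=(?P<code>\d+)\((?P<reason>.*?)\s+REJHOST=(?P<host>\S+)"
    r"\s+MSG=(?P<msg>.*?)\),\s*aux=(?P<aux>-?\d+),\s*type=(?P<type>\w+)"
)

def parse_reject(record):
    """Return a dict of reject fields, or None if this is not a reject reply."""
    # The record is wrapped over several lines here, so collapse whitespace first.
    flat = " ".join(record.split())
    match = REJECT_RE.search(flat)
    if match is None:
        return None
    fields = match.groupdict()
    fields["code"] = int(fields["code"])
    fields["aux"] = int(fields["aux"])
    return fields

record = """04/22/2008 20:16:34;0080; pbs_mom;Req;req_reject;Reject reply
code=15001(Unknown Job Id REJHOST=axon-g07.ieeta.pt MSG=modify job
failed, unknown job 5927.axon-g01.ieeta.pt), aux=0, type=ModifyJob, from
[log in to unmask]"""

info = parse_reject(record)
print(info["code"], info["host"], info["type"])
# prints: 15001 axon-g07.ieeta.pt ModifyJob
```

For the record above this shows that the rejected request was a ModifyJob for a job ID that the MOM on axon-g07 did not know about.
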
I'm a complete noob with PBS/Maui systems, and I can't find any useful
information about these errors.
Thanks for your reply,
Luís
On Wed, 2008-04-23 at 00:02 +0400, Eygene Ryabinkin wrote:
> Mon, Apr 21, 2008 at 04:40:07PM +0100, IEETA_Grid_initiative wrote:
> > But over time some jobs are being locked in "BatchHold"
> [...]
> > BLOCKED JOBS----------------
> > JOBNAME  USERNAME  STATE      PROC  WCLIMIT     QUEUETIME
> >
> > 5643     opssgm    Idle       1     3:00:00:00  Sat Apr 19 15:17:25
> > 5644     opssgm    Idle       1     3:00:00:00  Sat Apr 19 15:18:23
> > 5647     opssgm    BatchHold  1     3:00:00:00  Sat Apr 19 17:32:03
> > 5652     opssgm    BatchHold  1     3:00:00:00  Sat Apr 19 22:16:16
> > 5657     opssgm    BatchHold  1     3:00:00:00  Sun Apr 20 04:47:59
> > 5658     opssgm    BatchHold  1     3:00:00:00  Sun Apr 20 05:16:58
> > 5659     opssgm    BatchHold  1     3:00:00:00  Sun Apr 20 06:55:26
> > 5663     opssgm    BatchHold  1     3:00:00:00  Sun Apr 20 12:19:06
> > 5666     opssgm    BatchHold  1     3:00:00:00  Sun Apr 20 19:03:02
> > 5669     opssgm    BatchHold  1     3:00:00:00  Sun Apr 20 21:26:07
> > 5673     opssgm    BatchHold  1     3:00:00:00  Mon Apr 21 00:18:05
> > 5675     bio003    BatchHold  1     3:00:00:00  Mon Apr 21 02:55:15
>
> 'diagnose -j 5647' should tell you the reason for the hold. What
> will it be?