Hi Sergio,
So... any news about the 15001 error?
I think I'm having a similar problem here.
From the WN logs (/var/spool/pbs/mom_logs/) I can see:
05/07/2008 08:25:00;0080; pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id REJHOST=axon-g06.ieeta.pt MSG=modify job failed, unknown job 6733.axon-g01.ieeta.pt), aux=0, type=ModifyJob, from [log in to unmask]
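To see how often (and with which codes) the MOM is rejecting requests, the `code=NNNNN` fields can be tallied straight out of the mom log. A minimal sketch, using a copy of the reject line above as sample input; on a real WN, point it at /var/spool/pbs/mom_logs/<date> instead:

```shell
# Sketch: count reject codes in a pbs_mom log.
# The here-doc holds the sample line quoted above; replace "$log"
# with the real /var/spool/pbs/mom_logs/<YYYYMMDD> file in practice.
log=$(mktemp)
cat > "$log" <<'EOF'
05/07/2008 08:25:00;0080; pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id REJHOST=axon-g06.ieeta.pt MSG=modify job failed, unknown job 6733.axon-g01.ieeta.pt), aux=0, type=ModifyJob, from [log in to unmask]
EOF
# Pull out every "code=NNNNN" token and count occurrences per code.
codes=$(grep -o 'code=[0-9]*' "$log" | sort | uniq -c)
echo "$codes"
rm -f "$log"
```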
The odd thing is that of 4 identical jobs I submitted, 3 concluded
successfully and 1 stays in "BatchHold" for eternity...
Like the following ones:
[luis@axon-g01 ~]$ showq
ACTIVE JOBS--------------------
JOBNAME    USERNAME      STATE  PROC   REMAINING            STARTTIME

6736         bio012    Running     1  2:22:44:42  Wed May  7 10:20:39
6737         bio012    Running     1  2:22:47:43  Wed May  7 10:23:40

     2 Active Jobs    2 of 6 Processors Active (33.33%)
                      1 of 3 Nodes Active      (33.33%)

IDLE JOBS----------------------
JOBNAME    USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

0 Idle Jobs

BLOCKED JOBS----------------
JOBNAME    USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

6684       dteam004  BatchHold     1  3:00:00:00  Tue May  6 17:39:42
6687       dteam004  BatchHold     1  3:00:00:00  Tue May  6 17:40:46
6697       dteam004  BatchHold     1  3:00:00:00  Tue May  6 17:47:44
6710         opssgm  BatchHold     1  3:00:00:00  Tue May  6 22:32:06
6712       dteam004  BatchHold     1  3:00:00:00  Tue May  6 22:58:34
6724         opssgm  BatchHold     1  3:00:00:00  Wed May  7 03:17:38
6739         opssgm   Deferred     1  3:00:00:00  Wed May  7 11:30:00

Total Jobs: 9   Active Jobs: 2   Idle Jobs: 0   Blocked Jobs: 7
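To feed all of the blocked jobs into checkjob or releasehold in one go, the BatchHold job IDs can be scraped out of showq. A small sketch over a pasted sample of the rows above; in real use, pipe `showq` directly into the same awk:

```shell
# Sketch: extract the IDs of jobs in BatchHold from showq output.
# The here-doc holds sample rows copied from the listing above;
# on the CE one would run:  showq | awk '$3 == "BatchHold" { print $1 }'
held=$(awk '$3 == "BatchHold" { print $1 }' <<'EOF'
6684 dteam004 BatchHold 1 3:00:00:00 Tue May 6 17:39:42
6710 opssgm BatchHold 1 3:00:00:00 Tue May 6 22:32:06
6739 opssgm Deferred 1 3:00:00:00 Wed May 7 11:30:00
EOF
)
echo "$held"
```

The Deferred job is deliberately excluded, since a deferred job will be reconsidered on its own.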
For instance:
[root@axon-g01 ~]# checkjob 6710
checking job 6710
State: Idle
Creds:  user:opssgm  group:ops  class:ops  qos:DEFAULT
WallTime: 00:00:16 of 3:00:00:00
SubmitTime: Tue May  6 22:32:06
  (Time Queued  Total: 13:05:59  Eligible: 00:00:00)
StartDate: -12:39:13  Tue May  6 22:58:52
Total Tasks: 1
Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
NodeCount: 1
IWD: [NONE]  Executable: [NONE]
Bypass: 0  StartCount: 26
PartitionMask: [ALL]
Holds:    Batch  (hold reason:  RMFailure)
Messages:  cannot start job - RM failure, rc: 15057, msg: 'Cannot execute at specified host because of checkpoint or stagein files'
PE:  1.00  StartPriority:  1000758
cannot select job 6710 for partition DEFAULT (job hold active)
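With several held jobs it helps to pull just the hold reason and the RM return code out of each checkjob instead of reading the full output. A sketch against a pasted sample of the output above; in real use the here-doc would be replaced by `checkjob "$id"` inside a loop over the held IDs:

```shell
# Sketch: grab the hold reason and RM rc from checkjob output.
# The two sample lines are copied from the "checkjob 6710" output above.
out=$(cat <<'EOF'
Holds:    Batch  (hold reason:  RMFailure)
Messages:  cannot start job - RM failure, rc: 15057, msg: 'Cannot execute at specified host because of checkpoint or stagein files'
EOF
)
reason=$(printf '%s\n' "$out" | sed -n 's/.*hold reason: *\([A-Za-z]*\).*/\1/p')
rc=$(printf '%s\n' "$out" | sed -n 's/.*rc: *\([0-9]*\).*/\1/p')
echo "job hold: $reason (rc $rc)"
```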
...and...
[root@axon-g01 ~]# diagnose -j 6710
Name   State  Par  Proc  QOS     WCLimit  R  Min  User    Group  Account  QueuedTime  Network  Opsys  Arch    Mem  Disk  Procs  Class    Features
6710   Idle   ALL  1     DEF  3:00:00:00  0  1    opssgm  ops    -        12:41:36    [NONE]   [NONE] [NONE]  >=0  >=0   NC0    [ops:1]  [NONE]
WARNING:  job '6710' has failed to start 26 times
[root@axon-g01 ~]# releasehold -a ALL; tail -f /var/spool/pbs/server_logs/20080507
job holds adjusted
(...)
05/07/2008 12:00:09;0100;PBS_Server;Req;;Type StatusQueue request received from [log in to unmask], sock=9
05/07/2008 12:00:09;0100;PBS_Server;Req;;Type StatusJob request received from [log in to unmask], sock=9
05/07/2008 12:00:09;0100;PBS_Server;Req;;Type ModifyJob request received from [log in to unmask], sock=9
05/07/2008 12:00:09;0008;PBS_Server;Job;6710.axon-g01.ieeta.pt;Job Modified at request of [log in to unmask]
05/07/2008 12:00:09;0100;PBS_Server;Req;;Type RunJob request received from [log in to unmask], sock=9
05/07/2008 12:00:09;0080;PBS_Server;Req;req_reject;Reject reply code=15057(Cannot execute at specified host because of checkpoint or stagein files), aux=0, type=RunJob, from [log in to unmask]
05/07/2008 12:00:09;0100;PBS_Server;Req;;Type ModifyJob request received from [log in to unmask], sock=9
05/07/2008 12:00:09;0008;PBS_Server;Job;6710.axon-g01.ieeta.pt;Job Modified at request of [log in to unmask]
05/07/2008 12:00:09;0100;PBS_Server;Req;;Type ModifyJob request received from [log in to unmask], sock=9
05/07/2008 12:00:09;0008;PBS_Server;Job;6724.axon-g01.ieeta.pt;Job Modified at request of [log in to unmask]
05/07/2008 12:00:09;0100;PBS_Server;Req;;Type RunJob request received from [log in to unmask], sock=9
05/07/2008 12:00:09;0080;PBS_Server;Req;req_reject;Reject reply code=15057(Cannot execute at specified host because of checkpoint or stagein files), aux=0, type=RunJob, from [log in to unmask]
05/07/2008 12:00:09;0100;PBS_Server;Req;;Type ModifyJob request received from [log in to unmask], sock=9
05/07/2008 12:00:09;0008;PBS_Server;Job;6724.axon-g01.ieeta.pt;Job Modified at request of [log in to unmask]
05/07/2008 12:00:16;0100;PBS_Server;Req;;Type AuthenticateUser request received from [log in to unmask], sock=12
05/07/2008 12:00:16;0100;PBS_Server;Req;;Type StatusJob request received from [log in to unmask], sock=10
05/07/2008 12:00:22;0040;PBS_Server;Svr;axon-g01.ieeta.pt;Scheduler sent command time
05/07/2008 12:00:22;0100;PBS_Server;Req;;Type StatusNode request received from [log in to unmask], sock=9
(...)
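In my understanding, error 15057 means pbs_mom still holds files for that job ID under its spool (mom_priv/jobs, or a checkpoint directory), so a RunJob for the "same" job is refused. A sketch of the check one could run on the affected WN, using a throw-away directory as a stand-in for the real spool; on the node the path would be /var/spool/pbs/mom_priv/jobs, and removing leftovers there while pbs_mom is running is at your own risk:

```shell
# Sketch: look for leftover job files that can trigger 15057.
# "spool" here is a throw-away stand-in for /var/spool/pbs/mom_priv/jobs
# on the worker node; the two touched files imitate real leftovers
# (Torque keeps a <jobid>.JB job file and a <jobid>.SC script file).
spool=$(mktemp -d)
touch "$spool/6710.axon-g01.ieeta.pt.JB" "$spool/6710.axon-g01.ieeta.pt.SC"
job=6710
leftovers=$(ls "$spool" | grep "^$job\." || true)
echo "$leftovers"
rm -rf "$spool"
```

If such files show up for a job the server no longer knows about, that would explain why every RunJob attempt on that node is rejected.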
This problem has been driving me crazy for weeks...
- it makes no difference if I turn the firewall off
- the CE/WN connection is fine
- I have reinstalled the WN metapackages
Please, any hint would be precious!
P.S. - Sorry for the long mail.
Thanks,
Luís
On Mon, 2008-05-05 at 23:43 +0200, Sergio Maffioletti wrote:
> Hi Mario
>
> could you detail what problem these two nodes are having ?
> we are experiencing a similar problem, except that it is not systematic
>
> basically we are observing sporadic
> "MOM rejected modify request, error: 15001"
> messages; sometimes the job gets started anyway,
> other times the stage-in operation fails;
> the job is then sent back to the server and placed in Q state, but
> maui does not select it anymore.
>
> we had two periods of time last weekend when we observed
> Globus error 94, and we wonder whether the two things are
> correlated with each other or not
>
> Cheers
> Sergio :)
>
> On 05, May 2008 02:57 PM, Mario Kadastik <[log in to unmask]> wrote:
>
> >Well actually we may have figured out the problem. It seems two
> >workernodes had problems with stageout, but not something one would
> >notice immediately out of hand. We have isolated them and now SAM
> >tests seem to be running fine (but we'll have to wait a bit longer to
> >make sure this was the problem indeed). We also ran a separate test of
> >a job on one of those workernodes and the logging information came back
> >with exactly the known error so we do hope we have isolated it now. We
> >will know in about 24h if all the SAM tests run through nicely.
> >
> >Mario
> >
> >On May 5, 2008, at 3:36 PM, <[log in to unmask]>
> ><[log in to unmask]
> >>wrote:
> >
> >>Hi Mario, Ilja,
> >>
> >>>>Anyway, the exact details are available from this ggus ticket:
> >>>>https://gus.fzk.de/pages/ticket_details.php?ticket=35655
> >>>>
> >>>>I have increased the maxproc settings of both "marshal"s as it
> >>>>seemed to be somehow related to the error ( Globus error 94: the
> >>>>jobmanager does not accept any new requests (shutting down)), will
> >>>>see if it helps.
> >>>>
> >>>>Any other ideas are still very welcome!
> >>
> >>It appears that the failing jobs were in fact successfully submitted
> >>to Torque. For example, in /opt/edg/var/gatekeeper/grid-jobmap_20080505
> >>(spaces replaced with newlines for clarity):
> >>
> >>"localUser=11860"
> >>"userDN=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=samoper/CN=582979/CN=Judit Novak"
> >>"userFQAN=/ops/Role=lcgadmin/Capability=NULL"
> >>"userFQAN=/ops/Role=NULL/Capability=NULL"
> >>"jobID=https://rb113.cern.ch:9000/X3pf3fHmWWZTWr9nY5VvKQ"
> >>"ceID=oberon.hep.kbfi.ee:2119/jobmanager-lcgpbs-short"
> >>"lrmsID=42444.oberon.hep.kbfi.ee"
> >>"timestamp=2008-05-05 10:07:18"
> >>
> >>The job then may have been reported in such a way that the lcgpbs job
> >>manager considered the job as having failed. For example, the 'W' state
> >>is treated like that. In that case you would see a cancellation (qdel)
> >>request in the Torque logs. Can you check what happened to job 42444?
> >>
>
>
>
> Cheers
> Sergio :)
>
> ---------------------------------------------
> Dr. Sergio Maffioletti
>
> Grid Group
> CSCS, Swiss National Supercomputing Centre
> Via Cantonale
> CH-6928 Manno
> Tel: +41916108218
> Fax: +41916108282
> email: [log in to unmask]
> ---------------------------------------------