2008/5/7 IEETA_Grid_initiative <[log in to unmask]>:
> Hi Sergio,
>
> So... any news about the 15001 error?
>
> I think I'm having a similar problem here.
>
> from the WN logs (/var/spool/pbs/mom_logs/) I can see:
>
> 05/07/2008 08:25:00;0080; pbs_mom;Req;req_reject;Reject reply
> code=15001(Unknown Job Id REJHOST=axon-g06.ieeta.pt MSG=modify job
> failed, unknown job 6733.axon-g01.ieeta.pt), aux=0, type=ModifyJob, from
> [log in to unmask]
>
> the odd thing is that of 4 identical jobs I submitted, 3 concluded
> successfully and 1 stays in "BatchHold" for eternity...
>
Sergio,
To start with: how many job slots do you have, and how many jobs do you typically run?
You may need to increase the number of permitted scp connections in sshd_config
on your CE.
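For example, something like this in /etc/ssh/sshd_config on the CE
(MaxStartups is the stock OpenSSH limit on concurrent unauthenticated
connections; 100 is only a guess, tune it to your job throughput):

MaxStartups 100

followed by a "service sshd reload".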
On one of the apparently affected WNs, can you run:
# momctl -d 1
You may be able to clear the stale job with:
# momctl -c <jobid> -h <WN.example.org>
I'm not sure whether this will just delete the job, though; I think it
depends on the retry policy.
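If the MOM really has lost the job, something along these lines may recover
it (the job id and WN here are just the ones from your mom_logs excerpt;
releasehold is the Maui command you use further down, and I'm not certain the
batch hold clears by itself after the purge):

# momctl -c 6733 -h axon-g06.ieeta.pt
# releasehold -a 6733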
http://scotgrid.blogspot.com/2007/04/intervention-at-edinburgh.html
has a possible solution, but it has never been confirmed whether this is even related.
It needs to be updated for new versions of torque as well.
> Like the following ones:
>
> [luis@axon-g01 ~]$ showq
> ACTIVE JOBS--------------------
> JOBNAME USERNAME STATE PROC REMAINING
> STARTTIME
>
> 6736 bio012 Running 1 2:22:44:42 Wed May 7
> 10:20:39
> 6737 bio012 Running 1 2:22:47:43 Wed May 7
> 10:23:40
>
> 2 Active Jobs 2 of 6 Processors Active (33.33%)
> 1 of 3 Nodes Active (33.33%)
>
> IDLE JOBS----------------------
> JOBNAME USERNAME STATE PROC WCLIMIT
> QUEUETIME
>
>
> 0 Idle Jobs
>
> BLOCKED JOBS----------------
> JOBNAME USERNAME STATE PROC WCLIMIT
> QUEUETIME
>
> 6684 dteam004 BatchHold 1 3:00:00:00 Tue May 6
> 17:39:42
> 6687 dteam004 BatchHold 1 3:00:00:00 Tue May 6
> 17:40:46
> 6697 dteam004 BatchHold 1 3:00:00:00 Tue May 6
> 17:47:44
> 6710 opssgm BatchHold 1 3:00:00:00 Tue May 6
> 22:32:06
> 6712 dteam004 BatchHold 1 3:00:00:00 Tue May 6
> 22:58:34
> 6724 opssgm BatchHold 1 3:00:00:00 Wed May 7
> 03:17:38
> 6739 opssgm Deferred 1 3:00:00:00 Wed May 7
> 11:30:00
>
> Total Jobs: 9 Active Jobs: 2 Idle Jobs: 0 Blocked Jobs: 7
>
> For instance:
>
> [root@axon-g01 ~]# checkjob 6710
>
>
> checking job 6710
>
> State: Idle
> Creds: user:opssgm group:ops class:ops qos:DEFAULT
> WallTime: 00:00:16 of 3:00:00:00
> SubmitTime: Tue May 6 22:32:06
> (Time Queued Total: 13:05:59 Eligible: 00:00:00)
>
> StartDate: -12:39:13 Tue May 6 22:58:52
> Total Tasks: 1
>
> Req[0] TaskCount: 1 Partition: ALL
> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
> Opsys: [NONE] Arch: [NONE] Features: [NONE]
> NodeCount: 1
>
>
> IWD: [NONE] Executable: [NONE]
> Bypass: 0 StartCount: 26
> PartitionMask: [ALL]
> Holds: Batch (hold reason: RMFailure)
> Messages: cannot start job - RM failure, rc: 15057, msg: 'Cannot
> execute at specified host because of checkpoint or stagein files'
> PE: 1.00 StartPriority: 1000758
> cannot select job 6710 for partition DEFAULT (job hold active)
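The 15057 above usually means the PBS server still has checkpoint or stagein
file attributes recorded for the job against some other host. Plain qstat
shows what the server actually holds for the job (just filtering the full
listing, nothing exotic):

# qstat -f 6710 | egrep -i 'checkpoint|stage|exec_host'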
>
> ...and...
>
> [root@axon-g01 ~]# diagnose -j 6710
> Name State Par Proc QOS WCLimit R Min User
> Group Account QueuedTime Network Opsys Arch Mem Disk Procs
> Class Features
>
> 6710 Idle ALL 1 DEF 3:00:00:00 0 1 opssgm
> ops - 12:41:36 [NONE] [NONE] [NONE] >=0 >=0 NC0
> [ops:1] [NONE]
> WARNING: job '6710' has failed to start 26 times
>
>
> [root@axon-g01 ~]# releasehold -a ALL; tail
> -f /var/spool/pbs/server_logs/20080507
>
> job holds adjusted
>
> (...)
>
> 05/07/2008 12:00:09;0100;PBS_Server;Req;;Type StatusQueue request
> received from [log in to unmask], sock=9
> 05/07/2008 12:00:09;0100;PBS_Server;Req;;Type StatusJob request received
> from [log in to unmask], sock=9
> 05/07/2008 12:00:09;0100;PBS_Server;Req;;Type ModifyJob request received
> from [log in to unmask], sock=9
> 05/07/2008 12:00:09;0008;PBS_Server;Job;6710.axon-g01.ieeta.pt;Job
> Modified at request of [log in to unmask]
> 05/07/2008 12:00:09;0100;PBS_Server;Req;;Type RunJob request received
> from [log in to unmask], sock=9
> 05/07/2008 12:00:09;0080;PBS_Server;Req;req_reject;Reject reply
> code=15057(Cannot execute at specified host because of checkpoint or
> stagein files), aux=0, type=RunJob, from [log in to unmask]
> 05/07/2008 12:00:09;0100;PBS_Server;Req;;Type ModifyJob request received
> from [log in to unmask], sock=9
> 05/07/2008 12:00:09;0008;PBS_Server;Job;6710.axon-g01.ieeta.pt;Job
> Modified at request of [log in to unmask]
> 05/07/2008 12:00:09;0100;PBS_Server;Req;;Type ModifyJob request received
> from [log in to unmask], sock=9
> 05/07/2008 12:00:09;0008;PBS_Server;Job;6724.axon-g01.ieeta.pt;Job
> Modified at request of [log in to unmask]
> 05/07/2008 12:00:09;0100;PBS_Server;Req;;Type RunJob request received
> from [log in to unmask], sock=9
> 05/07/2008 12:00:09;0080;PBS_Server;Req;req_reject;Reject reply
> code=15057(Cannot execute at specified host because of checkpoint or
> stagein files), aux=0, type=RunJob, from [log in to unmask]
> 05/07/2008 12:00:09;0100;PBS_Server;Req;;Type ModifyJob request received
> from [log in to unmask], sock=9
> 05/07/2008 12:00:09;0008;PBS_Server;Job;6724.axon-g01.ieeta.pt;Job
> Modified at request of [log in to unmask]
> 05/07/2008 12:00:16;0100;PBS_Server;Req;;Type AuthenticateUser request
> received from [log in to unmask], sock=12
> 05/07/2008 12:00:16;0100;PBS_Server;Req;;Type StatusJob request received
> from [log in to unmask], sock=10
> 05/07/2008 12:00:22;0040;PBS_Server;Svr;axon-g01.ieeta.pt;Scheduler sent
> command time
> 05/07/2008 12:00:22;0100;PBS_Server;Req;;Type StatusNode request
> received from [log in to unmask], sock=9
>
>
> (...)
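So every scheduling iteration Maui modifies the job, tries to run it and gets
the same 15057 reject, which matches the StartCount of 26 above. If you want
the wedged jobs gone rather than fixed, a forced purge may be enough (qdel -p
is the torque purge extension; it discards the job even if no MOM knows about
it, so only use it on jobs you have given up on):

# qdel -p 6710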
>
>
> This problem has been driving me crazy for weeks...
>
> - it doesn't matter if I turn the firewall off
> - the CE/WN connection exists
> - I have reinstalled the WN metapackages
>
> Please, any hint would be precious!
>
> P.S. - Sorry for the long mail.
>
> Thanks,
>
> Luís
>
>
>
>
> On Mon, 2008-05-05 at 23:43 +0200, Sergio Maffioletti wrote:
> > Hi Mario
> >
> > could you detail what problem these two nodes are having?
> > we are experiencing a similar problem, except that it is not systematic
> >
> > basically we are observing sporadic
> > "MOM rejected modify request, error: 15001"
> > messages; sometimes the job gets started anyway,
> > other times it fails the stagein operation;
> > the job is then sent back to the server and placed in the Q state, but
> > then Maui does not select it anymore.
> >
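A cheap way to see how widespread the 15001s are is to grep the MOM logs on
each WN (path as in Luís's mail above; the date glob is just an example):

# grep -l 15001 /var/spool/pbs/mom_logs/200805*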
> > we had two periods of time during last weekend when we observed
> > Globus error 94, and we wonder whether the two things are correlated
> > with each other or not
> >
> > Cheers
> > Sergio :)
> >
> > On 05, May 2008 02:57 PM, Mario Kadastik <[log in to unmask]> wrote:
> >
> > >Well, actually we may have figured out the problem. It seems two
> > >worker nodes had problems with stageout, but not something one would
> > >notice immediately out of hand. We have isolated them and now the SAM
> > >tests seem to be running fine (but we'll have to wait a bit longer to
> > >make sure this was indeed the problem). We also ran a separate test
> > >job on one of those worker nodes, and the logging information came back
> > >with exactly the known error, so we do hope we have isolated it now. We
> > >will know in about 24h whether all the SAM tests run through nicely.
> > >
> > >Mario
> > >
> > >On May 5, 2008, at 3:36 PM, <[log in to unmask]> wrote:
> > >
> > >>Hi Mario, Ilja,
> > >>
> > >>>>Anyway, the exact details are available from this ggus ticket:
> > >>>>https://gus.fzk.de/pages/ticket_details.php?ticket=35655
> > >>>>
> > >>>>I have increased the maxproc settings of both "marshal"s as it
> > >>>>seemed to be somehow related to the error ( Globus error 94: the
> > >>>>jobmanager does not accept any new requests (shutting down)), will
> > >>>>see if it helps.
> > >>>>
> > >>>>Any other ideas are still very welcome!
> > >>
> > >>It appears that the failing jobs were in fact successfully submitted
> > >>to Torque. For example, in /opt/edg/var/gatekeeper/grid-
> > >>jobmap_20080505
> > >>(spaces replaced with newlines for clarity):
> > >>
> > >>"localUser=11860"
> > >>"userDN=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=samoper/
> > >>CN=582979/CN=Judit Novak"
> > >>"userFQAN=/ops/Role=lcgadmin/Capability=NULL"
> > >>"userFQAN=/ops/Role=NULL/Capability=NULL"
> > >>"jobID=https://rb113.cern.ch:9000/X3pf3fHmWWZTWr9nY5VvKQ"
> > >>"ceID=oberon.hep.kbfi.ee:2119/jobmanager-lcgpbs-short"
> > >>"lrmsID=42444.oberon.hep.kbfi.ee"
> > >>"timestamp=2008-05-05 10:07:18"
> > >>
> > >>The job may then have been reported in such a way that the lcgpbs job
> > >>manager considered it to have failed. For example, the 'W' state is
> > >>treated like that. In that case you would see a cancellation
> > >>(qdel)
> > >>request in the Torque logs. Can you check what happened to job 42444?
> > >>
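For that check, tracejob on the Torque server is the quickest route: it
collates the server, MOM, scheduler and accounting logs for a single job
(-n is the number of days back to search):

# tracejob -n 3 42444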
> >
> >
> >
> > Cheers
> > Sergio :)
> >
> > ---------------------------------------------
> > Dr. Sergio Maffioletti
> >
> > Grid Group
> > CSCS, Swiss National Supercomputing Centre
> > Via Cantonale
> > CH-6928 Manno
> > Tel: +41916108218
> > Fax: +41916108282
> > email: [log in to unmask]
> > ---------------------------------------------
>
--
Steve Traylen