Hi,
thanks for the suggestions.
We did increase the MaxStartups value in the sshd_config on our CE to
400 (this is way too much, but it is just to stay on the safe side)
and, at the same time, reduced MaxAuthTries to 3.
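
For reference, the relevant sshd_config lines now look roughly like
this (a sketch; MaxStartups also accepts a "start:rate:full" triplet
for random early drop, but we kept the plain form):

    MaxStartups 400
    MaxAuthTries 3

(sshd needs a reload for the change to take effect, e.g.
"service sshd reload".)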
As the problem is only sporadic, we will monitor the system with this
new config and see how it behaves over a long period of time.
We will keep you updated on this.
thanks again
Cheers
Sergio :)
On 07, May 2008 01:39 PM, Steve Traylen <[log in to unmask]> wrote:
> 2008/5/7 IEETA_Grid_initiative <[log in to unmask]>:
> > Hi Sergio,
> >
> > So... any news about the 15001 error?
> >
> > I think I'm having a similar problem here.
> >
> > From the WN logs (/var/spool/pbs/mom_logs/) I can see:
> >
> > 05/07/2008 08:25:00;0080; pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id REJHOST=axon-g06.ieeta.pt MSG=modify job failed, unknown job 6733.axon-g01.ieeta.pt), aux=0, type=ModifyJob, from [log in to unmask]
> >
> > The odd thing is that of 4 identical jobs I submitted, 3 concluded
> > successfully and 1 falls into "BatchHold" for eternity...
> >
>
> Sergio,
>
> How many job slots, and typically how many jobs, do you have in the
> first instance?
>
> You may need to increase the number of permitted scp connections in
> sshd_config on your CE.
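>
> (That is the MaxStartups directive, which limits the number of
> concurrent unauthenticated ssh connections sshd will accept.)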
>
> On one of the apparently affected WNs, can you run:
>
> # momctl -d 1
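>
> (If I remember correctly, this prints the MOM's diagnostics,
> including the list of jobs the MOM believes it is running, which you
> can compare against the server's view from qstat.)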
>
> You may be able to clear the stale job with a
>
> # momctl -c <jobid> -h <WN.example.org>
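>
> For the stale job in the mom_logs snippet above, that would
> presumably be:
>
> # momctl -c 6733 -h axon-g06.ieeta.pt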
>
> I'm not sure if this will just delete the job though; I think it
> depends on the retry policy.
>
> http://scotgrid.blogspot.com/2007/04/intervention-at-edinburgh.html
> has a possible solution, but it has never been confirmed whether this
> is even related.
> It would also need updating for newer versions of Torque.
>
> > Like the following ones:
> >
> > [luis@axon-g01 ~]$ showq
> > ACTIVE JOBS--------------------
> > JOBNAME    USERNAME    STATE  PROC   REMAINING            STARTTIME
> >
> > 6736         bio012  Running     1  2:22:44:42  Wed May  7 10:20:39
> > 6737         bio012  Running     1  2:22:47:43  Wed May  7 10:23:40
> >
> >      2 Active Jobs    2 of    6 Processors Active (33.33%)
> >                       1 of    3 Nodes Active      (33.33%)
> >
> > IDLE JOBS----------------------
> > JOBNAME    USERNAME    STATE  PROC     WCLIMIT            QUEUETIME
> >
> > 0 Idle Jobs
> >
> > BLOCKED JOBS----------------
> > JOBNAME    USERNAME    STATE  PROC     WCLIMIT            QUEUETIME
> >
> > 6684       dteam004  BatchHold   1  3:00:00:00  Tue May  6 17:39:42
> > 6687       dteam004  BatchHold   1  3:00:00:00  Tue May  6 17:40:46
> > 6697       dteam004  BatchHold   1  3:00:00:00  Tue May  6 17:47:44
> > 6710         opssgm  BatchHold   1  3:00:00:00  Tue May  6 22:32:06
> > 6712       dteam004  BatchHold   1  3:00:00:00  Tue May  6 22:58:34
> > 6724         opssgm  BatchHold   1  3:00:00:00  Wed May  7 03:17:38
> > 6739         opssgm   Deferred   1  3:00:00:00  Wed May  7 11:30:00
> >
> > Total Jobs: 9   Active Jobs: 2   Idle Jobs: 0   Blocked Jobs: 7
> >
> > For instance:
> >
> > [root@axon-g01 ~]# checkjob 6710
> >
> >
> > checking job 6710
> >
> > State: Idle
> > Creds: user:opssgm group:ops class:ops qos:DEFAULT
> > WallTime: 00:00:16 of 3:00:00:00
> > SubmitTime: Tue May 6 22:32:06
> > (Time Queued Total: 13:05:59 Eligible: 00:00:00)
> >
> > StartDate: -12:39:13 Tue May 6 22:58:52
> > Total Tasks: 1
> >
> > Req[0] TaskCount: 1 Partition: ALL
> > Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
> > Opsys: [NONE] Arch: [NONE] Features: [NONE]
> > NodeCount: 1
> >
> >
> > IWD: [NONE] Executable: [NONE]
> > Bypass: 0 StartCount: 26
> > PartitionMask: [ALL]
> > Holds: Batch (hold reason: RMFailure)
> > Messages: cannot start job - RM failure, rc: 15057, msg: 'Cannot
> > execute at specified host because of checkpoint or stagein files'
> > PE: 1.00 StartPriority: 1000758
> > cannot select job 6710 for partition DEFAULT (job hold active)
> >
> > ...and...
> >
> > [root@axon-g01 ~]# diagnose -j 6710
> > Name  State  Par  Proc  QOS  WCLimit     R  Min  User    Group  Account  QueuedTime  Network  Opsys   Arch    Mem  Disk  Procs  Class    Features
> >
> > 6710  Idle   ALL     1  DEF  3:00:00:00  0    1  opssgm  ops    -        12:41:36    [NONE]   [NONE]  [NONE]  >=0  >=0   NC0    [ops:1]  [NONE]
> > WARNING: job '6710' has failed to start 26 times
> >
> >
> > [root@axon-g01 ~]# releasehold -a ALL; tail -f /var/spool/pbs/server_logs/20080507
> >
> > job holds adjusted
> >
> > (...)
> >
> > 05/07/2008 12:00:09;0100;PBS_Server;Req;;Type StatusQueue request received from [log in to unmask], sock=9
> > 05/07/2008 12:00:09;0100;PBS_Server;Req;;Type StatusJob request received from [log in to unmask], sock=9
> > 05/07/2008 12:00:09;0100;PBS_Server;Req;;Type ModifyJob request received from [log in to unmask], sock=9
> > 05/07/2008 12:00:09;0008;PBS_Server;Job;6710.axon-g01.ieeta.pt;Job Modified at request of [log in to unmask]
> > 05/07/2008 12:00:09;0100;PBS_Server;Req;;Type RunJob request received from [log in to unmask], sock=9
> > 05/07/2008 12:00:09;0080;PBS_Server;Req;req_reject;Reject reply code=15057(Cannot execute at specified host because of checkpoint or stagein files), aux=0, type=RunJob, from [log in to unmask]
> > 05/07/2008 12:00:09;0100;PBS_Server;Req;;Type ModifyJob request received from [log in to unmask], sock=9
> > 05/07/2008 12:00:09;0008;PBS_Server;Job;6710.axon-g01.ieeta.pt;Job Modified at request of [log in to unmask]
> > 05/07/2008 12:00:09;0100;PBS_Server;Req;;Type ModifyJob request received from [log in to unmask], sock=9
> > 05/07/2008 12:00:09;0008;PBS_Server;Job;6724.axon-g01.ieeta.pt;Job Modified at request of [log in to unmask]
> > 05/07/2008 12:00:09;0100;PBS_Server;Req;;Type RunJob request received from [log in to unmask], sock=9
> > 05/07/2008 12:00:09;0080;PBS_Server;Req;req_reject;Reject reply code=15057(Cannot execute at specified host because of checkpoint or stagein files), aux=0, type=RunJob, from [log in to unmask]
> > 05/07/2008 12:00:09;0100;PBS_Server;Req;;Type ModifyJob request received from [log in to unmask], sock=9
> > 05/07/2008 12:00:09;0008;PBS_Server;Job;6724.axon-g01.ieeta.pt;Job Modified at request of [log in to unmask]
> > 05/07/2008 12:00:16;0100;PBS_Server;Req;;Type AuthenticateUser request received from [log in to unmask], sock=12
> > 05/07/2008 12:00:16;0100;PBS_Server;Req;;Type StatusJob request received from [log in to unmask], sock=10
> > 05/07/2008 12:00:22;0040;PBS_Server;Svr;axon-g01.ieeta.pt;Scheduler sent command time
> > 05/07/2008 12:00:22;0100;PBS_Server;Req;;Type StatusNode request received from [log in to unmask], sock=9
> >
> >
> > (...)
> >
> >
> > This problem has been driving me crazy for weeks...
> >
> > - it doesn't matter if I turn the firewall off
> > - the CE/WN connection exists
> > - I have reinstalled the WN metapackages
> >
> > Please, any hint will be precious!
> >
> > P.S. - Sorry for the long mail.
> >
> > Thanks,
> >
> > Luís
> >
> >
> >
> >
> > On Mon, 2008-05-05 at 23:43 +0200, Sergio Maffioletti wrote:
> > > Hi Mario
> > >
> > > could you detail what problem these two nodes are having?
> > > we are experiencing a similar problem, except that it is not
> > > systematic
> > >
> > > basically we are observing sporadic
> > > "MOM rejected modify request, error: 15001"
> > > messages; sometimes the job gets started anyway,
> > > some other times it fails the stage-in operation;
> > > then the job is sent back to the server and is placed in Q state,
> > > but then Maui does not select it anymore.
> > >
> > > we had two periods of time during last weekend when we observed
> > > the Globus error 94, and we wonder whether the two things are
> > > correlated with each other or not
> > >
> > > Cheers
> > > Sergio :)
> > >
> > > On 05, May 2008 02:57 PM, Mario Kadastik <[log in to unmask]>
> > > wrote:
> > >
> > > >Well, actually we may have figured out the problem. It seems two
> > > >worker nodes had problems with stageout, but not something one
> > > >would notice immediately out of hand. We have isolated them, and
> > > >now SAM tests seem to be running fine (but we'll have to wait a
> > > >bit longer to make sure this was indeed the problem). We also ran
> > > >a separate test job on one of those worker nodes, and the logging
> > > >information came back with exactly the known error, so we do hope
> > > >we have isolated it now. We will know in about 24h if all the SAM
> > > >tests run through nicely.
> > > >
> > > >Mario
> > > >
> > > >On May 5, 2008, at 3:36 PM, <[log in to unmask]> wrote:
> > > >
> > > >>Hi Mario, Ilja,
> > > >>
> > > >>>>Anyway, the exact details are available from this ggus ticket:
> > > >>>>https://gus.fzk.de/pages/ticket_details.php?ticket=35655
> > > >>>>
> > > >>>>I have increased the maxproc settings of both "marshal"s, as
> > > >>>>it seemed to be somehow related to the error (Globus error 94:
> > > >>>>the jobmanager does not accept any new requests (shutting
> > > >>>>down)); we'll see if it helps.
> > > >>>>
> > > >>>>Any other ideas are still very welcome!
> > > >>
> > > >>It appears that the failing jobs were in fact successfully
> > > >>submitted to Torque. For example, in
> > > >>/opt/edg/var/gatekeeper/grid-jobmap_20080505
> > > >>(spaces replaced with newlines for clarity):
> > > >>
> > > >>"localUser=11860"
> > > >>"userDN=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=samoper/
> > > >>CN=582979/CN=Judit Novak"
> > > >>"userFQAN=/ops/Role=lcgadmin/Capability=NULL"
> > > >>"userFQAN=/ops/Role=NULL/Capability=NULL"
> > > >>"jobID=https://rb113.cern.ch:9000/X3pf3fHmWWZTWr9nY5VvKQ"
> > > >>"ceID=oberon.hep.kbfi.ee:2119/jobmanager-lcgpbs-short"
> > > >>"lrmsID=42444.oberon.hep.kbfi.ee"
> > > >>"timestamp=2008-05-05 10:07:18"
> > > >>
> > > >>The job then may have been reported in such a way that the
> > > >>lcgpbs job manager considered the job as having failed. For
> > > >>example, the 'W' state is treated that way. In that case you
> > > >>would see a cancellation (qdel) request in the Torque logs. Can
> > > >>you check what happened to job 42444?
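> > > >>
> > > >>For example (a sketch, assuming the server logs live in the
> > > >>same /var/spool/pbs/server_logs/ directory shown earlier in
> > > >>this thread), something like this should show the job's full
> > > >>history, including any qdel:
> > > >>
> > > >># grep 42444 /var/spool/pbs/server_logs/20080505
> > > >>
> > > >>or, if available, "tracejob 42444" on the server.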
> > > >>
> > >
> > >
> > >
> > > Cheers
> > > Sergio :)
> > >
> > > ---------------------------------------------
> > > Dr. Sergio Maffioletti
> > >
> > > Grid Group
> > > CSCS, Swiss National Supercomputing Centre
> > > Via Cantonale
> > > CH-6928 Manno
> > > Tel: +41916108218
> > > Fax: +41916108282
> > > email: [log in to unmask]
> > > ---------------------------------------------
> >
>
>
>
> --
> Steve Traylen
Cheers
Sergio :)
---------------------------------------------
Dr. Sergio Maffioletti
Grid Group
CSCS, Swiss National Supercomputing Centre
Via Cantonale
CH-6928 Manno
Tel: +41916108218
Fax: +41916108282
email: [log in to unmask]
---------------------------------------------