Hi

thanks for the suggestions

we increased the MaxStartups value in the sshd_config on our CE to
400 (this is way too much, but it keeps us well within the safe zone)
and, at the same time, reduced MaxAuthTries to 3
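
for reference, the change amounts to these two directives in the CE's
sshd_config (typically /etc/ssh/sshd_config; just a sketch of what we set,
and newer OpenSSH versions also accept a start:rate:full triple for MaxStartups):

    # allow up to 400 concurrent unauthenticated ssh/scp connections
    MaxStartups 400
    # allow at most 3 authentication attempts per connection
    MaxAuthTries 3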

as the problem is only sporadic, we will monitor the system with this
new config and see how it behaves over a longer period of time

we will keep you updated on this

thanks again

Cheers
Sergio :)

On 07, May 2008 01:39 PM, Steve Traylen <[log in to unmask]> wrote:

> 2008/5/7 IEETA_Grid_initiative <[log in to unmask]>:
> > Hi Sergio,
> >
> >  So... any news about the 15001 error?
> >
> >  I think I'm having a similar problem here.
> >
> >  from the WN logs (/var/spool/pbs/mom_logs/) I can see:
> >
> >  05/07/2008 08:25:00;0080;   pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id REJHOST=axon-g06.ieeta.pt MSG=modify job failed, unknown job 6733.axon-g01.ieeta.pt), aux=0, type=ModifyJob, from [log in to unmask]
> >
> >  the odd thing is that, of 4 identical jobs submitted by me, 3 concluded
> >  successfully and 1 stays in "BatchHold" for eternity...
> >
> 
> Sergio,
> 
> To start with, how many job slots do you have, and typically how many jobs?
> 
> You may need to increase the number of permitted scp connections in
> sshd_config
> on your CE.
> 
> On one of the apparently affected WNs, can you run:
> 
> # momctl -d 1
> 
> You may be able to clear the stale job with:
> 
> # momctl -c <jobid> -h <WN.example.org>
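> 
> (for instance, for the stale job from the mom_logs snippet above, that would
> be something along the lines of
> 
> # momctl -c 6733.axon-g01.ieeta.pt -h axon-g06.ieeta.pt
> 
> though the exact jobid form momctl accepts may vary between torque versions)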
> 
> I'm not sure if this will just delete the job, though; I think it
> depends on the retry policy.
> 
> http://scotgrid.blogspot.com/2007/04/intervention-at-edinburgh.html
> has a possible solution, but it has never been confirmed whether this is
> even related.
> It needs to be updated for newer versions of torque as well.
> 
> >  Like the following ones:
> >
> >  [luis@axon-g01 ~]$ showq
> >  ACTIVE JOBS--------------------
> >  JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME
> >
> >  6736               bio012      Running     1  2:22:44:42   Wed May  7 10:20:39
> >  6737               bio012      Running     1  2:22:47:43   Wed May  7 10:23:40
> >
> >      2 Active Jobs       2 of    6 Processors Active (33.33%)
> >                          1 of    3 Nodes Active      (33.33%)
> >
> >  IDLE JOBS----------------------
> >  JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME
> >
> >
> >  0 Idle Jobs
> >
> >  BLOCKED JOBS----------------
> >  JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME
> >
> >  6684               dteam004  BatchHold     1  3:00:00:00   Tue May  6 17:39:42
> >  6687               dteam004  BatchHold     1  3:00:00:00   Tue May  6 17:40:46
> >  6697               dteam004  BatchHold     1  3:00:00:00   Tue May  6 17:47:44
> >  6710               opssgm    BatchHold     1  3:00:00:00   Tue May  6 22:32:06
> >  6712               dteam004  BatchHold     1  3:00:00:00   Tue May  6 22:58:34
> >  6724               opssgm    BatchHold     1  3:00:00:00   Wed May  7 03:17:38
> >  6739               opssgm     Deferred     1  3:00:00:00   Wed May  7 11:30:00
> >
> >  Total Jobs: 9   Active Jobs: 2   Idle Jobs: 0   Blocked Jobs: 7
> >
> >  For instance:
> >
> >  [root@axon-g01 ~]# checkjob 6710
> >
> >
> >  checking job 6710
> >
> >  State: Idle
> >  Creds:  user:opssgm  group:ops  class:ops  qos:DEFAULT
> >  WallTime: 00:00:16 of 3:00:00:00
> >  SubmitTime: Tue May  6 22:32:06
> >   (Time Queued  Total: 13:05:59  Eligible: 00:00:00)
> >
> >  StartDate: -12:39:13  Tue May  6 22:58:52
> >  Total Tasks: 1
> >
> >  Req[0]  TaskCount: 1  Partition: ALL
> >  Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
> >  Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
> >  NodeCount: 1
> >
> >
> >  IWD: [NONE]  Executable:  [NONE]
> >  Bypass: 0  StartCount: 26
> >  PartitionMask: [ALL]
> >  Holds:    Batch  (hold reason:  RMFailure)
> >  Messages:  cannot start job - RM failure, rc: 15057, msg: 'Cannot execute at specified host because of checkpoint or stagein files'
> >  PE:  1.00  StartPriority:  1000758
> >  cannot select job 6710 for partition DEFAULT (job hold active)
> >
> >  ...and...
> >
> >  [root@axon-g01 ~]# diagnose -j 6710
> >  Name State Par Proc QOS WCLimit R Min User Group Account QueuedTime Network Opsys Arch Mem Disk Procs Class Features
> >
> >  6710 Idle ALL 1 DEF 3:00:00:00 0 1 opssgm ops - 12:41:36 [NONE] [NONE] [NONE] >=0 >=0 NC0 [ops:1] [NONE]
> >  WARNING:  job '6710' has failed to start 26 times
> >
> >
> >  [root@axon-g01 ~]# releasehold -a ALL; tail -f /var/spool/pbs/server_logs/20080507
> >
> >  job holds adjusted
> >
> >  (...)
> >
> >  05/07/2008 12:00:09;0100;PBS_Server;Req;;Type StatusQueue request received from [log in to unmask], sock=9
> >  05/07/2008 12:00:09;0100;PBS_Server;Req;;Type StatusJob request received from [log in to unmask], sock=9
> >  05/07/2008 12:00:09;0100;PBS_Server;Req;;Type ModifyJob request received from [log in to unmask], sock=9
> >  05/07/2008 12:00:09;0008;PBS_Server;Job;6710.axon-g01.ieeta.pt;Job Modified at request of [log in to unmask]
> >  05/07/2008 12:00:09;0100;PBS_Server;Req;;Type RunJob request received from [log in to unmask], sock=9
> >  05/07/2008 12:00:09;0080;PBS_Server;Req;req_reject;Reject reply code=15057(Cannot execute at specified host because of checkpoint or stagein files), aux=0, type=RunJob, from [log in to unmask]
> >  05/07/2008 12:00:09;0100;PBS_Server;Req;;Type ModifyJob request received from [log in to unmask], sock=9
> >  05/07/2008 12:00:09;0008;PBS_Server;Job;6710.axon-g01.ieeta.pt;Job Modified at request of [log in to unmask]
> >  05/07/2008 12:00:09;0100;PBS_Server;Req;;Type ModifyJob request received from [log in to unmask], sock=9
> >  05/07/2008 12:00:09;0008;PBS_Server;Job;6724.axon-g01.ieeta.pt;Job Modified at request of [log in to unmask]
> >  05/07/2008 12:00:09;0100;PBS_Server;Req;;Type RunJob request received from [log in to unmask], sock=9
> >  05/07/2008 12:00:09;0080;PBS_Server;Req;req_reject;Reject reply code=15057(Cannot execute at specified host because of checkpoint or stagein files), aux=0, type=RunJob, from [log in to unmask]
> >  05/07/2008 12:00:09;0100;PBS_Server;Req;;Type ModifyJob request received from [log in to unmask], sock=9
> >  05/07/2008 12:00:09;0008;PBS_Server;Job;6724.axon-g01.ieeta.pt;Job Modified at request of [log in to unmask]
> >  05/07/2008 12:00:16;0100;PBS_Server;Req;;Type AuthenticateUser request received from [log in to unmask], sock=12
> >  05/07/2008 12:00:16;0100;PBS_Server;Req;;Type StatusJob request received from [log in to unmask], sock=10
> >  05/07/2008 12:00:22;0040;PBS_Server;Svr;axon-g01.ieeta.pt;Scheduler sent command time
> >  05/07/2008 12:00:22;0100;PBS_Server;Req;;Type StatusNode request received from [log in to unmask], sock=9
> >
> >
> >  (...)
> >
> >
> >  This problem has been driving me crazy for weeks...
> >
> >  - it doesn't matter if I turn the firewall off
> >  - the CE/WN connection is fine
> >  - I have reinstalled the WN metapackages
> >
> >  Please, any hint will be precious!
> >
> >  P.S. - Sorry for the long mail.
> >
> >  Thanks,
> >
> >  Luís
> >
> >
> >
> >
> >  On Mon, 2008-05-05 at 23:43 +0200, Sergio Maffioletti wrote:
> >  > Hi Mario
> >  >
> >  > could you detail what problem these two nodes are having?
> > > we are experiencing a similar problem, except that it is not
> >  > systematic
> >  >
> >  > basically we are observing sporadic
> >  > "MOM rejected modify request, error: 15001"
> >  > messages; sometimes the job gets started anyway,
> >  > other times it fails the stage-in operation;
> >  > the job is then sent back to the server and placed in Q state, but
> >  > then maui does not select it anymore.
> >  >
> >  > we had two periods of time during last weekend when we observed the
> >  > Globus error 94, and we wonder whether the two things are correlated
> >  > with each other or not
> >  >
> >  > Cheers
> >  > Sergio :)
> >  >
> > > On 05, May 2008 02:57 PM, Mario Kadastik <[log in to unmask]>
> >  > wrote:
> >  >
> >  > >Well, actually we may have figured out the problem. It seems two
> >  > >worker nodes had problems with stageout, but not something one would
> >  > >notice immediately out of hand. We have isolated them and now SAM
> >  > >tests seem to be running fine (but we'll have to wait a bit longer to
> >  > >make sure this was indeed the problem). We also ran a separate test of
> >  > >a job on one of those worker nodes, and the logging information came
> >  > >back with exactly the known error, so we do hope we have isolated it
> >  > >now. We will know in about 24h whether all the SAM tests run through nicely.
> >  > >
> >  > >Mario
> >  > >
> >  > >On May 5, 2008, at 3:36 PM, <[log in to unmask]> <[log in to unmask]> wrote:
> >  > >
> >  > >>Hi Mario, Ilja,
> >  > >>
> > > >>>>Anyway, the exact details are available from this ggus ticket:
> >  > >>>>https://gus.fzk.de/pages/ticket_details.php?ticket=35655
> >  > >>>>
> >  > >>>>I have increased the maxproc settings of both "marshal"s, as it
> >  > >>>>seemed to be somehow related to the error (Globus error 94: the
> >  > >>>>jobmanager does not accept any new requests (shutting down));
> >  > >>>>I'll see if it helps.
> >  > >>>>
> >  > >>>>Any other ideas are still very welcome!
> >  > >>
> > > >>It appears that the failing jobs were in fact successfully
> >  > >>submitted
> >  > >>to Torque. For example, in /opt/edg/var/gatekeeper/grid-jobmap_20080505
> >  > >>(spaces replaced with newlines for clarity):
> >  > >>
> >  > >>"localUser=11860"
> >  > >>"userDN=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=samoper/CN=582979/CN=Judit Novak"
> >  > >>"userFQAN=/ops/Role=lcgadmin/Capability=NULL"
> >  > >>"userFQAN=/ops/Role=NULL/Capability=NULL"
> >  > >>"jobID=https://rb113.cern.ch:9000/X3pf3fHmWWZTWr9nY5VvKQ"
> >  > >>"ceID=oberon.hep.kbfi.ee:2119/jobmanager-lcgpbs-short"
> >  > >>"lrmsID=42444.oberon.hep.kbfi.ee"
> >  > >>"timestamp=2008-05-05 10:07:18"
> >  > >>
> >  > >>The job then may have been reported in such a way that the lcgpbs job
> >  > >>manager considered the job as having failed. For example, the 'W' state
> >  > >>is treated like that. In that case you would see a cancellation (qdel)
> >  > >>request in the Torque logs. Can you check what happened to job 42444?
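> >  > >>
> >  > >>(as a rough illustration, assuming the log locations already shown in
> >  > >>this thread, something like
> >  > >>
> >  > >>  grep 42444 /var/spool/pbs/server_logs/20080505
> >  > >>
> >  > >>or "tracejob 42444" on the CE should show whether a qdel arrived for it)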
> >  > >>
> >  >
> >  >
> >  >
> >  > Cheers
> >  > Sergio :)
> >  >
> >  > ---------------------------------------------
> >  >   Dr. Sergio Maffioletti
> >  >
> >  >   Grid Group
> >  >   CSCS, Swiss National Supercomputing Centre
> >  >   Via Cantonale
> >  >   CH-6928 Manno
> >  >   Tel: +41916108218
> >  >   Fax: +41916108282
> >  >   email: [log in to unmask]
> >  > ---------------------------------------------
> >
> 
> 
> 
> -- 
> Steve Traylen



Cheers
Sergio :)

---------------------------------------------
  Dr. Sergio Maffioletti
 
  Grid Group
  CSCS, Swiss National Supercomputing Centre
  Via Cantonale
  CH-6928 Manno
  Tel: +41916108218
  Fax: +41916108282
  email: [log in to unmask]
---------------------------------------------