2008/5/7 IEETA_Grid_initiative <[log in to unmask]>:
> Hi Sergio,
>
> So... any news about the 15001 error?
>
> I think I'm having a similar problem here.
>
> from the WN logs (/var/spool/pbs/mom_logs/) I can see:
>
> 05/07/2008 08:25:00;0080; pbs_mom;Req;req_reject;Reject reply
> code=15001(Unknown Job Id REJHOST=axon-g06.ieeta.pt MSG=modify job
> failed, unknown job 6733.axon-g01.ieeta.pt), aux=0, type=ModifyJob, from
> [log in to unmask]
>
> the odd thing is that of 4 identical jobs I submitted, 3 concluded
> successfully and 1 stays in "BatchHold" for eternity...
>
Sergio,
To start with: how many job slots do you have, and how many jobs do you typically run?
You may need to increase the number of permitted scp connections in sshd_config
on your CE.
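For example, something like this in /etc/ssh/sshd_config on the CE
(MaxStartups is the stock OpenSSH limit on concurrent unauthenticated
connections; 100 is only a guess, tune it to your job throughput):

MaxStartups 100

followed by a "service sshd reload".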
On one of the apparently affected WNs, can you run:
# momctl -d 1
You may be able to clear the stale job with:
# momctl -c <jobid> -h <WN.example.org>
I'm not sure whether this will just delete the job, though; I think it
depends on the retry policy.
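If the MOM really has lost the job, something along these lines may recover
it (the job id and WN here are just the ones from your mom_logs excerpt;
releasehold is the Maui command you use further down, and I'm not certain the
batch hold clears by itself after the purge):

# momctl -c 6733 -h axon-g06.ieeta.pt
# releasehold -a 6733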
http://scotgrid.blogspot.com/2007/04/intervention-at-edinburgh.html
has a possible solution, but it has never been confirmed whether this is even related.
It needs to be updated for new versions of torque as well.
> Like the following ones:
>
> [luis@axon-g01 ~]$ showq
> ACTIVE JOBS--------------------
> JOBNAME USERNAME STATE PROC REMAINING
> STARTTIME
>
> 6736 bio012 Running 1 2:22:44:42 Wed May 7
> 10:20:39
> 6737 bio012 Running 1 2:22:47:43 Wed May 7
> 10:23:40
>
> 2 Active Jobs 2 of 6 Processors Active (33.33%)
> 1 of 3 Nodes Active (33.33%)
>
> IDLE JOBS----------------------
> JOBNAME USERNAME STATE PROC WCLIMIT
> QUEUETIME
>
>
> 0 Idle Jobs
>
> BLOCKED JOBS----------------
> JOBNAME USERNAME STATE PROC WCLIMIT
> QUEUETIME
>
> 6684 dteam004 BatchHold 1 3:00:00:00 Tue May 6
> 17:39:42
> 6687 dteam004 BatchHold 1 3:00:00:00 Tue May 6
> 17:40:46
> 6697 dteam004 BatchHold 1 3:00:00:00 Tue May 6
> 17:47:44
> 6710 opssgm BatchHold 1 3:00:00:00 Tue May 6
> 22:32:06
> 6712 dteam004 BatchHold 1 3:00:00:00 Tue May 6
> 22:58:34
> 6724 opssgm BatchHold 1 3:00:00:00 Wed May 7
> 03:17:38
> 6739 opssgm Deferred 1 3:00:00:00 Wed May 7
> 11:30:00
>
> Total Jobs: 9 Active Jobs: 2 Idle Jobs: 0 Blocked Jobs: 7
>
> For instance:
>
> [root@axon-g01 ~]# checkjob 6710
>
>
> checking job 6710
>
> State: Idle
> Creds: user:opssgm group:ops class:ops qos:DEFAULT
> WallTime: 00:00:16 of 3:00:00:00
> SubmitTime: Tue May 6 22:32:06
> (Time Queued Total: 13:05:59 Eligible: 00:00:00)
>
> StartDate: -12:39:13 Tue May 6 22:58:52
> Total Tasks: 1
>
> Req[0] TaskCount: 1 Partition: ALL
> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
> Opsys: [NONE] Arch: [NONE] Features: [NONE]
> NodeCount: 1
>
>
> IWD: [NONE] Executable: [NONE]
> Bypass: 0 StartCount: 26
> PartitionMask: [ALL]
> Holds: Batch (hold reason: RMFailure)
> Messages: cannot start job - RM failure, rc: 15057, msg: 'Cannot
> execute at specified host because of checkpoint or stagein files'
> PE: 1.00 StartPriority: 1000758
> cannot select job 6710 for partition DEFAULT (job hold active)
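The 15057 above usually means the PBS server still has checkpoint or stagein
file attributes recorded for the job against some other host. Plain qstat
shows what the server actually holds for the job (just filtering the full
listing, nothing exotic):

# qstat -f 6710 | egrep -i 'checkpoint|stage|exec_host'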
>
> ...and...
>
> [root@axon-g01 ~]# diagnose -j 6710
> Name State Par Proc QOS WCLimit R Min User
> Group Account QueuedTime Network Opsys Arch Mem Disk Procs
> Class Features
>
> 6710 Idle ALL 1 DEF 3:00:00:00 0 1 opssgm
> ops - 12:41:36 [NONE] [NONE] [NONE] >=0 >=0 NC0
> [ops:1] [NONE]
> WARNING: job '6710' has failed to start 26 times
>
>
> [root@axon-g01 ~]# releasehold -a ALL; tail
> -f /var/spool/pbs/server_logs/20080507
>
> job holds adjusted
>
> (...)
>
> 05/07/2008 12:00:09;0100;PBS_Server;Req;;Type StatusQueue request
> received from [log in to unmask], sock=9
> 05/07/2008 12:00:09;0100;PBS_Server;Req;;Type StatusJob request received
> from [log in to unmask], sock=9
> 05/07/2008 12:00:09;0100;PBS_Server;Req;;Type ModifyJob request received
> from [log in to unmask], sock=9
> 05/07/2008 12:00:09;0008;PBS_Server;Job;6710.axon-g01.ieeta.pt;Job
> Modified at request of [log in to unmask]
> 05/07/2008 12:00:09;0100;PBS_Server;Req;;Type RunJob request received
> from [log in to unmask], sock=9
> 05/07/2008 12:00:09;0080;PBS_Server;Req;req_reject;Reject reply
> code=15057(Cannot execute at specified host because of checkpoint or
> stagein files), aux=0, type=RunJob, from [log in to unmask]
> 05/07/2008 12:00:09;0100;PBS_Server;Req;;Type ModifyJob request received
> from [log in to unmask], sock=9
> 05/07/2008 12:00:09;0008;PBS_Server;Job;6710.axon-g01.ieeta.pt;Job
> Modified at request of [log in to unmask]
> 05/07/2008 12:00:09;0100;PBS_Server;Req;;Type ModifyJob request received
> from [log in to unmask], sock=9
> 05/07/2008 12:00:09;0008;PBS_Server;Job;6724.axon-g01.ieeta.pt;Job
> Modified at request of [log in to unmask]
> 05/07/2008 12:00:09;0100;PBS_Server;Req;;Type RunJob request received
> from [log in to unmask], sock=9
> 05/07/2008 12:00:09;0080;PBS_Server;Req;req_reject;Reject reply
> code=15057(Cannot execute at specified host because of checkpoint or
> stagein files), aux=0, type=RunJob, from [log in to unmask]
> 05/07/2008 12:00:09;0100;PBS_Server;Req;;Type ModifyJob request received
> from [log in to unmask], sock=9
> 05/07/2008 12:00:09;0008;PBS_Server;Job;6724.axon-g01.ieeta.pt;Job
> Modified at request of [log in to unmask]
> 05/07/2008 12:00:16;0100;PBS_Server;Req;;Type AuthenticateUser request
> received from [log in to unmask], sock=12
> 05/07/2008 12:00:16;0100;PBS_Server;Req;;Type StatusJob request received
> from [log in to unmask], sock=10
> 05/07/2008 12:00:22;0040;PBS_Server;Svr;axon-g01.ieeta.pt;Scheduler sent
> command time
> 05/07/2008 12:00:22;0100;PBS_Server;Req;;Type StatusNode request
> received from [log in to unmask], sock=9
>
>
> (...)
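So every scheduling iteration Maui modifies the job, tries to run it and gets
the same 15057 reject, which matches the StartCount of 26 above. If you want
the wedged jobs gone rather than fixed, a forced purge may be enough (qdel -p
is the torque purge extension; it discards the job even if no MOM knows about
it, so only use it on jobs you have given up on):

# qdel -p 6710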
>
>
> This problem has been driving me crazy for weeks...
>
> - it doesn't matter if I turn the firewall off
> - the CE/WN connection exists
> - I have reinstalled the WN metapackages
>
> Please, any hint would be precious!
>
> P.S. - Sorry for the long mail.
>
> Thanks,
>
> Luís
>
>
>
>
> On Mon, 2008-05-05 at 23:43 +0200, Sergio Maffioletti wrote:
> > Hi Mario
> >
> > could you detail what problem these two nodes are having?
> > we are experiencing a similar problem, except that it is not systematic
> >
> > basically we are observing sporadic
> > "MOM rejected modify request, error: 15001"
> > messages; sometimes the job gets started anyway,
> > other times it fails the stagein operation;
> > the job is then sent back to the server and placed in the Q state, but
> > then Maui does not select it anymore.
> >
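A cheap way to see how widespread the 15001s are is to grep the MOM logs on
each WN (path as in Luís's mail above; the date glob is just an example):

# grep -l 15001 /var/spool/pbs/mom_logs/200805*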
> > we had two periods of time during last weekend when we observed
> > Globus error 94, and we wonder whether the two things are correlated
> > with each other or not
> >
> > Cheers
> > Sergio :)
> >
> > On 05, May 2008 02:57 PM, Mario Kadastik <[log in to unmask]> wrote:
> >
> > >Well, actually we may have figured out the problem. It seems two
> > >worker nodes had problems with stageout, but not something one would
> > >notice immediately out of hand. We have isolated them and now the SAM
> > >tests seem to be running fine (but we'll have to wait a bit longer to
> > >make sure this was indeed the problem). We also ran a separate test
> > >job on one of those worker nodes, and the logging information came back
> > >with exactly the known error, so we do hope we have isolated it now. We
> > >will know in about 24h whether all the SAM tests run through nicely.
> > >
> > >Mario
> > >
> > >On May 5, 2008, at 3:36 PM, <[log in to unmask]> wrote:
> > >
> > >>Hi Mario, Ilja,
> > >>
> > >>>>Anyway, the exact details are available from this ggus ticket:
> > >>>>https://gus.fzk.de/pages/ticket_details.php?ticket=35655
> > >>>>
> > >>>>I have increased the maxproc settings of both "marshal"s as it
> > >>>>seemed to be somehow related to the error ( Globus error 94: the
> > >>>>jobmanager does not accept any new requests (shutting down)), will
> > >>>>see if it helps.
> > >>>>
> > >>>>Any other ideas are still very welcome!
> > >>
> > >>It appears that the failing jobs were in fact successfully submitted
> > >>to Torque. For example, in /opt/edg/var/gatekeeper/grid-
> > >>jobmap_20080505
> > >>(spaces replaced with newlines for clarity):
> > >>
> > >>"localUser=11860"
> > >>"userDN=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=samoper/
> > >>CN=582979/CN=Judit Novak"
> > >>"userFQAN=/ops/Role=lcgadmin/Capability=NULL"
> > >>"userFQAN=/ops/Role=NULL/Capability=NULL"
> > >>"jobID=https://rb113.cern.ch:9000/X3pf3fHmWWZTWr9nY5VvKQ"
> > >>"ceID=oberon.hep.kbfi.ee:2119/jobmanager-lcgpbs-short"
> > >>"lrmsID=42444.oberon.hep.kbfi.ee"
> > >>"timestamp=2008-05-05 10:07:18"
> > >>
> > >>The job may then have been reported in such a way that the lcgpbs job
> > >>manager considered it to have failed. For example, the 'W' state is
> > >>treated like that. In that case you would see a cancellation
> > >>(qdel)
> > >>request in the Torque logs. Can you check what happened to job 42444?
> > >>
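For that check, tracejob on the Torque server is the quickest route: it
collates the server, MOM, scheduler and accounting logs for a single job
(-n is the number of days back to search):

# tracejob -n 3 42444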
> >
> >
> >
> > Cheers
> > Sergio :)
> >
> > ---------------------------------------------
> > Dr. Sergio Maffioletti
> >
> > Grid Group
> > CSCS, Swiss National Supercomputing Centre
> > Via Cantonale
> > CH-6928 Manno
> > Tel: +41916108218
> > Fax: +41916108282
> > email: [log in to unmask]
> > ---------------------------------------------
>
--
Steve Traylen