Hi,

thanks for the suggestions. We increased the MaxStartups value in the
sshd_config on our CE to 400 (this is way too much, but it keeps us
well inside the safe zone) and, at the same time, reduced MaxAuthTries
to 3. As the problem is only sporadic, we will monitor the system with
this new config over a long period of time and see how it behaves; we
will keep you updated on this.
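For reference, the relevant lines in /etc/ssh/sshd_config on the CE now
read roughly as follows (the 400 is deliberately oversized, as said
above; the defaults noted in the comments are the stock OpenSSH ones):

    # /etc/ssh/sshd_config on the CE
    MaxStartups 400   # concurrent unauthenticated connections (default: 10)
    MaxAuthTries 3    # authentication attempts per connection (default: 6)

followed by a "service sshd reload" so that sshd picks the change up.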
Thanks again!

Cheers
Sergio :)

On 07, May 2008 01:39 PM, Steve Traylen <[log in to unmask]> wrote:

> 2008/5/7 IEETA_Grid_initiative <[log in to unmask]>:
> > Hi Sergio,
> >
> > So... any news about the 15001 error?
> >
> > I think I'm having a similar problem here.
> >
> > From the WN logs (/var/spool/pbs/mom_logs/) I can see:
> >
> > 05/07/2008 08:25:00;0080;pbs_mom;Req;req_reject;Reject reply
> > code=15001(Unknown Job Id REJHOST=axon-g06.ieeta.pt MSG=modify job
> > failed, unknown job 6733.axon-g01.ieeta.pt), aux=0, type=ModifyJob,
> > from [log in to unmask]
> >
> > The odd thing is that, of 4 identical jobs I submitted, 3 concluded
> > successfully and 1 falls into "BatchHold" for eternity...
>
> Sergio,
>
> How many job slots and how many jobs, typically, do you have in the
> first instance?
>
> You may need to increase the number of permitted scp connections in
> sshd_config on your CE.
>
> On one of the apparently affected WNs, can you run:
>
> # momctl -d 1
>
> You may be able to clear the stale job with:
>
> # momctl -c <jobid> -h <WN.example.org>
>
> I'm not sure whether this will just delete the job, though; I think it
> depends on the retry policy.
>
> http://scotgrid.blogspot.com/2007/04/intervention-at-edinburgh.html
> has a possible solution, but it has never been confirmed whether that
> is even related. It also needs to be updated for newer versions of
> torque.
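> As a rough first check of whether you are hitting sshd's default limit
> of 10 concurrent unauthenticated connections, count the sshd processes
> on the CE while jobs are staging in/out (this overcounts slightly,
> since it includes the listening daemon and authenticated sessions):
>
> # ps ax | grep '[s]shd' | wc -l
>
> On a stock SL/RHEL install you can also look in /var/log/secure on the
> CE for dropped-connection messages around the times the 15001 errors
> appear.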
> > Like the following ones:
> >
> > [luis@axon-g01 ~]$ showq
> > ACTIVE JOBS--------------------
> > JOBNAME    USERNAME   STATE      PROC  REMAINING   STARTTIME
> >
> > 6736       bio012     Running    1     2:22:44:42  Wed May  7 10:20:39
> > 6737       bio012     Running    1     2:22:47:43  Wed May  7 10:23:40
> >
> > 2 Active Jobs   2 of 6 Processors Active (33.33%)
> >                 1 of 3 Nodes Active (33.33%)
> >
> > IDLE JOBS----------------------
> > JOBNAME    USERNAME   STATE      PROC  WCLIMIT     QUEUETIME
> >
> > 0 Idle Jobs
> >
> > BLOCKED JOBS----------------
> > JOBNAME    USERNAME   STATE      PROC  WCLIMIT     QUEUETIME
> >
> > 6684       dteam004   BatchHold  1     3:00:00:00  Tue May  6 17:39:42
> > 6687       dteam004   BatchHold  1     3:00:00:00  Tue May  6 17:40:46
> > 6697       dteam004   BatchHold  1     3:00:00:00  Tue May  6 17:47:44
> > 6710       opssgm     BatchHold  1     3:00:00:00  Tue May  6 22:32:06
> > 6712       dteam004   BatchHold  1     3:00:00:00  Tue May  6 22:58:34
> > 6724       opssgm     BatchHold  1     3:00:00:00  Wed May  7 03:17:38
> > 6739       opssgm     Deferred   1     3:00:00:00  Wed May  7 11:30:00
> >
> > Total Jobs: 9   Active Jobs: 2   Idle Jobs: 0   Blocked Jobs: 7
> >
> > For instance:
> >
> > [root@axon-g01 ~]# checkjob 6710
> >
> > checking job 6710
> >
> > State: Idle
> > Creds:  user:opssgm  group:ops  class:ops  qos:DEFAULT
> > WallTime: 00:00:16 of 3:00:00:00
> > SubmitTime: Tue May  6 22:32:06
> >   (Time Queued  Total: 13:05:59  Eligible: 00:00:00)
> >
> > StartDate: -12:39:13  Tue May  6 22:58:52
> > Total Tasks: 1
> >
> > Req[0]  TaskCount: 1  Partition: ALL
> > Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
> > Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
> > NodeCount: 1
> >
> > IWD: [NONE]  Executable: [NONE]
> > Bypass: 0  StartCount: 26
> > PartitionMask: [ALL]
> > Holds:    Batch  (hold reason: RMFailure)
> > Messages:  cannot start job - RM failure, rc: 15057, msg: 'Cannot
> > execute at specified host because of checkpoint or stagein files'
> > PE:  1.00  StartPriority:  1000758
> > cannot select job 6710 for partition DEFAULT (job hold active)
> >
> > ...and...
> >
> > [root@axon-g01 ~]# diagnose -j 6710
> > Name  State  Par  Proc  QOS  WCLimit  R  Min  User  Group  Account
> >   QueuedTime  Network  Opsys  Arch  Mem  Disk  Procs  Class  Features
> >
> > 6710  Idle  ALL  1  DEF  3:00:00:00  0  1  opssgm  ops  -  12:41:36
> >   [NONE]  [NONE]  [NONE]  >=0  >=0  NC0  [ops:1]  [NONE]
> > WARNING:  job '6710' has failed to start 26 times
> >
> > [root@axon-g01 ~]# releasehold -a ALL; tail -f /var/spool/pbs/server_logs/20080507
> >
> > job holds adjusted
> >
> > (...)
> >
> > 05/07/2008 12:00:09;0100;PBS_Server;Req;;Type StatusQueue request received from [log in to unmask], sock=9
> > 05/07/2008 12:00:09;0100;PBS_Server;Req;;Type StatusJob request received from [log in to unmask], sock=9
> > 05/07/2008 12:00:09;0100;PBS_Server;Req;;Type ModifyJob request received from [log in to unmask], sock=9
> > 05/07/2008 12:00:09;0008;PBS_Server;Job;6710.axon-g01.ieeta.pt;Job Modified at request of [log in to unmask]
> > 05/07/2008 12:00:09;0100;PBS_Server;Req;;Type RunJob request received from [log in to unmask], sock=9
> > 05/07/2008 12:00:09;0080;PBS_Server;Req;req_reject;Reject reply code=15057(Cannot execute at specified host because of checkpoint or stagein files), aux=0, type=RunJob, from [log in to unmask]
> > 05/07/2008 12:00:09;0100;PBS_Server;Req;;Type ModifyJob request received from [log in to unmask], sock=9
> > 05/07/2008 12:00:09;0008;PBS_Server;Job;6710.axon-g01.ieeta.pt;Job Modified at request of [log in to unmask]
> > 05/07/2008 12:00:09;0100;PBS_Server;Req;;Type ModifyJob request received from [log in to unmask], sock=9
> > 05/07/2008 12:00:09;0008;PBS_Server;Job;6724.axon-g01.ieeta.pt;Job Modified at request of [log in to unmask]
> > 05/07/2008 12:00:09;0100;PBS_Server;Req;;Type RunJob request received from [log in to unmask], sock=9
> > 05/07/2008 12:00:09;0080;PBS_Server;Req;req_reject;Reject reply code=15057(Cannot execute at specified host because of checkpoint or stagein files), aux=0, type=RunJob, from [log in to unmask]
> > 05/07/2008 12:00:09;0100;PBS_Server;Req;;Type ModifyJob request received from [log in to unmask], sock=9
> > 05/07/2008 12:00:09;0008;PBS_Server;Job;6724.axon-g01.ieeta.pt;Job Modified at request of [log in to unmask]
> > 05/07/2008 12:00:16;0100;PBS_Server;Req;;Type AuthenticateUser request received from [log in to unmask], sock=12
> > 05/07/2008 12:00:16;0100;PBS_Server;Req;;Type StatusJob request received from [log in to unmask], sock=10
> > 05/07/2008 12:00:22;0040;PBS_Server;Svr;axon-g01.ieeta.pt;Scheduler sent command time
> > 05/07/2008 12:00:22;0100;PBS_Server;Req;;Type StatusNode request received from [log in to unmask], sock=9
> >
> > (...)
> >
> > This problem has been driving me crazy for weeks...
> >
> > - it doesn't matter whether I turn the firewall off
> > - the CE/WN connection exists
> > - I have reinstalled the WN metapackages
> >
> > Please, any hint would be precious!
> >
> > P.S. - Sorry for the long mail.
> >
> > Thanks,
> >
> > Luís
> >
> > On Mon, 2008-05-05 at 23:43 +0200, Sergio Maffioletti wrote:
> > > Hi Mario,
> > >
> > > could you detail what problem these two nodes are having?
> > > We are experiencing a similar problem, except that it is not
> > > systematic.
> > >
> > > Basically, we are observing sporadic
> > > "MOM rejected modify request, error: 15001"
> > > messages; sometimes the job gets started anyway, and other times
> > > the stagein operation fails. The job is then sent back to the
> > > server and placed in the Q state, but maui does not select it
> > > anymore.
> > >
> > > We had two periods during last weekend when we observed Globus
> > > error 94, and we wonder whether the two things are correlated
> > > with each other or not.
> > >
> > > Cheers
> > > Sergio :)
> > >
> > > On 05, May 2008 02:57 PM, Mario Kadastik <[log in to unmask]> wrote:
> > >
> > > > Well, actually we may have figured out the problem. It seems two
> > > > worker nodes had problems with stageout, but not something one
> > > > would notice immediately out of hand. We have isolated them, and
> > > > now the SAM tests seem to be running fine (but we'll have to wait
> > > > a bit longer to make sure this was indeed the problem). We also
> > > > ran a separate test job on one of those worker nodes, and the
> > > > logging information came back with exactly the known error, so we
> > > > do hope we have isolated it now. We will know in about 24h if all
> > > > the SAM tests run through nicely.
> > > >
> > > > Mario
> > > >
> > > > On May 5, 2008, at 3:36 PM, <[log in to unmask]> wrote:
> > > >
> > > > > Hi Mario, Ilja,
> > > > >
> > > > > > > Anyway, the exact details are available from this ggus ticket:
> > > > > > > https://gus.fzk.de/pages/ticket_details.php?ticket=35655
> > > > > > >
> > > > > > > I have increased the maxproc settings of both "marshal"s, as
> > > > > > > it seemed to be somehow related to the error (Globus error
> > > > > > > 94: the jobmanager does not accept any new requests
> > > > > > > (shutting down)); we will see if it helps.
> > > > > > >
> > > > > > > Any other ideas are still very welcome!
> > > > >
> > > > > It appears that the failing jobs were in fact successfully
> > > > > submitted to Torque. For example, in
> > > > > /opt/edg/var/gatekeeper/grid-jobmap_20080505
> > > > > (spaces replaced with newlines for clarity):
> > > > >
> > > > > "localUser=11860"
> > > > > "userDN=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=samoper/CN=582979/CN=Judit Novak"
> > > > > "userFQAN=/ops/Role=lcgadmin/Capability=NULL"
> > > > > "userFQAN=/ops/Role=NULL/Capability=NULL"
> > > > > "jobID=https://rb113.cern.ch:9000/X3pf3fHmWWZTWr9nY5VvKQ"
> > > > > "ceID=oberon.hep.kbfi.ee:2119/jobmanager-lcgpbs-short"
> > > > > "lrmsID=42444.oberon.hep.kbfi.ee"
> > > > > "timestamp=2008-05-05 10:07:18"
> > > > >
> > > > > The job may then have been reported in such a way that the
> > > > > lcgpbs job manager considered it to have failed; the 'W' state,
> > > > > for example, is treated like that. In that case you would see a
> > > > > cancellation (qdel) request in the Torque logs. Can you check
> > > > > what happened to job 42444?
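> > > > > On the Torque server, tracejob should show the job's history
> > > > > in one go; the "-n 3" below is just an example window of three
> > > > > days of logs:
> > > > >
> > > > > # tracejob -n 3 42444
> > > > >
> > > > > or, equivalently, grep the server logs directly:
> > > > >
> > > > > # grep 42444 /var/spool/pbs/server_logs/200805*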
> > >
> > > Cheers
> > > Sergio :)
> > >
> > > ---------------------------------------------
> > > Dr. Sergio Maffioletti
> > >
> > > Grid Group
> > > CSCS, Swiss National Supercomputing Centre
> > > Via Cantonale
> > > CH-6928 Manno
> > > Tel: +41916108218
> > > Fax: +41916108282
> > > email: [log in to unmask]
> > > ---------------------------------------------
>
> --
> Steve Traylen

Cheers
Sergio :)

---------------------------------------------
Dr. Sergio Maffioletti

Grid Group
CSCS, Swiss National Supercomputing Centre
Via Cantonale
CH-6928 Manno
Tel: +41916108218
Fax: +41916108282
email: [log in to unmask]
---------------------------------------------