Hi Sergio,

So... any news about the 15001 error? I think I'm having a similar problem here. From the WN logs (/var/spool/pbs/mom_logs/) I can see:

05/07/2008 08:25:00;0080; pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id REJHOST=axon-g06.ieeta.pt MSG=modify job failed, unknown job 6733.axon-g01.ieeta.pt), aux=0, type=ModifyJob, from [log in to unmask]

The odd thing is that, of 4 identical jobs I submitted, 3 concluded successfully and 1 falls into "BatchHold" for eternity. Like the following ones:

[luis@axon-g01 ~]$ showq
ACTIVE JOBS--------------------
JOBNAME    USERNAME  STATE      PROC  REMAINING            STARTTIME

6736       bio012    Running       1  2:22:44:42   Wed May  7 10:20:39
6737       bio012    Running       1  2:22:47:43   Wed May  7 10:23:40

     2 Active Jobs      2 of   6 Processors Active (33.33%)
                        1 of   3 Nodes Active      (33.33%)

IDLE JOBS----------------------
JOBNAME    USERNAME  STATE      PROC  WCLIMIT              QUEUETIME

0 Idle Jobs

BLOCKED JOBS----------------
JOBNAME    USERNAME  STATE      PROC  WCLIMIT              QUEUETIME

6684       dteam004  BatchHold     1  3:00:00:00   Tue May  6 17:39:42
6687       dteam004  BatchHold     1  3:00:00:00   Tue May  6 17:40:46
6697       dteam004  BatchHold     1  3:00:00:00   Tue May  6 17:47:44
6710       opssgm    BatchHold     1  3:00:00:00   Tue May  6 22:32:06
6712       dteam004  BatchHold     1  3:00:00:00   Tue May  6 22:58:34
6724       opssgm    BatchHold     1  3:00:00:00   Wed May  7 03:17:38
6739       opssgm    Deferred      1  3:00:00:00   Wed May  7 11:30:00

Total Jobs: 9   Active Jobs: 2   Idle Jobs: 0   Blocked Jobs: 7

For instance:

[root@axon-g01 ~]# checkjob 6710
checking job 6710

State: Idle
Creds:  user:opssgm  group:ops  class:ops  qos:DEFAULT
WallTime: 00:00:16 of 3:00:00:00
SubmitTime: Tue May  6 22:32:06
  (Time Queued  Total: 13:05:59  Eligible: 00:00:00)

StartDate: -12:39:13  Tue May  6 22:58:52
Total Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
NodeCount: 1

IWD: [NONE]  Executable: [NONE]
Bypass: 0  StartCount: 26
PartitionMask: [ALL]
Holds:    Batch  (hold reason: RMFailure)
Messages:  cannot start job - RM failure, rc: 15057, msg: 'Cannot execute at specified host because of checkpoint or stagein files'
PE: 1.00  StartPriority: 1000758
cannot select job 6710 for partition DEFAULT (job hold active)

...and...

[root@axon-g01 ~]# diagnose -j 6710
Name  State Par Proc QOS  WCLimit    R Min User   Group Account QueuedTime Network Opsys  Arch   Mem Disk Procs Class   Features

6710  Idle  ALL    1 DEF  3:00:00:00 0   1 opssgm ops   -       12:41:36   [NONE]  [NONE] [NONE] >=0 >=0  NC0   [ops:1] [NONE]

WARNING:  job '6710' has failed to start 26 times
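In case it is relevant: the rc 15057 suggests the MOM on the execution host still has files for a job the server no longer tracks, so this is what I plan to check on the affected WN before retrying the release shown below. (Just a sketch: the paths assume Torque's default spool layout under /var/spool/pbs, and the init-script name is a guess for our install.)

# On the WN: any leftover files for the stuck job in the MOM spool?
[root@axon-g06 ~]# ls /var/spool/pbs/mom_priv/jobs/ | grep 6710
[root@axon-g06 ~]# ls /var/spool/pbs/undelivered/ | grep 6710

# If stale files remain for a job the server has forgotten:
# stop the MOM, remove them, restart.
[root@axon-g06 ~]# /etc/init.d/pbs_mom stop
[root@axon-g06 ~]# rm -f /var/spool/pbs/mom_priv/jobs/6710.axon-g01.ieeta.pt.*
[root@axon-g06 ~]# /etc/init.d/pbs_mom start

# Last resort, on the server: force-purge the job.
[root@axon-g01 ~]# qdel -p 6710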
Releasing the holds only reproduces the same rejection:

[root@axon-g01 ~]# releasehold -a ALL; tail -f /var/spool/pbs/server_logs/20080507
job holds adjusted

(...)
05/07/2008 12:00:09;0100;PBS_Server;Req;;Type StatusQueue request received from [log in to unmask], sock=9
05/07/2008 12:00:09;0100;PBS_Server;Req;;Type StatusJob request received from [log in to unmask], sock=9
05/07/2008 12:00:09;0100;PBS_Server;Req;;Type ModifyJob request received from [log in to unmask], sock=9
05/07/2008 12:00:09;0008;PBS_Server;Job;6710.axon-g01.ieeta.pt;Job Modified at request of [log in to unmask]
05/07/2008 12:00:09;0100;PBS_Server;Req;;Type RunJob request received from [log in to unmask], sock=9
05/07/2008 12:00:09;0080;PBS_Server;Req;req_reject;Reject reply code=15057(Cannot execute at specified host because of checkpoint or stagein files), aux=0, type=RunJob, from [log in to unmask]
05/07/2008 12:00:09;0100;PBS_Server;Req;;Type ModifyJob request received from [log in to unmask], sock=9
05/07/2008 12:00:09;0008;PBS_Server;Job;6710.axon-g01.ieeta.pt;Job Modified at request of [log in to unmask]
05/07/2008 12:00:09;0100;PBS_Server;Req;;Type ModifyJob request received from [log in to unmask], sock=9
05/07/2008 12:00:09;0008;PBS_Server;Job;6724.axon-g01.ieeta.pt;Job Modified at request of [log in to unmask]
05/07/2008 12:00:09;0100;PBS_Server;Req;;Type RunJob request received from [log in to unmask], sock=9
05/07/2008 12:00:09;0080;PBS_Server;Req;req_reject;Reject reply code=15057(Cannot execute at specified host because of checkpoint or stagein files), aux=0, type=RunJob, from [log in to unmask]
05/07/2008 12:00:09;0100;PBS_Server;Req;;Type ModifyJob request received from [log in to unmask], sock=9
05/07/2008 12:00:09;0008;PBS_Server;Job;6724.axon-g01.ieeta.pt;Job Modified at request of [log in to unmask]
05/07/2008 12:00:16;0100;PBS_Server;Req;;Type AuthenticateUser request received from [log in to unmask], sock=12
05/07/2008 12:00:16;0100;PBS_Server;Req;;Type StatusJob request received from [log in to unmask], sock=10
05/07/2008 12:00:22;0040;PBS_Server;Svr;axon-g01.ieeta.pt;Scheduler sent command time
05/07/2008 12:00:22;0100;PBS_Server;Req;;Type StatusNode request received from [log in to unmask], sock=9
(...)

This problem has been driving me crazy for weeks:
- it doesn't matter whether I turn the firewall off;
- the CE/WN connection exists;
- I have reinstalled the WN metapackages.
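By the way, the StartCount of 26 and the Deferred job above look like Maui's defer logic at work. As far as I understand it, these maui.cfg parameters control that cycle (the values shown are, I believe, the defaults, and the file path depends on the install):

# /var/spool/maui/maui.cfg
DEFERSTARTCOUNT  1         # failed start attempts before a job is deferred
DEFERTIME        1:00:00   # how long a deferred job is ignored before being retried
DEFERCOUNT       24        # defers allowed before the job is given a BatchHold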
Please, any hint will be precious!

P.S. - Sorry for the long mail.

Thanks,
Luís

On Mon, 2008-05-05 at 23:43 +0200, Sergio Maffioletti wrote:
> Hi Mario,
>
> could you detail what problem these two nodes are having?
> We are experiencing a similar problem, except that it is not systematic.
>
> Basically we are observing sporadic
> "MOM rejected modify request, error: 15001"
> messages; sometimes the job gets started anyway,
> some other times it fails the staging operation.
> The job is then sent back to the server and is placed in Q state, but
> Maui does not select it anymore.
>
> We had two periods of time during last weekend when we observed the
> Globus error 94, and we wonder whether the two things are correlated
> with each other or not.
>
> Cheers
> Sergio :)
>
> On 05, May 2008 02:57 PM, Mario Kadastik <[log in to unmask]> wrote:
>
> >Well actually we may have figured out the problem. It seems two
> >worker nodes had problems with stageout, but not something one would
> >notice immediately out of hand. We have isolated them and now SAM
> >tests seem to be running fine (but we'll have to wait a bit longer to
> >make sure this was the problem indeed). We also ran a separate test of
> >a job on one of those worker nodes, and the logging information came back
> >with exactly the known error, so we do hope we have isolated it now. We
> >will know in about 24h if all the SAM tests run through nicely.
> >
> >Mario
> >
> >On May 5, 2008, at 3:36 PM, <[log in to unmask]> <[log in to unmask]> wrote:
> >
> >>Hi Mario, Ilja,
> >>
> >>>>Anyway, the exact details are available from this ggus ticket:
> >>>>https://gus.fzk.de/pages/ticket_details.php?ticket=35655
> >>>>
> >>>>I have increased the maxproc settings of both "marshal"s as it
> >>>>seemed to be somehow related to the error (Globus error 94: the
> >>>>jobmanager does not accept any new requests (shutting down)); will
> >>>>see if it helps.
> >>>>
> >>>>Any other ideas are still very welcome!
> >>
> >>It appears that the failing jobs were in fact successfully submitted
> >>to Torque. For example, in /opt/edg/var/gatekeeper/grid-jobmap_20080505
> >>(spaces replaced with newlines for clarity):
> >>
> >>"localUser=11860"
> >>"userDN=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=samoper/CN=582979/CN=Judit Novak"
> >>"userFQAN=/ops/Role=lcgadmin/Capability=NULL"
> >>"userFQAN=/ops/Role=NULL/Capability=NULL"
> >>"jobID=https://rb113.cern.ch:9000/X3pf3fHmWWZTWr9nY5VvKQ"
> >>"ceID=oberon.hep.kbfi.ee:2119/jobmanager-lcgpbs-short"
> >>"lrmsID=42444.oberon.hep.kbfi.ee"
> >>"timestamp=2008-05-05 10:07:18"
> >>
> >>The job may then have been reported in such a way that the lcgpbs job
> >>manager considered it as having failed. For example, the 'W' state
> >>is treated like that. In that case you would see a cancellation (qdel)
> >>request in the Torque logs. Can you check what happened to job 42444?
> >>
>
> Cheers
> Sergio :)
>
> ---------------------------------------------
> Dr. Sergio Maffioletti
>
> Grid Group
> CSCS, Swiss National Supercomputing Centre
> Via Cantonale
> CH-6928 Manno
> Tel: +41916108218
> Fax: +41916108282
> email: [log in to unmask]
> ---------------------------------------------
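P.P.S. - Regarding the last question quoted above: on our CE I would trace what Torque did with a job using something like the following (assuming the default spool layout and that your Torque build ships the tracejob utility; the prompt host is just illustrative):

# Collate the server/MOM/scheduler log entries for the job over the last 2 days:
[root@ce ~]# tracejob -n 2 42444

# Or grep the server log directly for a cancellation:
[root@ce ~]# grep 42444 /var/spool/pbs/server_logs/20080505 | grep -i delete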