Hi Sergio,

So... any news about the 15001 error?

I think I'm having a similar problem here.

From the WN logs (/var/spool/pbs/mom_logs/) I can see:

05/07/2008 08:25:00;0080;   pbs_mom;Req;req_reject;Reject reply
code=15001(Unknown Job Id REJHOST=axon-g06.ieeta.pt MSG=modify job
failed, unknown job 6733.axon-g01.ieeta.pt), aux=0, type=ModifyJob, from
[log in to unmask]
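
(For the record, I'm pulling these reject lines out of the mom logs with
something like the command below; the log file name is just today's date
and the grep patterns are only what I happened to look for:)

[root@axon-g06 ~]# grep -h req_reject /var/spool/pbs/mom_logs/20080507 | grep 6733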

The odd thing is that of 4 identical jobs I submitted, 3 concluded
successfully and 1 stays in "BatchHold" forever...

Like the following ones:

[luis@axon-g01 ~]$ showq
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME

6736                 bio012    Running     1  2:22:44:42  Wed May  7 10:20:39
6737                 bio012    Running     1  2:22:47:43  Wed May  7 10:23:40

     2 Active Jobs       2 of    6 Processors Active (33.33%)
                         1 of    3 Nodes Active      (33.33%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

0 Idle Jobs

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

6684               dteam004  BatchHold     1  3:00:00:00  Tue May  6 17:39:42
6687               dteam004  BatchHold     1  3:00:00:00  Tue May  6 17:40:46
6697               dteam004  BatchHold     1  3:00:00:00  Tue May  6 17:47:44
6710                 opssgm  BatchHold     1  3:00:00:00  Tue May  6 22:32:06
6712               dteam004  BatchHold     1  3:00:00:00  Tue May  6 22:58:34
6724                 opssgm  BatchHold     1  3:00:00:00  Wed May  7 03:17:38
6739                 opssgm   Deferred     1  3:00:00:00  Wed May  7 11:30:00

Total Jobs: 9   Active Jobs: 2   Idle Jobs: 0   Blocked Jobs: 7

For instance:

[root@axon-g01 ~]# checkjob 6710


checking job 6710

State: Idle
Creds:  user:opssgm  group:ops  class:ops  qos:DEFAULT
WallTime: 00:00:16 of 3:00:00:00
SubmitTime: Tue May  6 22:32:06
  (Time Queued  Total: 13:05:59  Eligible: 00:00:00)

StartDate: -12:39:13  Tue May  6 22:58:52
Total Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
NodeCount: 1


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 26
PartitionMask: [ALL]
Holds:    Batch  (hold reason:  RMFailure)
Messages:  cannot start job - RM failure, rc: 15057, msg: 'Cannot
execute at specified host because of checkpoint or stagein files'
PE:  1.00  StartPriority:  1000758
cannot select job 6710 for partition DEFAULT (job hold active)

...and...

[root@axon-g01 ~]# diagnose -j 6710
Name                  State Par Proc QOS     WCLimit R  Min     User    Group  Account  QueuedTime  Network  Opsys   Arch    Mem   Disk  Procs  Class Features

6710                   Idle ALL    1 DEF  3:00:00:00 0    1   opssgm    ops        -    12:41:36   [NONE] [NONE] [NONE]    >=0    >=0    NC0  [ops:1] [NONE]
WARNING:  job '6710' has failed to start 26 times
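
(To rule out anything odd in the job itself I also dumped its full
attribute list; qstat -f is plain Torque, and the grep pattern is just my
guess at the fields that could matter for the 15057 message about
checkpoint/stagein files:)

[root@axon-g01 ~]# qstat -f 6710 | grep -Ei 'hold|checkpoint|stage'

Releasing the holds and watching the server log then gives: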


[root@axon-g01 ~]# releasehold -a ALL; tail
-f /var/spool/pbs/server_logs/20080507

job holds adjusted

(...)

05/07/2008 12:00:09;0100;PBS_Server;Req;;Type StatusQueue request
received from [log in to unmask], sock=9
05/07/2008 12:00:09;0100;PBS_Server;Req;;Type StatusJob request received
from [log in to unmask], sock=9
05/07/2008 12:00:09;0100;PBS_Server;Req;;Type ModifyJob request received
from [log in to unmask], sock=9
05/07/2008 12:00:09;0008;PBS_Server;Job;6710.axon-g01.ieeta.pt;Job
Modified at request of [log in to unmask]
05/07/2008 12:00:09;0100;PBS_Server;Req;;Type RunJob request received
from [log in to unmask], sock=9
05/07/2008 12:00:09;0080;PBS_Server;Req;req_reject;Reject reply
code=15057(Cannot execute at specified host because of checkpoint or
stagein files), aux=0, type=RunJob, from [log in to unmask]
05/07/2008 12:00:09;0100;PBS_Server;Req;;Type ModifyJob request received
from [log in to unmask], sock=9
05/07/2008 12:00:09;0008;PBS_Server;Job;6710.axon-g01.ieeta.pt;Job
Modified at request of [log in to unmask]
05/07/2008 12:00:09;0100;PBS_Server;Req;;Type ModifyJob request received
from [log in to unmask], sock=9
05/07/2008 12:00:09;0008;PBS_Server;Job;6724.axon-g01.ieeta.pt;Job
Modified at request of [log in to unmask]
05/07/2008 12:00:09;0100;PBS_Server;Req;;Type RunJob request received
from [log in to unmask], sock=9
05/07/2008 12:00:09;0080;PBS_Server;Req;req_reject;Reject reply
code=15057(Cannot execute at specified host because of checkpoint or
stagein files), aux=0, type=RunJob, from [log in to unmask]
05/07/2008 12:00:09;0100;PBS_Server;Req;;Type ModifyJob request received
from [log in to unmask], sock=9
05/07/2008 12:00:09;0008;PBS_Server;Job;6724.axon-g01.ieeta.pt;Job
Modified at request of [log in to unmask]
05/07/2008 12:00:16;0100;PBS_Server;Req;;Type AuthenticateUser request
received from [log in to unmask], sock=12
05/07/2008 12:00:16;0100;PBS_Server;Req;;Type StatusJob request received
from [log in to unmask], sock=10
05/07/2008 12:00:22;0040;PBS_Server;Svr;axon-g01.ieeta.pt;Scheduler sent
command time
05/07/2008 12:00:22;0100;PBS_Server;Req;;Type StatusNode request
received from [log in to unmask], sock=9


(...)


This problem has been driving me crazy for weeks...

- it doesn't matter whether the firewall is on or off
- the CE/WN connection is there (see the checks below)
- I have reinstalled the WN metapackages
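
(The CE/WN connection checks I'm doing are roughly the ones below; pbsnodes
and momctl come with Torque, and axon-g06 is just one WN taken as an
example:)

[root@axon-g01 ~]# pbsnodes -a | grep -A1 axon-g06
[root@axon-g01 ~]# momctl -d 2 -h axon-g06.ieeta.pt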

Please, any hint would be greatly appreciated!

P.S. - Sorry for the long mail.

Thanks,

Luís


On Mon, 2008-05-05 at 23:43 +0200, Sergio Maffioletti wrote: 
> Hi Mario
> 
> could you detail what problem these two nodes are having?
> we are experiencing a similar problem, except that it is not systematic
> 
> basically we are observing sporadic
> "MOM rejected modify request, error: 15001"
> messages; sometimes the job gets started anyway,
> other times it fails the stagein operation;
> then the job is sent back to the server and placed in Q state, but
> maui does not select it anymore.
> 
> we had two periods of time during last weekend when we observed
> Globus error 94, and we wonder whether the two things are correlated
> with each other or not
> 
> Cheers
> Sergio :)
> 
> On 05, May 2008 02:57 PM, Mario Kadastik <[log in to unmask]> wrote:
> 
> >Well, actually we may have figured out the problem. It seems two
> >workernodes had problems with stageout, but not something one would
> >notice immediately offhand. We have isolated them and now SAM
> >tests seem to be running fine (but we'll have to wait a bit longer to
> >make sure this was indeed the problem). We also ran a separate test of
> >a job on one of those workernodes, and the logging information came back
> >with exactly the known error, so we do hope we have isolated it now. We
> >will know in about 24h if all the SAM tests run through nicely.
> >
> >Mario
> >
> >On May 5, 2008, at 3:36 PM, [log in to unmask] <[log in to unmask]> wrote:
> >
> >>Hi Mario, Ilja,
> >>
> >>>>Anyway, the exact details are available from this ggus ticket:
> >>>>https://gus.fzk.de/pages/ticket_details.php?ticket=35655
> >>>>
> >>>>I have increased the maxproc settings of both "marshal"s as it
> >>>>seemed to be somehow related to the error ( Globus error 94: the
> >>>>jobmanager does not accept any new requests (shutting down)), will
> >>>>see if it helps.
> >>>>
> >>>>Any other ideas are still very welcome!
> >>
> >It appears that the failing jobs were in fact successfully submitted
> >to Torque. For example, in /opt/edg/var/gatekeeper/grid-jobmap_20080505
> >(spaces replaced with newlines for clarity):
> >>
> >>"localUser=11860"
> >>"userDN=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=samoper/
> >>CN=582979/CN=Judit Novak"
> >>"userFQAN=/ops/Role=lcgadmin/Capability=NULL"
> >>"userFQAN=/ops/Role=NULL/Capability=NULL"
> >>"jobID=https://rb113.cern.ch:9000/X3pf3fHmWWZTWr9nY5VvKQ"
> >>"ceID=oberon.hep.kbfi.ee:2119/jobmanager-lcgpbs-short"
> >>"lrmsID=42444.oberon.hep.kbfi.ee"
> >>"timestamp=2008-05-05 10:07:18"
> >>
> >The job then may have been reported in such a way that the lcgpbs job
> >manager considered the job as having failed. For example, the 'W' state
> >is treated like that. In that case you would see a cancellation (qdel)
> >request in the Torque logs. Can you check what happened to job 42444?
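> >
> >(A quick way to check would be something like the following on the
> >Torque server; tracejob ships with Torque, and -n 3 just looks back a
> >few days of logs:)
> >
> >tracejob -n 3 42444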
> >>
> 
> 
> 
> Cheers
> Sergio :)
> 
> ---------------------------------------------
>   Dr. Sergio Maffioletti
>  
>   Grid Group
>   CSCS, Swiss National Supercomputing Centre
>   Via Cantonale
>   CH-6928 Manno
>   Tel: +41916108218
>   Fax: +41916108282
>   email: [log in to unmask]
> ---------------------------------------------