Hi,
Is your firewall accepting connections on ports 15001, 15002, 15003 and
15004 (PBS)?
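
A quick way to test this from another node is a plain TCP connect to each
port. The sketch below is only illustrative: the port-to-daemon mapping
reflects the usual TORQUE defaults (an assumption, your site may use
others), the helper names port_open and check_pbs_ports are made up for
this example, and the host name is the one from your qstat output. A
successful connect only shows that the port is reachable, not that PBS
itself is healthy.

```python
import socket

# Assumed default TORQUE/PBS port assignments; a site can override these.
PBS_PORTS = {
    15001: "pbs_server",
    15002: "pbs_mom",
    15003: "pbs_mom resource manager",
    15004: "scheduler",
}


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def check_pbs_ports(host, ports=PBS_PORTS):
    """Map each port in `ports` to True/False depending on reachability."""
    return {port: port_open(host, port) for port in ports}


if __name__ == "__main__":
    host = "ce.hpc.csie.thu.edu.tw"  # host name taken from the qstat output
    for port, reachable in check_pbs_ports(host).items():
        state = "open" if reachable else "closed or filtered"
        print(f"{host}:{port} ({PBS_PORTS[port]}): {state}")
```

Run it from a worker node against the server, and from the server against
a worker node, since the daemons talk in both directions.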
BR,
Luís
On Fri, 2008-04-11 at 23:32 +0800, jwhuang wrote:
> Hi All,
>
> I found that the job was rejected by MOM with error code 15001.
> Any idea what causes this?
> Thanks a lot.
>
> Br,
> Jhen-Wei
> ---------------------------------------------------------------
> # qstat -a
> ce.hpc.csie.thu.edu.tw:
>
>                                                                    Req'd  Req'd   Elap
> Job ID               Username Queue    Jobname    SessID NDS   TSK Memory Time  S Time
> -------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
> 197.ce.hpc.csie.thu. opssgm   ops      STDIN       17449     1  --     -- 48:00 R 00:00
> 198.ce.hpc.csie.thu. opssgm   ops      STDIN          --     1  --     -- 48:00 Q    --
> 199.ce.hpc.csie.thu. opssgm   ops      STDIN          --     1  --     -- 48:00 Q    --
>
>
> # tracejob 197
> /var/spool/pbs/mom_logs/20080411: No matching job records located
> /var/spool/pbs/sched_logs/20080411: No such file or directory
>
> Job: 197.ce.hpc.csie.thu.edu.tw
>
> 04/11/2008 14:57:05 S enqueuing into ops, state 1 hop 1
> 04/11/2008 14:57:05 S Job Queued at request of [log in to unmask], owner = [log in to unmask], job name = STDIN, queue = ops
> 04/11/2008 14:57:05 S Job Modified at request of [log in to unmask]
> 04/11/2008 14:57:05 S Job Run at request of [log in to unmask]
> 04/11/2008 14:57:05 S Job Modified at request of [log in to unmask]
> 04/11/2008 14:57:05 S MOM rejected modify request, error: 15001
> 04/11/2008 14:57:05 A queue=ops
> 04/11/2008 14:57:05 A user=opssgm group=ops jobname=STDIN queue=ops ctime=1207925824 qtime=1207925825 etime=1207925825 start=1207925825 exec_host=ce.hpc.csie.thu.edu.tw/0 Resource_List.cput=48:00:00 Resource_List.neednodes=1 Resource_List.nodect=1 Resource_List.nodes=1 Resource_List.walltime=72:00:00
>
>
>
> # tail /var/spool/pbs/mom_logs/20080411
> 04/11/2008 14:52:34;0002; pbs_mom;Svr;ideal_load;2
> 04/11/2008 14:52:34;0080; pbs_mom;n/a;add_static;config[0] add name ideal_load value 2
> 04/11/2008 14:52:34;0002; pbs_mom;Svr;max_load;2
> 04/11/2008 14:52:34;0080; pbs_mom;n/a;add_static;config[0] add name max_load value 2
> 04/11/2008 14:52:34;0002; pbs_mom;n/a;initialize;independent
> 04/11/2008 14:52:34;0002; pbs_mom;Svr;pbs_mom;Is up
> 04/11/2008 14:52:34;0002; pbs_mom;Svr;mom_main;MOM executable path and mtime at launch: /usr/sbin/pbs_mom 1190384299
> 04/11/2008 14:52:34;0002; pbs_mom;n/a;mom_main;hello sent to server ce.hpc.csie.thu.edu.tw
> 04/11/2008 14:57:05;0080; pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id REJHOST=ce.hpc.csie.thu.edu.tw MSG=modify job failed, unknown job 197.ce.hpc.csie.thu.edu.tw), aux=0, type=ModifyJob, from [log in to unmask]
> 04/11/2008 14:57:05;0001; pbs_mom;Job;TMomFinalizeJob3;job 197.ce.hpc.csie.thu.edu.tw started, pid = 17449
>
>
> --
> OPS Team, ASGC
> Tel: +886-2-2788-0058 #1005