Hi Patrick,
My e-science stops here.
However, from my own experience with local/public interfaces I would
suggest the following:
- use only one set of yaim definition files for all nodes, service or WN.
- refer to your service nodes by their public names only, and to the WNs
by their full local names (host.localdomain).
- do not alter /etc/hosts; keep it clean with just the localhost line,
free of the host's short name, etc.
- on my CE, /etc/hosts.equiv holds the WN names.
- leave routing to the routing tables; SNAT the WNs' traffic only when it
goes to the outside world, not when it goes to your service nodes (and
UI) - see the sketch below.
- let yaim configure all your nodes and... may the Force be with you!
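
For illustration only - the interface, subnet and second WN name below are
hypothetical, adjust them to your site - the hosts/hosts.equiv/SNAT points
could look like this:

  # /etc/hosts on every node: just the localhost line
  127.0.0.1   localhost.localdomain localhost

  # /etc/hosts.equiv on the CE: one WN per line
  compute-0-0.localdomain
  compute-0-1.localdomain

  # SNAT only what the WNs send to the world, not to the local subnet
  iptables -t nat -A POSTROUTING -s 192.168.0.0/24 ! -d 192.168.0.0/24 \
           -o eth0 -j SNAT --to-source 1.2.3.4   # 1.2.3.4 = your public IP
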
Regards,
Dan
P.S. I still have some tricks in store for special cases, if you know
what I mean :-) . For instance, scripts that roll back the LCG 2.4 rpms
or, more recently, the LCG 2.6 rpms, so that one can start a yaim
installation and configuration procedure afresh. Everything is on sale
'cause I'm shutting down operations in the EGEE area; if anyone wants
them, they have but to ask.
Patrick Guio wrote:
>On Fri, 9 Dec 2005 13:23:23 +0200, Dan Schrager <[log in to unmask]> wrote:
>
>Hi Dan,
>
>If you have a look at the other thread I started (maui/torque trouble),
>you'll see Vega's answer explaining internal and external interfaces:
>http://www.listserv.rl.ac.uk/cgi-bin/webadmin?A2=ind0512&L=lcg-rollout&F=&S=&X=69338A7BA3664D697B&Y=patrick.guio%40bccs.uib.no&P=14331
>
>It fixed the problem of submitting and running a simple job.
>Now I am trying to understand why globus jobs fail to run.
>
>I mentioned error message 15001 in the first mail of that thread.
>In the mom log:
>
>pbs_mom;Req;req_reject;Reject reply code=15001, aux=0, type=11, from [log in to unmask]
>
>In the server log:
>
>33.grid.bccs.uib.no;MOM rejected modify request, error: 15001
>req_reject;Reject reply code=15001, aux=0, type=11, from [log in to unmask]
>
>I found the error codes in the TORQUE administrator's manual
>(http://www.clusterresources.com/products/torque/docs20/a.derrorcodes.shtml):
>
>15001 means "Unknown Job Identifier".
>
>Has anyone had similar problems?
>
>Cheers,
>
>Patrick
>
>>Hi Patrick,
>>
>>What interfaces ?!
>>
>>try editing /var/spool/pbs/mom_priv/config by hand on the WNs, and there
>>name the CE by its local name - the one the WNs see on their local
>>network - in both the $clienthost and $restricted lines; then do
>>"service pbs_mom restart" on the WNs.
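>>
>>e.g. (ce.localdomain here is just a placeholder - use the CE name your
>>WNs actually resolve on the local network):
>>
>>  # /var/spool/pbs/mom_priv/config on each WN
>>  $clienthost   ce.localdomain
>>  $restricted   ce.localdomain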
>>
>>Regards,
>>Dan
>>
>>
>>Patrick Guio wrote:
>>
>>>On Fri, 9 Dec 2005 06:53:22 +0200, Dan Schrager <[log in to unmask]> wrote:
>>>
>>>Hi Dan,
>>>
>>>Yes, I can su - dteam001 on the WN and ssh back to the CE via both the
>>>internal and external interfaces.
>>>
>>>
>>>>on the WN you should be able to do:
>>>>
>>>>su - dteam001
>>>>ssh CEhost
>>>>
>>>>without a password. Can you?
>>>>
>>>>Patrick Guio wrote:
>>>>
>>>>>Dear all,
>>>>>
>>>>>I now have what seems to be a working queue system between the CE and the WN.
>>>>>I have the following queues:
>>>>>% qstat -q
>>>>>
>>>>>server: grid.bccs.uib.no
>>>>>
>>>>>Queue Memory CPU Time Walltime Node Run Que Lm State
>>>>>---------------- ------ -------- -------- ---- --- --- -- -----
>>>>>dteam -- 48:00:00 72:00:00 -- 0 0 -- E R
>>>>>default -- 48:00:00 72:00:00 -- 0 0 -- E R
>>>>>
>>>>>I can submit and run jobs with my userid on the default queue (which is
>>>>>not actually the default - dteam is :-):
>>>>>% qsub -q default test-pbs.sh
>>>>>31.grid.bccs.uib.no
>>>>>% ls -l test-pbs.sh.?31
>>>>>-rw------- 1 patrickg patrickg 0 Dec 8 22:57 test-pbs.sh.e31
>>>>>-rw------- 1 patrickg patrickg 161 Dec 8 22:57 test-pbs.sh.o31
>>>>>
>>>>>I can also su - dteam001 and submit and run jobs without problems:
>>>>>[dteam001@grid dteam001]$ qsub pbs_sub
>>>>>32.grid.bccs.uib.no
>>>>>[dteam001@grid dteam001]$ ls -l pbs_sub.?32
>>>>>-rw------- 1 dteam001 dteam 0 Dec 8 22:59 pbs_sub.e32
>>>>>-rw------- 1 dteam001 dteam 29 Dec 8 22:59 pbs_sub.o32
>>>>>
>>>>>When running a globus job like:
>>>>>% globus-job-run grid.bccs.uib.no:2119/jobmanager-lcgpbs -queue dteam
>>>>>/bin/hostname
>>>>>I can see that a job is submitted by user dteam001
>>>>>% qstat
>>>>>Job id Name User Time Use S Queue
>>>>>---------------- ---------------- ---------------- -------- - -----
>>>>>33.grid STDIN dteam001 0 Q dteam
>>>>>
>>>>>But after a while globus-job-run exits without any output.
>>>>>
>>>>>On the WN I can see logs for these jobs in the file
>>>>>/var/spool/pbs/mom_logs/20051208:
>>>>>
>>>>>pbs_mom;Job;31.grid.bccs.uib.no;using transient tmpdir /var/spool/pbs/31.grid.bccs.uib.no
>>>>>pbs_mom;Job;31.grid.bccs.uib.no;Started, pid = 1750
>>>>>pbs_mom;Job;31.grid.bccs.uib.no;scan_for_terminated: task 1 terminated, sid 1750
>>>>>pbs_mom;Job;31.grid.bccs.uib.no;Terminated
>>>>>pbs_mom;Job;31.grid.bccs.uib.no;Removing transient job directory /var/spool/pbs/31.grid.bccs.uib.no
>>>>>pbs_mom;Job;31.grid.bccs.uib.no;Obit sent
>>>>>
>>>>>and similarly for job 32.grid.bccs.uib.no submitted by user dteam001.
>>>>>
>>>>>When it comes to job 33.grid.bccs.uib.no, submitted by globus, I can see
>>>>>the following output:
>>>>>
>>>>>pbs_mom;Req;req_reject;Reject reply code=15001, aux=0, type=11, from [log in to unmask]
>>>>>pbs_mom;Fil;sys_copy;command: /usr/bin/scp -Br [log in to unmask]:/home/dteam001/.lcgjm/globus-cache-export.VT6VlM/globus-cache-export.VT6VlM.gpg globus-cache-export.VT6VlM.gpg status=1 (copy request failed), try=1
>>>>>pbs_mom;Fil;sys_copy;command: /usr/sbin/pbs_rcp -r [log in to unmask]:/home/dteam001/.lcgjm/globus-cache-export.VT6VlM/globus-cache-export.VT6VlM.gpg globus-cache-export.VT6VlM.gpg status=1 (copy request failed), try=2
>>>>>pbs_mom;Fil;sys_copy;command: /usr/bin/scp -Br [log in to unmask]:/home/dteam001/.lcgjm/globus-cache-export.VT6VlM/globus-cache-export.VT6VlM.gpg globus-cache-export.VT6VlM.gpg status=1 (copy request failed), try=3
>>>>>pbs_mom;Fil;sys_copy;command: /usr/sbin/pbs_rcp -r [log in to unmask]:/home/dteam001/.lcgjm/globus-cache-export.VT6VlM/globus-cache-export.VT6VlM.gpg globus-cache-export.VT6VlM.gpg status=1 (copy request failed), try=4
>>>>>pbs_mom;Fil;globus-cache-export.VT6VlM.gpg;Unable to copy file globus-cache-export.VT6VlM.gpg from grid.bccs.uib.no:/home/dteam001/.lcgjm/globus-cache-export.VT6VlM/globus-cache-export.VT6VlM.gpg
>>>>>pbs_mom;Fil;globus-cache-export.VT6VlM.gpg;grid.bccs.uib.no: Connection refused
>>>>>pbs_mom;Fil;globus-cache-export.VT6VlM.gpg;ssion denied
>>>>>pbs_mom;Req;del_files;cannot stat globus-cache-export.VT6VlM.gpg
>>>>>
>>>>>On the server side (CE), in /var/spool/pbs/server_logs/20051208, I can
>>>>>see the log for 31.grid.bccs.uib.no:
>>>>>
>>>>>31.grid.bccs.uib.no;enqueuing into default, state 1 hop 1
>>>>>31.grid.bccs.uib.no;Job Queued at request of [log in to unmask], owner = [log in to unmask], job name = test-pbs.sh, queue = default
>>>>>31.grid.bccs.uib.no;Job Modified at request of [log in to unmask]
>>>>>31.grid.bccs.uib.no;Job Run at request of [log in to unmask]
>>>>>31.grid.bccs.uib.no;Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=00:00:00
>>>>>31.grid.bccs.uib.no;dequeuing from default, state 5
>>>>>
>>>>>The 32.grid.bccs.uib.no job log looks similar, with six 32.grid.bccs.uib.no
>>>>>statements.
>>>>>
>>>>>The 33.grid.bccs.uib.no job log is also similar, with six 33.grid.bccs.uib.no
>>>>>statements, but then there is this error:
>>>>>
>>>>>33.grid.bccs.uib.no;MOM rejected modify request, error: 15001
>>>>>req_reject;Reject reply code=15001, aux=0, type=11, from [log in to unmask]
>>>>>
>>>>>It looks like an ssh-related problem, but I cannot understand it, since
>>>>>I am able to ssh/scp -B as user dteam001 from the CE (resp. WN) to the
>>>>>WN (resp. CE) without a password or passphrase.
>>>>>
>>>>>On the WN:
>>>>>[dteam001@compute-0-0 dteam001]$ scp -B [log in to unmask]:pbs_sub.o32 junk
>>>>>[dteam001@compute-0-0 dteam001]$ scp -B junk [log in to unmask]:pbs_sub.o32
>>>>>
>>>>>on the CE:
>>>>>[dteam001@grid dteam001]$ scp -B [log in to unmask]:pbs_sub.o32 junk
>>>>>[dteam001@grid dteam001]$ scp -B junk [log in to unmask]:pbs_sub.o32
>>>>>
>>>>>If I run a new globus job, I can check on the WN (or on the CE, since
>>>>>dteam001's $HOME is an exported filesystem) that a directory of this
>>>>>type is indeed created:
>>>>>[dteam001@compute-0-0 dteam001]$ ls .lcgjm/globus-cache-export.aNNBpH/
>>>>>cache_export_dir.tar  export.3  export.txt                      stage_in.txt
>>>>>export.1              export.4  file_cleanup.txt                stage_out.txt
>>>>>export.2              export.5  globus-cache-export.aNNBpH.gpg  stdstreams.txt
>
>
>>>>>So the file
>>>>>[log in to unmask]:/home/dteam001/.lcgjm/globus-cache-export.aNNBpH/globus-cache-export.aNNBpH.gpg
>>>>>is there, but for some reason scp and pbs_rcp fail to copy it.
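>>>>>
>>>>>(Since dteam001's $HOME is shared between the CE and the WNs, I wonder
>>>>>whether Torque's $usecp mom directive could bypass the remote copy
>>>>>altogether. A sketch, assuming /home is mounted at the same path on all
>>>>>nodes:
>>>>>
>>>>>  # /var/spool/pbs/mom_priv/config on each WN
>>>>>  $usecp *.bccs.uib.no:/home /home
>>>>>
>>>>>With this, pbs_mom would use a local cp instead of scp/pbs_rcp for
>>>>>files under /home.)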
>>>>>
>>>>>Has anyone experienced this kind of situation? Any idea what error id
>>>>>15001 is?
>>>>>
>>>>>Any help would be appreciated.
>>>>>
>>>>>Sincerely,
>>>>>
>>>>>Patrick
>>>>>