Hi Patrick,
Which interfaces?!
Try editing /var/spool/pbs/mom_priv/config by hand on the WNs, and there
name the CE by its local name - the one the WNs see on their local
network - in both the $clienthost and $restricted lines; then do
service pbs_mom restart on the WNs.
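For example (a sketch only - "ce.local" below is a placeholder for whatever
name the WNs resolve for the CE on their internal network):

```shell
# /var/spool/pbs/mom_priv/config on each WN -- "ce.local" is a placeholder
# for the CE's name on the WNs' local network, not its public name:
#
#   $clienthost ce.local
#   $restricted ce.local
#
# then restart the mom so it re-reads its config:
service pbs_mom restart
```

$clienthost names the host the mom accepts server connections (and file
copies) from, and $restricted lists hosts allowed to connect on non-privileged
ports, so both must match the name the WNs actually resolve for the CE.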
Regards,
Dan
Patrick Guio wrote:
>On Fri, 9 Dec 2005 06:53:22 +0200, Dan Schrager <[log in to unmask]> wrote:
>
>Hi Dan,
>
>Yes, I can su - dteam001 on the WN and ssh back to the CE via both the
>internal and the external interfaces.
>
>
>
>>on the WN you should be able to:
>>
>>su - dteam001
>>ssh CEhost
>>without password
>>
>>are you?
>>
>>Patrick Guio wrote:
>>
>>
>>
>>>Dear all,
>>>
>>>I now have what seems to be a working queue system between the CE and the WNs.
>>>I have the following queues:
>>>% qstat -q
>>>
>>>server: grid.bccs.uib.no
>>>
>>>Queue Memory CPU Time Walltime Node Run Que Lm State
>>>---------------- ------ -------- -------- ---- --- --- -- -----
>>>dteam -- 48:00:00 72:00:00 -- 0 0 -- E R
>>>default -- 48:00:00 72:00:00 -- 0 0 -- E R
>>>
>>>I can submit and run jobs with my userid on the default queue (which,
>>>despite its name, is not the default; dteam is :-)
>>>% qsub -q default test-pbs.sh
>>>31.grid.bccs.uib.no
>>>% ls -l test-pbs.sh.?31
>>>-rw------- 1 patrickg patrickg 0 Dec 8 22:57 test-pbs.sh.e31
>>>-rw------- 1 patrickg patrickg 161 Dec 8 22:57 test-pbs.sh.o31
>>>
>>>I can also su - dteam001 and submit and run jobs without problem:
>>>[dteam001@grid dteam001]$ qsub pbs_sub
>>>32.grid.bccs.uib.no
>>>[dteam001@grid dteam001]$ ls -l pbs_sub.?32
>>>-rw------- 1 dteam001 dteam 0 Dec 8 22:59 pbs_sub.e32
>>>-rw------- 1 dteam001 dteam 29 Dec 8 22:59 pbs_sub.o32
>>>
>>>When running a globus job like:
>>>% globus-job-run grid.bccs.uib.no:2119/jobmanager-lcgpbs -queue dteam
>>>/bin/hostname
>>>I can see that a job is submitted by user dteam001:
>>>% qstat
>>>Job id Name User Time Use S Queue
>>>---------------- ---------------- ---------------- -------- - -----
>>>33.grid STDIN dteam001 0 Q dteam
>>>
>>>But after a while the globus-job-run exits without any output.
>>>
>>>On the WN I can see logs for these jobs in the file
>>>/var/spool/pbs/mom_logs/20051208:
>>>
>>>pbs_mom;Job;31.grid.bccs.uib.no;using transient tmpdir
>>>/var/spool/pbs/31.grid.bccs.uib.no
>>>pbs_mom;Job;31.grid.bccs.uib.no;Started, pid = 1750
>>>pbs_mom;Job;31.grid.bccs.uib.no;scan_for_terminated: task 1 terminated, sid 1750
>>>pbs_mom;Job;31.grid.bccs.uib.no;Terminated
>>>pbs_mom;Job;31.grid.bccs.uib.no;Removing transient job directory
>>>/var/spool/pbs/31.grid.bccs.uib.no
>>>pbs_mom;Job;31.grid.bccs.uib.no;Obit sent
>>>
>>>and similarly for job 32.grid.bccs.uib.no submitted by user dteam001.
>>>
>>>For job 33.grid.bccs.uib.no, submitted via Globus, I can see the
>>>following output:
>>>
>>>pbs_mom;Req;req_reject;Reject reply code=15001, aux=0, type=11, from
>>>[log in to unmask]
>>>pbs_mom;Fil;sys_copy;command: /usr/bin/scp -Br
>>>[log in to unmask]:/home/dteam001/.lcgjm/globus-cache-export.VT6VlM/globus-cache-export.VT6VlM.gpg
>>>globus-cache-export.VT6VlM.gpg status=1 (copy request failed), try=1
>>>pbs_mom;Fil;sys_copy;command: /usr/sbin/pbs_rcp -r
>>>[log in to unmask]:/home/dteam001/.lcgjm/globus-cache-export.VT6VlM/globus-cache-export.VT6VlM.gpg
>>>globus-cache-export.VT6VlM.gpg status=1 (copy request failed), try=2
>>>pbs_mom;Fil;sys_copy;command: /usr/bin/scp -Br
>>>[log in to unmask]:/home/dteam001/.lcgjm/globus-cache-export.VT6VlM/globus-cache-export.VT6VlM.gpg
>>>globus-cache-export.VT6VlM.gpg status=1 (copy request failed), try=3
>>>pbs_mom;Fil;sys_copy;command: /usr/sbin/pbs_rcp -r
>>>[log in to unmask]:/home/dteam001/.lcgjm/globus-cache-export.VT6VlM/globus-cache-export.VT6VlM.gpg
>>>globus-cache-export.VT6VlM.gpg status=1 (copy request failed), try=4
>>>pbs_mom;Fil;globus-cache-export.VT6VlM.gpg;Unable to copy file
>>>globus-cache-export.VT6VlM.gpg from
>>>grid.bccs.uib.no:/home/dteam001/.lcgjm/globus-cache-export.VT6VlM/globus-cache-export.VT6VlM.gpg
>>>pbs_mom;Fil;globus-cache-export.VT6VlM.gpg;grid.bccs.uib.no: Connection refused
>>>pbs_mom;Fil;globus-cache-export.VT6VlM.gpg;ssion denied
>>>pbs_mom;Req;del_files;cannot stat globus-cache-export.VT6VlM.gpg
>>>
>>>On the server side (CE), in /var/spool/pbs/server_logs/20051208, I can see
>>>the log for job 31.grid.bccs.uib.no:
>>>
>>>31.grid.bccs.uib.no;enqueuing into default, state 1 hop 1
>>>31.grid.bccs.uib.no;Job Queued at request of [log in to unmask],
>>>owner = [log in to unmask], job name = test-pbs.sh, queue = default
>>>31.grid.bccs.uib.no;Job Modified at request of [log in to unmask]
>>>
>>>31.grid.bccs.uib.no;Job Run at request of [log in to unmask]
>>>31.grid.bccs.uib.no;Exit_status=0 resources_used.cput=00:00:00
>>>resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=00:00:00
>>>31.grid.bccs.uib.no;dequeuing from default, state 5
>>>
>>>The 32.grid.bccs.uib.no job log looks similar, with six 32.grid.bccs.uib.no
>>>statements.
>>>
>>>The 33.grid.bccs.uib.no job log is also similar, with six 33.grid.bccs.uib.no
>>>statements, but afterward there is this error:
>>>33.grid.bccs.uib.no;MOM rejected modify request, error: 15001
>>>req_reject;Reject reply code=15001, aux=0, type=11, from [log in to unmask]
>>>
>>>It looks like an ssh-related problem, but I cannot understand that, since
>>>I am able to ssh/scp -B as user dteam001 from the CE (resp. WN) to the WN
>>>(resp. CE) without a password or passphrase.
>>>
>>>On the WN:
>>>[dteam001@compute-0-0 dteam001]$ scp -B
>>>[log in to unmask]:pbs_sub.o32 junk
>>>[dteam001@compute-0-0 dteam001]$ scp -B junk
>>>[log in to unmask]:pbs_sub.o32
>>>
>>>on the CE:
>>>[dteam001@grid dteam001]$ scp -B [log in to unmask]:pbs_sub.o32 junk
>>>[dteam001@grid dteam001]$ scp -B junk [log in to unmask]:pbs_sub.o32
>>>
>>>If I run a new globus job, I can check on the WN (or on the CE, since
>>>dteam001's $HOME is NFS-exported) that a directory of this type is indeed
>>>created:
>>>[dteam001@compute-0-0 dteam001]$ ls .lcgjm/globus-cache-export.aNNBpH/
>>>cache_export_dir.tar export.3 export.txt stage_in.txt
>>>export.1 export.4 file_cleanup.txt stage_out.txt
>>>export.2 export.5 globus-cache-export.aNNBpH.gpg stdstreams.txt
>>>
>>>So the file
>>>[log in to unmask]:/home/dteam001/.lcgjm/globus-cache-export.aNNBpH/globus-cache-export.aNNBpH.gpg
>>>is there, but for some reason scp and pbs_rcp fail to copy it.
>>>
>>>Has anyone experienced this kind of situation? Any idea what the error
>>>code 15001 means?
>>>
>>>Any help would be appreciated.
>>>
>>>Sincerely,
>>>
>>>Patrick
>>>
>>>+++++++++++++++++++++++++++++++++++++++++++
>>>This Mail Was Scanned By Mail-seCure System
>>>at the Tel-Aviv University CC.
>>>
>>>
>>>
>>>
>>
>>
>