I think that the file:
  /var/spool/pbs/server_name

has to contain the same hostname.
I think that you have to change your pbs server name, i.e. do as if
you had the pbs server on a different host.
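
A minimal sketch of how to check this, assuming "the same hostname" means
that the copy of server_name on the CE and the copy on a WN should match.
The function name same_server_name and the /tmp/wn_server_name path are
made up for illustration; only /var/spool/pbs/server_name comes from this
thread.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) only if both files hold
# the same non-empty hostname. Whitespace and newlines are ignored.
same_server_name() {
    a=$(tr -d '[:space:]' < "$1")
    b=$(tr -d '[:space:]' < "$2")
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Example use on the CE, after copying the WN's file over by hand
# (the /tmp path is an assumption):
#   same_server_name /var/spool/pbs/server_name /tmp/wn_server_name \
#       && echo "server_name matches" || echo "server_name differs"
```

This only compares the two files; it does not say which hostname is the
right one to put in them.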
Luca



Peter Love wrote:
> All WNs are on the private network. It seems server_name is irrelevant
> when determining which host to stagein/out from. How does the pbs_server
> tell the pbs_mom which host to stage from? Can this be configured?
>
> Peter
>
>
> Luca Vaccarossa wrote:
>
>>I have a mixed cluster (some WNs have public IPs, others have private IPs).
>>The Torque server is the CE with its public name, but for the private-IP WNs I
>>have to configure it like this:
>>
>>
>>$clienthost privateCEhostname
>>$clienthost publicCEhostname
>>$clienthost localhost
>>$restricted *.<domain>
>>$logevent 255
>>
>>
>>What does your configuration look like?
>>
>>Peter Love wrote:
>>
>>>Unfortunately this doesn't help. I already have this in /var/spool/pbs/mom_priv/config:
>>>
>>>$clienthost ce.lancs.pygrid
>>>$clienthost localhost
>>>$restricted ce.lancs.pygrid
>>>$logevent 255
>>>$ideal_load 1.6
>>>$max_load 2.1
>>>
>>>Stagein/out still uses the public hostname (lunegw.lancs.ac.uk).
>>>
>>>Ignore the 'No route to host' error, we have firewalled port 22 on the
>>>CE public interface. The WN shouldn't use the CE's public interface for
>>>staging.
>>>
>>>
>>>PBS Job Id: 134.lunegw.lancs.ac.uk
>>>Job Name:   test.sh
>>>File stage in failed, see below.
>>>Job will be retried later, please investigate and correct problem.
>>>Post job file processing error; job 134.lunegw.lancs.ac.uk on host
>>>test01.lancs.pygrid/0
>>>
>>>Unable to copy file 134.lunegw..OU to lunegw.lancs.ac.uk:/home/dteam004/test.sh.o134
>>>
>>>
>>>>>>error from copy
>>>
>>>lunegw.lancs.ac.uk: No route to host
>>>uk port 22: No route to host
>>>lost connection
>>>
>>>
>>>>>>end error output
>>>
>>>Output retained on that host in: /var/spool/pbs/undelivered/134.lunegw..OU
>>>
>>>
>>>Why would the WN /var/spool/pbs/server_name contain public_CE_HOSTNAME?
>>>I confirmed this doesn't affect things.
>>>
>>>Peter
>>>
>>>
>>>Luca Vaccarossa wrote:
>>>
>>>
>>>>Peter Love wrote:
>>>>
>>>>
>>>>>Hi,
>>>>>
>>>>>We're setting up a new farm with WNs on a private network, without
>>>>>shared /home. My question is how to configure torque to specify the CE's
>>>>>private hostname (ce.lancs.pygrid) when submitting jobs to the WNs. At
>>>>>the moment the WNs attempt to copy output back to the torque server via
>>>>>the public hostname of the CE, which I assume is found using 'hostname
>>>>>-f' at the time qsub is run.
>>>>>
>>>>>All the public/private keys are in order, copying from WNs to
>>>>>ce.lancs.pygrid works fine.
>>>>>
>>>>>The WN /var/spool/pbs/server_name file contains 'ce.lancs.pygrid'.
>>>>>
>>>>>Is this a jobmanager issue? Should the qsub specify the server as
>>>>>'ce.lancs.pygrid'?
>>>>>
>>>>>Besides the brief gocwiki docs, are there any docs around for private
>>>>>network config problems? Are all sites with a private network using an
>>>>>NFS-shared /home?
>>>>>
>>>>>Peter
>>>>
>>>>You have to put the CE's private hostname first in the file
>>>>/var/spool/pbs/mom_priv/config
>>>>
>>>>
>>>>$clienthost ce.lancs.pygrid
>>>>$clienthost public_CE_HOSTNAME
>>>>
>>>>
>>>>
>>>>on your WNs on a private network.
>>>>In file /var/spool/pbs/server_name I have the
>>>>public_CE_HOSTNAME
>>>>
>>>>
>>>>I hope this helps.
>>>>
>>>>Luca