FYI, this dirty hack sorts the problem.

On the WN I added:
route add <your-ce-external-ip> gw <your-ce-internal-ip>

Found in section 9.3.1
http://grid-it.cnaf.infn.it/fileadmin/sysadm/siteinstall/siteinstall-2_3_1.html
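
For anyone repeating this, here is a minimal sketch of the same host route, assuming the net-tools `route` command on the WN. The variable names and the two example addresses are placeholders, not values from this thread. The script only prints the command so it can be checked before being run as root. `-host` just makes explicit that the route targets a single address.

```shell
#!/bin/sh
# Sketch only: CE_EXTERNAL_IP and CE_INTERNAL_IP are placeholder names, and
# the defaults below are example addresses, not the real Lancaster ones.
CE_EXTERNAL_IP="${CE_EXTERNAL_IP:-192.0.2.10}"
CE_INTERNAL_IP="${CE_INTERNAL_IP:-10.0.0.1}"

# Build the host-route command as text instead of executing it.
route_cmd() {
    echo "route add -host $1 gw $2"
}

# Print the command for review; run the printed line as root on the WN.
route_cmd "$CE_EXTERNAL_IP" "$CE_INTERNAL_IP"
```

With this route in place, traffic from the WN to the CE's public address should go via the CE's private interface, which is presumably why the stage-out copy to the public hostname then succeeds. A route added this way does not survive a reboot unless it is also put in the network startup configuration.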

Peter

Luca Vaccarossa ([log in to unmask]) wrote:
> I think that the file:
>   /var/spool/pbs/server_name
>
> has to contain the same hostname.
> I think that you have to change your pbs server name, i.e. set things up
> as if you had the pbs server on a different host.
> Luca
>
>
>
> Peter Love wrote:
> > All WNs are on the private network. It seems server_name is irrelevant
> > when determining which host to stagein/out from. How does the pbs_server
> > tell the pbs_mom which host to stage from? Can this be configured?
> >
> > Peter
> >
> >
> > Luca Vaccarossa ([log in to unmask]) wrote:
> >
> >>I have a mixed cluster (some WNs have public IPs, others have private IPs).
> >>The Torque server is the CE with its public name, but for the private-IP
> >>WNs I have to configure mom_priv/config like this:
> >>
> >>
> >>$clienthost privateCEhostname
> >>$clienthost publicCEhostname
> >>$clienthost localhost
> >>$restricted *.<domain>
> >>$logevent 255
> >>
> >>
> >>What does your configuration look like?
> >>
> >>Peter Love wrote:
> >>
> >>>Unfortunately this doesn't help. I already have this in /var/spool/pbs/mom_priv/config:
> >>>
> >>>$clienthost ce.lancs.pygrid
> >>>$clienthost localhost
> >>>$restricted ce.lancs.pygrid
> >>>$logevent 255
> >>>$ideal_load 1.6
> >>>$max_load 2.1
> >>>
> >>>Stagein/out still uses the public hostname (lunegw.lancs.ac.uk).
> >>>
> >>>Ignore the 'No route to host' error, we have firewalled port 22 on the
> >>>CE public interface. The WN shouldn't use the CE's public interface for
> >>>staging.
> >>>
> >>>
> >>>PBS Job Id: 134.lunegw.lancs.ac.uk
> >>>Job Name:   test.sh
> >>>File stage in failed, see below.
> >>>Job will be retried later, please investigate and correct problem.
> >>>Post job file processing error; job 134.lunegw.lancs.ac.uk on host
> >>>test01.lancs.pygrid/0
> >>>
> >>>Unable to copy file 134.lunegw..OU to lunegw.lancs.ac.uk:/home/dteam004/test.sh.o134
> >>>
> >>>
> >>>>>>error from copy
> >>>
> >>>lunegw.lancs.ac.uk: No route to host
> >>>uk port 22: No route to host
> >>>lost connection
> >>>
> >>>
> >>>>>>end error output
> >>>
> >>>Output retained on that host in: /var/spool/pbs/undelivered/134.lunegw..OU
> >>>
> >>>
> >>>Why would the WN /var/spool/pbs/server_name contain public_CE_HOSTNAME?
> >>>I confirmed this doesn't affect things.
> >>>
> >>>Peter
> >>>
> >>>
> >>>Luca Vaccarossa ([log in to unmask]) wrote:
> >>>
> >>>
> >>>>Peter Love wrote:
> >>>>
> >>>>
> >>>>>Hi,
> >>>>>
> >>>>>We're setting up a new farm with WNs on a private network, without
> >>>>>shared /home. My question is how to configure torque to specify the CE's
> >>>>>private hostname (ce.lancs.pygrid) when submitting jobs to the WNs. At
> >>>>>the moment the WNs attempt to copy output back to the torque server via
> >>>>>the public hostname of the CE, which I assume is found using 'hostname
> >>>>>-f' at the time qsub is run.
> >>>>>
> >>>>>All the public/private keys are in order, copying from WNs to
> >>>>>ce.lancs.pygrid works fine.
> >>>>>
> >>>>>The WN /var/spool/pbs/server_name file contains 'ce.lancs.pygrid'.
> >>>>>
> >>>>>Is this a jobmanager issue? Should the qsub specify the server as
> >>>>>'ce.lancs.pygrid'?
> >>>>>
> >>>>>Besides the brief gocwiki docs, are there any docs around for private
> >>>>>network config problems? Are all sites with private networks using
> >>>>>NFS-shared /home?
> >>>>>
> >>>>>Peter
> >>>>
> >>>>You have to put the CE's private hostname first in the file
> >>>>/var/spool/pbs/mom_priv/config
> >>>>
> >>>>
> >>>>$clienthost ce.lancs.pygrid
> >>>>$clienthost public_CE_HOSTNAME
> >>>>
> >>>>
> >>>>
> >>>>on your WNs on a private network.
> >>>>In file /var/spool/pbs/server_name I have the
> >>>>public_CE_HOSTNAME
> >>>>
> >>>>
> >>>>I hope this helps.
> >>>>
> >>>>Luca