FYI, this dirty hack sorts the problem. On the WN I added:

route add <your-ce-external-ip> gw <your-ce-internal-ip>

Found in section 9.3.1
http://grid-it.cnaf.infn.it/fileadmin/sysadm/siteinstall/siteinstall-2_3_1.html

Peter

Luca Vaccarossa wrote:
> I think that the file:
> /var/spool/pbs/server_name
>
> has to contain the same hostname.
> I think that you have to change your pbs server name, i.e. do as if
> you had the pbs server on a different host.
> Luca
>
> Peter Love wrote:
> > All WNs are on the private network. It seems server_name is irrelevant
> > when determining which host to stagein/out from. How does the pbs_server
> > tell the pbs_mom which host to stage from? Can this be configured?
> >
> > Peter
> >
> > Luca Vaccarossa wrote:
> >
> >>I have a mixed cluster (some WNs have public ip, others have private ip).
> >>Torque server is the CE with its public name, but for private ip WNs I
> >>have to configure it like this:
> >>
> >>$clienthost privateCEhostname
> >>$clienthost publicCEhostname
> >>$clienthost localhost
> >>$restricted *.<domain>
> >>$logevent 255
> >>
> >>How is your configuration?
> >>
> >>Peter Love wrote:
> >>
> >>>Unfortunately this doesn't help. I already have this in /var/spool/pbs/mom_priv/config
> >>>
> >>>$clienthost ce.lancs.pygrid
> >>>$clienthost localhost
> >>>$restricted ce.lancs.pygrid
> >>>$logevent 255
> >>>$ideal_load 1.6
> >>>$max_load 2.1
> >>>
> >>>stagein/out using the public hostname (lunegw.lancs.ac.uk)
> >>>
> >>>Ignore the 'No route to host' error, we have firewalled port 22 on the
> >>>CE public interface. The WN shouldn't use the CE's public interface for
> >>>staging.
> >>>
> >>>
> >>>PBS Job Id: 134.lunegw.lancs.ac.uk
> >>>Job Name: test.sh
> >>>File stage in failed, see below.
> >>>Job will be retried later, please investigate and correct problem.
> >>>Post job file processing error; job 134.lunegw.lancs.ac.uk on host
> >>>test01.lancs.pygrid/0
> >>>
> >>>Unable to copy file 134.lunegw..OU to lunegw.lancs.ac.uk:/home/dteam004/test.sh.o134
> >>>
> >>>>>>error from copy
> >>>
> >>>lunegw.lancs.ac.uk: No route to host
> >>>uk port 22: No route to host
> >>>lost connection
> >>>
> >>>>>>end error output
> >>>
> >>>Output retained on that host in: /var/spool/pbs/undelivered/134.lunegw..OU
> >>>
> >>>
> >>>Why would the WN /var/spool/pbs/server_name contain public_CE_HOSTNAME?
> >>>I confirmed this doesn't affect things.
> >>>
> >>>Peter
> >>>
> >>>
> >>>Luca Vaccarossa wrote:
> >>>
> >>>>Peter Love wrote:
> >>>>
> >>>>>Hi,
> >>>>>
> >>>>>We're setting up a new farm with WNs on a private network, without
> >>>>>shared /home. My question is how to configure torque to specify the CE's
> >>>>>private hostname (ce.lancs.pygrid) when submitting jobs to the WNs. At
> >>>>>the moment the WNs attempt to copy output back to the torque server via
> >>>>>the public hostname of the CE, which I assume is found using 'hostname
> >>>>>-f' at the time qsub is run.
> >>>>>
> >>>>>All the public/private keys are in order, copying from WNs to
> >>>>>ce.lancs.pygrid works fine.
> >>>>>
> >>>>>The WN /var/spool/pbs/server_name file contains 'ce.lancs.pygrid'.
> >>>>>
> >>>>>Is this a jobmanager issue? Should the qsub specify the server as
> >>>>>'ce.lancs.pygrid'?
> >>>>>
> >>>>>Besides the brief gocwiki docs, are there any docs around for private
> >>>>>network config probs? Are all sites with private network using NFS
> >>>>>shared /home?
> >>>>>
> >>>>>Peter
> >>>>
> >>>>You have to put the CE's private hostname first in the file
> >>>>/var/spool/pbs/mom_priv/config
> >>>>
> >>>>
> >>>>$clienthost ce.lancs.pygrid
> >>>>$clienthost public_CE_HOSTNAME
> >>>>
> >>>>
> >>>>on your WNs on a private network.
> >>>>In file /var/spool/pbs/server_name I have the
> >>>>public_CE_HOSTNAME
> >>>>
> >>>>
> >>>>I hope this helps.
> >>>>
> >>>>Luca
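
For reference, the route workaround from the top of this thread can be sketched as a small dry-run helper that only prints the command, so it can be checked before running it as root on a WN. This is a sketch under assumptions: the function name build_route_cmd and both IP addresses are placeholders, and the explicit -host flag is an addition that was not in the original command (it just makes clear that a single-host route is meant).

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): print the host-route command
# for a worker node, without executing it.
#   $1 = CE external (public) IP   -- placeholder value below
#   $2 = CE internal (private) IP  -- placeholder value below
# The -host flag is an assumption added for clarity; the thread's command
# was: route add <your-ce-external-ip> gw <your-ce-internal-ip>
build_route_cmd() {
    printf 'route add -host %s gw %s\n' "$1" "$2"
}

# Example with documentation-only addresses; substitute the real CE IPs.
build_route_cmd 192.0.2.10 10.0.0.1

# On systems with iproute2 instead of net-tools, the equivalent form would be:
#   ip route add 192.0.2.10/32 via 10.0.0.1
```

With this route in place, traffic from the WN to the CE's public address goes over the private interface, so staging to the public hostname no longer hits the firewalled port 22 on the CE's public side.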