On Thu, Mar 31, 2005 at 04:05:00PM +0100 or thereabouts, Peter Love wrote:
> All WNs are on the private network. It seems server_name is irrelevant
> when determining which host to stagein/out from. How does the pbs_server
> tell the pbs_mom which host to stage from? Can this be configured?
The host that files are staged from is the submission host. If you have a split
Gatekeeper and Torque batch server, it is the gatekeeper host that the files are staged from.
Have a look at the man page for qsub, in particular
qsub -e internal.example.org:. script.sh
and see if that does something. It looks like it should set PBS_O_HOST in the job to your
internal name rather than just using `hostname` on the host.
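To see what a submitted job actually recorded, something like the following may help. It is only a sketch: it assumes `qstat -f` prints an `Error_Path = ...` line and a `PBS_O_HOST=...` entry inside `Variable_List`, each on a single unwrapped line, and the function name show_stage_hosts is made up for this example.

```shell
# Read `qstat -f <jobid>` output on stdin and print the two fields
# that name a host involved in staging.
# Assumption: neither value is wrapped across lines by qstat.
show_stage_hosts() {
    sed -n \
        -e 's/^[[:space:]]*Error_Path = \(.*\)$/Error_Path: \1/p' \
        -e 's/.*PBS_O_HOST=\([^,]*\).*/PBS_O_HOST: \1/p'
}

# Typical use (not run here):
#   qstat -f 134.lunegw.lancs.ac.uk | show_stage_hosts
```

Comparing the two lines before and after adding the -e option should show which of them the option changes.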
If you find that some combination of flags works, you can use a job filter script to mangle
the options on the fly.
Get qsub working first.
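A rough sketch of what such a filter could look like, untested against a real batch system. It assumes the filter is handed the job script on stdin and that whatever it writes to stdout is what gets submitted. The function name add_stage_directives is invented here, and internal.example.org is the same placeholder as in the qsub line above.

```shell
#!/bin/sh
# Copy a job script from stdin to stdout, adding "#PBS -o" and "#PBS -e"
# directives that name an explicit host. If the first line is a "#!"
# interpreter line, the directives go directly after it; otherwise they
# go at the very top.
# $1: host name to put in the directives (placeholder value in this sketch)
add_stage_directives() {
    host=$1
    first=1
    while IFS= read -r line || [ -n "$line" ]; do
        if [ "$first" -eq 1 ]; then
            first=0
            case $line in
                '#!'*)
                    printf '%s\n' "$line"
                    printf '#PBS -o %s:.\n#PBS -e %s:.\n' "$host" "$host"
                    continue
                    ;;
                *)
                    printf '#PBS -o %s:.\n#PBS -e %s:.\n' "$host" "$host"
                    ;;
            esac
        fi
        printf '%s\n' "$line"
    done
}

# As a filter, the script would end with a call such as:
#   add_stage_directives internal.example.org
```

Options given on the qsub command line would normally still take precedence over directives in the script, so this only sets a default.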
Generally, though, you want to take advice from someone who has actually done this.
Steve
>
> Peter
>
>
> Luca Vaccarossa wrote:
> > I've a mixed cluster (some WNs have public ip, others have private ip)
> > Torque server is the CE with its public name, but for private ip WNs I
> > have to configure it like this:
> >
> >
> > $clienthost privateCEhostname
> > $clienthost publicCEhostname
> > $clienthost localhost
> > $restricted *.<domain>
> > $logevent 255
> >
> >
> > What does your configuration look like?
> >
> > Peter Love wrote:
> > > Unfortunately this doesn't help. I already have this in /var/spool/pbs/mom_priv/config
> > >
> > > $clienthost ce.lancs.pygrid
> > > $clienthost localhost
> > > $restricted ce.lancs.pygrid
> > > $logevent 255
> > > $ideal_load 1.6
> > > $max_load 2.1
> > >
> > > but stagein/out still uses the public hostname (lunegw.lancs.ac.uk).
> > >
> > > Ignore the 'No route to host' error, we have firewalled port 22 on the
> > > CE public interface. The WN shouldn't use the CE's public interface for
> > > staging.
> > >
> > >
> > > PBS Job Id: 134.lunegw.lancs.ac.uk
> > > Job Name: test.sh
> > > File stage in failed, see below.
> > > Job will be retried later, please investigate and correct problem.
> > > Post job file processing error; job 134.lunegw.lancs.ac.uk on host
> > > test01.lancs.pygrid/0
> > >
> > > Unable to copy file 134.lunegw..OU to lunegw.lancs.ac.uk:/home/dteam004/test.sh.o134
> > >
> > >>>>error from copy
> > >
> > > lunegw.lancs.ac.uk: No route to host
> > > uk port 22: No route to host
> > > lost connection
> > >
> > >>>>end error output
> > >
> > > Output retained on that host in: /var/spool/pbs/undelivered/134.lunegw..OU
> > >
> > >
> > > Why would the WN /var/spool/pbs/server_name contain public_CE_HOSTNAME?
> > > I confirmed this doesn't affect things.
> > >
> > > Peter
> > >
> > >
> > > Luca Vaccarossa wrote:
> > >
> > >>Peter Love wrote:
> > >>
> > >>>Hi,
> > >>>
> > >>>We're setting up a new farm with WNs on a private network, without
> > >>>shared /home. My question is how to configure torque to specify the CE's
> > >>>private hostname (ce.lancs.pygrid) when submitting jobs to the WNs. At
> > >>>the moment the WNs attempt to copy output back to the torque server via
> > >>>the public hostname of the CE, which I assume is found using 'hostname
> > >>>-f' at the time qsub is run.
> > >>>
> > >>>All the public/private keys are in order, copying from WNs to
> > >>>ce.lancs.pygrid works fine.
> > >>>
> > >>>The WN /var/spool/pbs/server_name file contains 'ce.lancs.pygrid'.
> > >>>
> > >>>Is this a jobmanager issue? Should the qsub specify the server as
> > >>>'ce.lancs.pygrid'?
> > >>>
> > >>>Besides the brief gocwiki docs, are there any docs around for private
> > >>>network config probs? Are all sites with private network using NFS
> > >>>shared /home ?
> > >>>
> > >>>Peter
> > >>
> > >>You have to put the CE's private hostname first in the file
> > >>/var/spool/pbs/mom_priv/config
> > >>
> > >>
> > >>$clienthost ce.lancs.pygrid
> > >>$clienthost public_CE_HOSTNAME
> > >>
> > >>
> > >>
> > >>on your WNs on a private network.
> > >>In file /var/spool/pbs/server_name I have the
> > >>public_CE_HOSTNAME
> > >>
> > >>
> > >>I hope this helps.
> > >>
> > >>Luca
--
Steve Traylen
http://www.gridpp.ac.uk/