Thanks Kostas and David for the advice.
We already have NFS running on an extended port range, but I'll have a
look into switching it to UDP and/or trying the TCP port recycling
boost.
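Before changing anything, it may help to see how much of the reserved range is actually tied up, and how much of that is sockets sitting in TIME_WAIT (which is what the recycling/reuse knobs affect). Below is a rough Python sketch. It assumes a Linux host, looks at IPv4 only (/proc/net/tcp, not tcp6), and falls back to the whole privileged range 1-1023 if /proc/sys/sunrpc/min_resvport and max_resvport are not present. The function names are made up for illustration.

```python
#!/usr/bin/env python3
"""Rough tally of TCP sockets holding ports in the sunrpc reserved range.

Assumption: Linux host; reads /proc/net/tcp (IPv4 only) and, if the sunrpc
module is loaded, /proc/sys/sunrpc/{min,max}_resvport.
"""
from collections import Counter

# Hex state codes as they appear in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def read_int(path, default):
    """Return the integer stored in a /proc file, or default if unavailable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return default


def parse_proc_net_tcp(lines):
    """Yield (local_port, state_name) for each socket line of /proc/net/tcp."""
    for line in lines[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 4:
            continue
        # local_address looks like "0100007F:0016" (hex IP : hex port)
        local_port = int(fields[1].split(":")[1], 16)
        yield local_port, TCP_STATES.get(fields[3].upper(), fields[3])


def tally(lines, low, high):
    """Count sockets by state whose local port lies in [low, high]."""
    return Counter(state for port, state in parse_proc_net_tcp(lines)
                   if low <= port <= high)


def main():
    try:
        with open("/proc/net/tcp") as f:
            lines = f.read().splitlines()
    except OSError:
        print("no /proc/net/tcp here (not Linux?)")
        return
    low = read_int("/proc/sys/sunrpc/min_resvport", 1)
    high = read_int("/proc/sys/sunrpc/max_resvport", 1023)
    print(f"reserved range checked: {low}-{high}")
    for state, count in tally(lines, low, high).most_common():
        print(f"{state:12} {count}")


if __name__ == "__main__":
    main()
```

Run on the Torque server while the MOMs are busy: a large TIME_WAIT count suggests faster reclaiming would help, while a large ESTABLISHED count means the ports are genuinely in use.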
Switching to another batch system is probably not something we want to
do in the short term, but it's a thought.
Cheers,
Rob
> -----Original Message-----
> From: Testbed Support for GridPP member institutes
> [mailto:[log in to unmask]] On Behalf Of Kostas Georgiou
> Sent: Thursday, August 20, 2009 12:44 PM
> To: [log in to unmask]
> Subject: Re: Torque using up privileged ports
>
> On Thu, Aug 20, 2009 at 11:59:35AM +0100, David
> Ambrose-Griffith wrote:
>
> > Harper, Rob (STFC,RAL,PPD) wrote:
> > > Hi all,
> > >
> > > We have an issue where Torque is using up a lot of ports (as
> > > expected) to talk to the MOMs, but these are all in the
> > > range <1024.
> > > The upshot of this is that at times, other services (we're
> > > specifically noticing this with NFS) are unable to get hold of a
> > > port themselves, and hilarity ensues.
> > >
> > > Googling has yielded limited results, but it does seem that, by
> > > default, PBS uses privileged ports so that the MOMs know that
> > > requests are coming from a root account, using this as a
> > > form of sanity/security check.
> > > Seeing as we could have up to ~1500 job slots available, I imagine
> > > we're going to see issues of this type from time to time.
> > >
> > > Has anyone out there seen this, and possibly even dealt with it?
> > > I'd like to clear the batch system away from that port range, but
> > > haven't yet been able to work out how to approach this.
> > > Any thoughts?
> > >
> > > Cheers,
> > > Rob
> > >
> > >
> > We've seen the same at Durham.
> >
> > We've mitigated it slightly by extending the range of ports that NFS
> > uses, by setting sunrpc.min_resvport to 300 (from 600) in
> > sysctl.conf.
> >
> > This doesn't fix it, just gives NFS a better chance of
> > getting a port.
>
> You can compile Torque with --disable-privports, but this
> allows users to bypass the security and submit jobs as any
> other user, which is probably not what you want.
>
> You can use net.ipv4.tcp_tw_(recycle|reuse) to allow faster
> reclaiming of ports in TIME_WAIT state, but talk to your local TCP
> guru before touching them.
>
> Switch NFS to UDP so it doesn't compete with Torque for open
> ports (assuming that Torque uses TCP). Torque can still run
> out of ports on its own, though.
>
> Switch to a batch system that allows you to use some other
> form of authenticating clients than privileged ports
> (gridengine can use X.509 keys, for example).
>
> Kostas
>