On Thu, Aug 20, 2009 at 11:59:35AM +0100, David Ambrose-Griffith wrote:
> Harper, Rob (STFC,RAL,PPD) wrote:
> > Hi all,
> >
> > We have an issue where Torque is using up a lot of ports (as expected)
> > to talk to the MOMs, but these are all in the range <1024. The upshot
> > of this is that at times, other services (we're specifically noticing
> > this with NFS) are unable to get hold of a port themselves, and hilarity
> > ensues.
> >
> > Googling has yielded limited results, but it does seem that, by default,
> > PBS uses privileged ports so that the MOMs know that requests are coming
> > from a root account, using this as a form of sanity/security check.
> > Seeing as we could have up to ~1500 job slots available, I imagine we're
> > going to see issues of this type from time to time.
> >
> > Has anyone out there seen this, and possibly even dealt with it? I'd
> > like to clear the batch system away from that port range, but haven't
> > yet been able to work out how to approach this. Any thoughts?
> >
> > Cheers,
> > Rob
> >
> >
> We've seen the same at Durham.
>
> We've mitigated slightly by extending the range of ports that NFS uses,
> by setting sunrpc.min_resvport to 300 (from 600) in sysctl.conf.
>
> This doesn't fix it, just gives NFS a better chance of getting a port.

You can compile torque with --disable-privports, but this allows users to
bypass the security and submit jobs as any other user, which is probably
not what you want.

You can use net.ipv4.tcp_tw_(recycle|reuse) to allow faster reclaiming
of ports in TIME_WAIT state, but talk to your local TCP guru before
touching them.

Switch NFS to UDP so it doesn't compete with torque for open ports
(assuming that torque uses TCP). Torque can still run out of ports
on its own though.

Switch to a batch system that allows you to use some other form of
authenticating clients than privileged ports (gridengine can use
x509 keys for example).

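
Before changing any of the above, it may help to check how much of the
privileged range is actually tied up and how much of that is sitting in
TIME_WAIT. Below is a rough sketch, not a tested tool: it assumes Linux
and IPv4 TCP only, and parses text in the /proc/net/tcp format. The
function name count_privileged and its limit parameter are made up for
illustration. It counts sockets, not distinct ports, so a listener with
many established connections on one port shows up once per connection.

```python
from collections import Counter

# TCP state codes as they appear in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_privileged(proc_net_tcp_text, limit=1024):
    """Count sockets whose local port is below `limit`, grouped by TCP state."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue  # blank or truncated line
        # local_address looks like "0100007F:0019" (hex IP, then hex port)
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if local_port < limit:
            state = TCP_STATES.get(fields[3], fields[3])
            counts[state] += 1
    return counts


# Example use on a live Linux host (assumption: /proc is mounted):
#     with open("/proc/net/tcp") as f:
#         for state, n in sorted(count_privileged(f.read()).items()):
#             print(state, n)
```

If most of the entries turn out to be TIME_WAIT, the tcp_tw_* settings
are the ones worth discussing; if they are mostly ESTABLISHED, those
settings are unlikely to free up many ports.
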
Kostas