Hello
The error you mentioned ("Too many open files") might be related to a
problem with pbs_server <-> munge communication: older versions leaked
file descriptors.
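A quick way to see whether this leak is biting you is to compare
pbs_server's open descriptor count against its limit. A minimal sketch,
assuming a Linux /proc layout (pbs_server is just the example target; the
fd_report helper is mine, not part of torque):

```shell
#!/bin/sh
# Print a process's open-fd count next to its soft "Max open files" limit.
# Assumes Linux: /proc/<pid>/fd and /proc/<pid>/limits exist.
fd_report() {
    pid=$1
    count=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
    limit=$(awk '/Max open files/ {print $4}' "/proc/$pid/limits")
    echo "pid $pid: $count open fds (soft limit $limit)"
}

# e.g. on the batch server:  fd_report "$(pgrep -x pbs_server)"
fd_report $$    # the current shell, as a stand-in
```

If the count creeps steadily toward the limit between restarts, something
is leaking descriptors.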
If you have two CREAM servers, it's worth having a look at
munge-api-patch. It speeds things up significantly and frees the server
from talking to the munge daemon via files and popen.
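And as Sam notes below, before patching anything it is worth confirming
that munged is actually alive on every host involved. A hedged sketch (the
helper functions are mine; munge and unmunge are the standard CLI tools
shipped with munge):

```shell
#!/bin/sh
# Two quick liveness checks for the munge daemon.

# daemon_alive: is a process with this exact name running at all?
daemon_alive() {
    pgrep -x "$1" >/dev/null 2>&1
}

# check_munge: a live munged must round-trip a credential cleanly.
check_munge() {
    if ! daemon_alive munged; then
        echo "munged is not running"
        return 1
    fi
    echo ok | munge | unmunge >/dev/null 2>&1 && echo "munge OK"
}

# Run this on the torque server and on each CREAM CE, e.g.:
#   check_munge
```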
At CYFRONET_LCG2 we are running our custom-built Torque 2.5.12 with:
* munge-api-patch
* patch for pbs_mom segfault in tm_request (significant for many-core jobs)
* torque-memleak-gpgpu-v2.patch (pbs_server memory leak when GPGPUs are
used)
* torque-2.5.10-spread-polls-uniformly.patch (from Eygene Ryabinkin)
* compiled-in with debug symbols
* enabled core dumps for pbs_mom
We have no problems :).
I think we can share our packages; they are based on the EPEL RPMs and
are stable. If you're interested, just let us know.
Cheers
--
Lukasz Flis
On 06.11.2012 10:53, Sam Skipsey wrote:
> In general, the "too many open files" issue seems, experimentally, to be
> caused by not having a munge daemon running where there should be one.
>
> All services that need to authenticate to the batch system need a munge
> service, so it might be worth checking carefully to make sure that none
> of them have died.
>
> Sam
>
>
> On 6 November 2012 09:43, Stephen Jones <[log in to unmask]> wrote:
>
> On 11/05/2012 11:58 AM, Kashif Mohammad wrote:
>
> Since moving to emi2 creamce and torque server,
>
>
> We too have an emi2 creamce and torque server.
>
>
> We have a separate batch server fed by two creamce and around
> 1300 job slots.
>
>
> That's like Liverpool - slots is 952, current jobs is 950.
>
>
> We are using torque-server-2.5.7-7.el5 and munge is enabled.
>
>
> So are we.
>
>
> Pbs_server is crashing periodically and log is full of this kind
> of error
>
> PBS_Server: LOG_ERROR::Too many open files (24) in job_save,
> open for full save
>
> LOG_ERROR::stream_eof, connection to t2wn71.physics.ox.ac.uk
> is bad, remote service may be
> down, message may be corrupt, or connection may have been
> dropped remotely (End of File). setting node state to down
>
>
> We get none of the first error. We get some of the second error
> (if/when the network or node is broken or busy I suspect).
>
>
> I know that there is a bug in torque-server-2.5.7 where it opens
> a lot of munge credential files and doesn't close them properly
> http://www.adaptivecomputing.com/resources/downloads/torque/CHANGELOGS/torque-2.5.10.CHANGELOG
>
>
> I can see it on my torque server as well
>
> lsof -c pbs_server | grep munge | wc -l
> 606
>
> I get this:
>
>
> lsof -c pbs_server | grep munge | wc -l
> 0
>
>
>
> this number reaches up to 2000.
>
>
> That looks like the culprit, alright. So why don't we get it too? Weird.
>
>
> Steve
>
>
> --
> Steve Jones                      [log in to unmask]
> System Administrator             office: 220
> High Energy Physics Division     tel (int): 42334
> Oliver Lodge Laboratory          tel (ext): +44 (0)151 794 2334
> University of Liverpool
> http://www.liv.ac.uk/physics/hep/
>
>