Hi
Thanks for all your responses. I think the culprit is the older version of Torque (2.5.7), which has many known bugs. We have another, smaller cluster with the same Torque version and we are not seeing any of these issues there, so it seems that the bugs only become apparent once the cluster size crosses some critical number.
I understand that EMI is not supposed to provide Torque RPM packages, but it should state the latest Torque version against which the EMI software has been tested.
>> I think we can share our packages, which are based on the RPMs from EPEL; they are stable. If you're interested just let us know.
Thanks for your offer. I would like to test your RPMs.
Cheers
Kashif
-----Original Message-----
From: LHC Computer Grid - Rollout [mailto:[log in to unmask]] On Behalf Of Lukasz Flis
Sent: 06 November 2012 10:08
To: [log in to unmask]
Subject: Re: [LCG-ROLLOUT] pbs_server instability
Hello
The error you mentioned (Too Many Open Files) may be related to a
problem with pbs_server <-> munge communication: older versions leaked
file descriptors.
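One way to watch for this kind of leak is to count the descriptors pbs_server holds and compare the total against its open-files limit. The sketch below is only an illustration and not part of Torque: it assumes a Linux /proc filesystem, the helper names (count_fds, count_fds_matching, fd_status) are made up for this example, and the 80% warning threshold is an arbitrary choice.

```shell
#!/bin/sh
# Hypothetical helpers for watching descriptor usage; Linux /proc assumed.

# Print the number of open file descriptors of the process with PID $1.
count_fds() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l
}

# Print how many descriptors of PID $1 point at a path containing $2
# (for example "munge"). Prints 0 when nothing matches.
count_fds_matching() {
    ls -l "/proc/$1/fd" 2>/dev/null | grep -c -- "$2" || true
}

# Print OK or WARN given a descriptor count ($1), a limit ($2) and an
# optional warning threshold in percent ($3, default 80; arbitrary).
fd_status() {
    count=$1; limit=$2; pct=${3:-80}
    if [ $((count * 100)) -ge $((limit * pct)) ]; then
        echo "WARN: $count of $limit descriptors in use"
    else
        echo "OK: $count of $limit descriptors in use"
    fi
}
```

For example, `fd_status "$(count_fds "$(pgrep -o -x pbs_server)")" 1024` would report usage against an assumed limit of 1024; the real limit depends on the environment pbs_server was started in (see `ulimit -n`).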
If you have two CREAM servers, it is worth having a look at
munge-api-patch. It speeds things up significantly and frees the server
from the file- and popen-based communication with the munge daemon.
In CYFRONET_LCG2 we are running our custom-built Torque 2.5.12 with:
* munge-api-patch
* patch for pbs_mom segfault in tm_request (significant for many-core jobs)
* torque-memleak-gpgpu-v2.patch (pbs_server memory leak when GPGPUs are
used)
* torque-2.5.10-spread-polls-uniformly.patch (from Eygene Ryabinkin)
* compiled with debug symbols
* core dumps enabled for pbs_mom
We have no problems :).
I think we can share our packages, which are based on the RPMs from EPEL;
they are stable. If you're interested just let us know.
Cheers
--
Lukasz Flis
On 06.11.2012 10:53, Sam Skipsey wrote:
> In general, the "too many open files" issue seems, experimentally, to be
> caused by not having a munge daemon running where there should be one.
>
> All services that need to authenticate to the batch system need a munge
> service, so it might be worth checking carefully to make sure that none
> of them have died.
>
> Sam
>
>
> On 6 November 2012 09:43, Stephen Jones <[log in to unmask]> wrote:
>
> On 11/05/2012 11:58 AM, Kashif Mohammad wrote:
>
> Since moving to emi2 creamce and torque server,
>
>
> We too have an emi2 creamce and torque server.
>
>
> We have a separate batch server fed by two creamce and around
> 1300 job slots.
>
>
> That's like Liverpool - slots is 952, current jobs is 950.
>
>
> We are using torque-server-2.5.7-7.el5 and munge is enabled.
>
>
> So are we.
>
>
> pbs_server is crashing periodically and the log is full of errors
> like these:
>
> PBS_Server: LOG_ERROR::Too many open files (24) in job_save,
> open for full save
>
> LOG_ERROR::stream_eof, connection to t2wn71.physics.ox.ac.uk is bad,
> remote service may be down, message may be corrupt, or connection may
> have been dropped remotely (End of File). setting node state to down
>
>
> We get none of the first error. We get some of the second error
> (if/when the network or node is broken or busy I suspect).
>
>
> I know that there is a bug in torque-server-2.5.7: it opens a lot
> of munge credential files and doesn't close them properly:
> http://www.adaptivecomputing.com/resources/downloads/torque/CHANGELOGS/torque-2.5.10.CHANGELOG
>
>
> I can see it on my torque server as well:
>
> lsof -c pbs_server | grep munge | wc -l
> 606
>
> I get this:
>
>
> lsof -c pbs_server | grep munge | wc -l
> 0
>
>
>
> This number reaches up to 2000.
>
>
> That looks like the culprit, alright. So why don't we get it too? Weird.
>
>
> Steve
>
>
> --
> Steve Jones [log in to unmask]
> System Administrator office: 220
> High Energy Physics Division tel (int): 42334
> Oliver Lodge Laboratory tel (ext): +44 (0)151 794 2334
> University of Liverpool
> http://www.liv.ac.uk/physics/hep/
>
>
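Sam's suggestion in the thread above, checking that munge is alive on every host that authenticates to the batch system, can be sketched as a round trip through the local daemon. This is an illustrative helper only: the function name munge_ok is made up here, and it assumes the standard munge and unmunge command-line tools are installed on the host.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if a credential can be encoded and
# decoded through the local munge daemon.
munge_ok() {
    if ! command -v munge >/dev/null 2>&1 || ! command -v unmunge >/dev/null 2>&1; then
        echo "munge tools not found" >&2
        return 2
    fi
    # "munge -n" encodes a credential with no payload and unmunge decodes
    # it. Both talk to munged, so a failure suggests the daemon is down.
    if munge -n 2>/dev/null | unmunge >/dev/null 2>&1; then
        echo "munge OK"
    else
        echo "munge FAILED" >&2
        return 1
    fi
}
```

Running `munge_ok` on the batch server and on each CREAM CE (for instance from cron) would give a quick signal: a non-zero exit status suggests that the munge daemon on that host needs attention.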