Hi Kashif,
At RAL we are using Torque 2.5.12 with the munge API patch (built ourselves). The old version of Torque in EPEL was far too unreliable for us to use.
I've done some basic testing with Torque 4.1.3 and this seems to work fine with an EMI-2 CREAM CE without any special modifications required (although earlier versions of Torque 4 had problems).
Regards,
Andrew.
________________________________________
From: LHC Computer Grid - Rollout [[log in to unmask]] on behalf of Kashif Mohammad [[log in to unmask]]
Sent: Monday, November 05, 2012 2:30 PM
To: [log in to unmask]
Subject: Re: [LCG-ROLLOUT] pbs_server instability
Hi Andre
Thanks for your quick response. My next question: what is the latest version of Torque that is known to work with the EMI software without breaking anything else?
Thanks
Kashif
-----Original Message-----
From: LHC Computer Grid - Rollout [mailto:[log in to unmask]] On Behalf Of André Gemünd
Sent: 05 November 2012 12:39
To: [log in to unmask]
Subject: Re: [LCG-ROLLOUT] pbs_server instability
Hello Kashif,
The version in EPEL is outdated by now. Steve Traylen (who was the maintainer of the EPEL packages) has written to the list that he will not continue building it.
So we are using a more recent build here.
The Torque spec files were fixed a year or two ago, I think. Since then you can simply run rpmbuild -ta on the source tarball (if you do not need any custom build options), though you might want to check the Torque path setting (/var/spool/torque vs. /var/lib/torque vs. /var/torque etc.).
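As a rough sketch of the path check mentioned above: the helper below only reports which of the three candidate directories already exist on a host, so the new build can be pointed at the same location. The function name find_torque_home and the optional root-prefix argument are made up for illustration, and the tarball name in the comment is a placeholder, not a specific release.

```shell
#!/bin/sh
# Building from the source tarball would look roughly like this
# (placeholder file name, substitute the release you downloaded):
#   rpmbuild -ta torque-X.Y.Z.tar.gz

# Hypothetical helper: print every candidate Torque home directory that
# exists. An optional first argument is prepended as a root prefix, which
# is handy for inspecting a chroot or a mounted image.
find_torque_home() {
    root=${1:-}
    for dir in /var/spool/torque /var/lib/torque /var/torque; do
        if [ -d "$root$dir" ]; then
            echo "$dir"
        fi
    done
}
```

Running find_torque_home with no argument on the existing server shows which layout the current installation uses; no output means none of the three paths is present.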
Greetings
André
----- Original Message -----
> Hi
>
> Since moving to the EMI-2 CREAM CE and Torque server, our cluster has
> become quite unstable. We have a separate batch server fed by two
> CREAM CEs, with around 1300 job slots.
> We are using torque-server-2.5.7-7.el5 and munge is enabled.
> pbs_server is crashing periodically and the log is full of errors
> like these:
>
> PBS_Server: LOG_ERROR::Too many open files (24) in job_save, open for
> full save
>
> LOG_ERROR::stream_eof, connection to t2wn71.physics.ox.ac.uk is bad,
> remote service may be down, message may be corrupt, or connection
> may have been dropped remotely (End of File). setting node state to
> down
>
>
> I know that there is a bug in torque-server-2.5.7 where it opens a lot
> of munge credential files and doesn't close them properly:
> http://www.adaptivecomputing.com/resources/downloads/torque/CHANGELOGS/torque-2.5.10.CHANGELOG
>
> I can see it on my Torque server as well:
>
> lsof -c pbs_server | grep munge | wc -l
> 606
>
> Sometimes this number reaches up to 2000. Apparently this issue has
> been solved in torque-server-2.5.9, but an RPM is not available through
> the EPEL repos.
>
> I was wondering whether others are also seeing the same kind of
> problem and how they have fixed it.
>
> Thanks
> Kashif
>
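The lsof check quoted above can be wrapped into a small watchdog, sketched here under some assumptions: the function names and the threshold value are invented for illustration, and the functions read lsof output on standard input so they can be tried without a running pbs_server. The "(24)" in the logged error is errno 24 (EMFILE), i.e. the process has hit its open-file limit.

```shell
#!/bin/sh
# Count the lines of lsof output (read from stdin) that mention munge.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the
# function usable under "set -e".
count_munge_fds() {
    grep -c munge || true
}

# Hypothetical check: warn and return 1 once the count reaches the given
# threshold (first argument), otherwise report OK and return 0.
check_munge_fds() {
    threshold=$1
    count=$(count_munge_fds)
    if [ "$count" -ge "$threshold" ]; then
        echo "WARNING: pbs_server holds $count munge files open (threshold $threshold)"
        return 1
    fi
    echo "OK: pbs_server holds $count munge files open"
}
```

It could be run from cron as `lsof -c pbs_server | check_munge_fds 500`, where 500 is an arbitrary example value to be set well below the open-file limit of the pbs_server process.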
--
André Gemünd
Fraunhofer-Institute for Algorithms and Scientific Computing
[log in to unmask]
Tel: +49 2241 14-2193
/C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend