Hi
Since moving to the EMI-2 CREAM CE and torque server, our cluster has become quite unstable. We have a separate batch server fed by two CREAM CEs, with around 1300 job slots.
We are using torque-server-2.5.7-7.el5 with munge enabled. pbs_server crashes periodically and the log is full of errors like these:
PBS_Server: LOG_ERROR::Too many open files (24) in job_save, open for full save
LOG_ERROR::stream_eof, connection to t2wn71.physics.ox.ac.uk is bad, remote service may be down, message may be corrupt, or connection may have been dropped remotely (End of File). setting node state to down
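The (24) in the first message is errno 24, EMFILE, which means the process has hit its open file limit. A rough way to compare how many descriptors pbs_server holds against that limit is sketched below. It assumes a Linux kernel that exposes /proc/<pid>/limits and a single pbs_server process; the function names are mine and are not part of torque.

```shell
#!/bin/bash
# Print the soft "Max open files" limit from a /proc/<pid>/limits style file.
soft_nofile_limit() {
    awk '/^Max open files/ { print $4 }' "$1"
}

# Count the entries in a /proc/<pid>/fd style directory.
open_fd_count() {
    ls -1 "$1" | wc -l | tr -d ' '
}

# Look up the oldest process named exactly pbs_server (empty if none is running).
pid=$(pgrep -o -x pbs_server 2>/dev/null)
if [ -n "$pid" ]; then
    # Reading /proc/<pid>/fd of another user's process normally needs root.
    echo "pbs_server pid $pid: $(open_fd_count "/proc/$pid/fd") open, soft limit $(soft_nofile_limit "/proc/$pid/limits")"
fi
```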
I know that there is a bug in torque-server-2.5.7 where it opens a lot of munge credential files and doesn't close them properly:
http://www.adaptivecomputing.com/resources/downloads/torque/CHANGELOGS/torque-2.5.10.CHANGELOG
I can see it on my torque server as well:
lsof -c pbs_server | grep munge | wc -l
606
Sometimes this number reaches up to 2000. Apparently this issue has been solved in torque-server-2.5.9, but the rpm is not available through the EPEL repos.
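To see how fast the count grows, the same lsof check can be run from cron and logged once it passes a threshold. This is only a sketch: the function names and the THRESHOLD value are hypothetical, and the threshold would need tuning to the actual open file limit on the server.

```shell
#!/bin/bash
# Count lsof output lines (read from stdin) that mention munge.
count_munge() {
    grep -c munge
}

# Succeed when the count ($1) has reached the threshold ($2).
over_threshold() {
    [ "$1" -ge "$2" ]
}

THRESHOLD=800   # hypothetical value, adjust to the server's open file limit

# Only run the live check where lsof is installed.
if command -v lsof >/dev/null 2>&1; then
    n=$(lsof -c pbs_server 2>/dev/null | count_munge)
    if over_threshold "$n" "$THRESHOLD"; then
        echo "$(date) pbs_server holds $n munge files"
    fi
fi
```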
I was wondering whether others are seeing the same kind of problem and how they have fixed it.
Thanks
Kashif