On 3 Mar 2010, at 09:28, Douglas McNab wrote:
> Hi,
>
> I would add that you can have different versions of torque server and client and it will work.
> It just depends how you do the upgrade.
>
We had problems when the MOM version was more recent than the server version.
If maui versions differ between the CEs and the server then diagnose (& friends) may stop working because the shared secret differs. I have a vague recollection that if this is the case when yaim is run, the info system will not set up the max jobs part of the information provider.
> When you upgrade the workers you have to make sure you clean out all the running jobs, including all the temp files used by torque, or else you end up with unreadable job files which can kill the mom. Just doing an upgrade with jobs still running will mean you lose jobs.
>
> The same thing goes for the server. You need to make sure you clean it out properly in terms of the temp files that it keeps about running jobs. Ideally you should purge your system of all running jobs or else you risk segfaults.
>
> It makes it harder to upgrade but it works. We have left the torque client version and maui version on our SL4 machines alone (currently at 2.3.0) and upgraded our Torque server to SL5 and 2.3.6. This all worked for a while. Then our MOMs started to segfault when the server sent them a ping. I have actually been banging on about this since November: http://scotgrid.blogspot.com/2009/11/segfaulting-pbsmoms.html and since then we have been building our own versions of Torque:
> http://scotgrid.blogspot.com/2010/01/pick-torque-any-torque.html . Steve T picked up on this and has rolled our recommended version into the epel repo. This is not available through gLite yet.
>
We haven't seen this problem with our SL4 WNs using 2.3.0 and our SL4 server running 2.3.6 (and SL5 WNs with 2.3.6).
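
The version combinations mentioned in this thread could be sanity-checked with something like the sketch below. It only flags nodes whose MOM is newer than the server, which is the one case that gave us trouble. The function names and node names are made up for illustration, and it assumes plain dotted numeric version strings such as 2.3.6 (no release suffixes); it does not query torque itself.

```python
# Hypothetical helper: flag worker nodes whose MOM version is newer
# than the server version. Names here are illustrative only.

def parse_version(text):
    """Turn a dotted version string such as '2.3.6' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def mom_newer_than_server(mom_version, server_version):
    """True if the MOM version is strictly newer than the server version."""
    return parse_version(mom_version) > parse_version(server_version)


def risky_nodes(server_version, mom_versions):
    """Return the sorted names of nodes whose MOM is newer than the server.

    mom_versions maps a node name to its MOM version string.
    """
    return sorted(
        name
        for name, version in mom_versions.items()
        if mom_newer_than_server(version, server_version)
    )


if __name__ == "__main__":
    # Assumed example layout: SL4 WNs on 2.3.0, SL5 WNs on 2.3.6, server on 2.3.6.
    nodes = {"wn-sl4": "2.3.0", "wn-sl5": "2.3.6"}
    print(risky_nodes("2.3.6", nodes))
```

Comparing tuples of integers rather than raw strings means 2.3.10 is correctly treated as newer than 2.3.6.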
Derek