Hi Derek,

On 3 March 2010 09:57, Derek Ross wrote:

On 3 Mar 2010, at 09:28, Douglas McNab wrote:

> Hi,
>
> I would add that you can run different versions of the Torque server and client and it will work.
> It just depends how you do the upgrade.
>

We had problems when the MOM version was more recent than the server version.

If Maui versions differ between the CEs and the server, then diagnose (and friends) may stop working because the shared secret differs. I have a vague recollection that if this is the case when yaim is run, the info system will not set up the max jobs part of the information provider.

Yes, this is correct, but you can work out the secret and easily patch the Maui clients. To be fair, though, it is not an out-of-the-box solution.


> When you upgrade the workers you have to make sure you clean out all the running jobs, including all the temp files used by Torque, or else you end up with unreadable job files which can kill the MOM. Upgrading with jobs still running means you will lose those jobs.
>
> The same thing goes for the server. You need to make sure you clean it out properly in terms of the temp files that it keeps about running jobs. Ideally you should purge your system of all running jobs, or else you risk segfaults.
>
> It makes it harder to upgrade, but it works. We have left the Torque client and Maui versions on our SL4 machines alone, currently at 2.3.0, and upgraded our Torque server to 2.3.6 on SL5. This all worked for a while. Then our MOMs started to segfault whenever the server sent them a ping. I have actually been banging on about this since November: http://scotgrid.blogspot.com/2009/11/segfaulting-pbsmoms.html and since then we have been building our own versions of Torque:
> http://scotgrid.blogspot.com/2010/01/pick-torque-any-torque.html . Steve T picked up on this and has rolled our recommended version into the EPEL repo. This is not available through gLite yet.
>

We haven't seen this problem with our SL4 WNs using 2.3.0 and our SL4 server running 2.3.6 (and SL5 WNs with 2.3.6).

That's interesting, as a few sites, including Melbourne and another in the UK, have seen this issue. We had to put it down to some race condition, as even after running it in debug mode and reading the source we were still not able to debug it. So it sounds like it might not affect all sites, and may be a function of site configuration/install. Something for sites to be aware of.




Derek

Dug

--
ScotGrid, Room 481, Kelvin Building, University of Glasgow
tel: +44(0)141 330 6439