Opened a bug:
https://bugzilla.redhat.com/show_bug.cgi?id=758740
Steve.
On Wed, Nov 30, 2011 at 2:51 PM, Peter Solagna <[log in to unmask]> wrote:
> On 29 November 2011 16:31, Stuart Purdie <[log in to unmask]> wrote:
>> Since the update to Torque 2.5.7, we've had a consistent memory issue on our Torque server.
>>
>> We're not able to delve too deeply into the guts of this, mostly because with 2k job slots and 30k - 50k job slots tools like Valgrind become problematic.
>>
>> However, it looks like this is known about, e.g. http://www.clusterresources.com/bugzilla/show_bug.cgi?id=144 and looks like later versions of Torque help with this; e.g. http://www.clusterresources.com/pipermail/torquedev/2011-November/003886.html
>>
>> So, firstly I note that we are now scheduling nightly restarts of pbs_server (and that this has to be done carefully, because 'service pbs_server restart' doesn't work...), and are considering whether we need to make that more frequent, as it appears to grow to 6GB of memory used over
>>
>> We're not really happy with that sort of workaround - that makes it one of the most unstable services we use, by a long margin... And given the number of Grid services involved, that's an impressive achievement. Therefore:
>>
>> Is this similar to observations at other sites? (i.e. if we're doing something wrong, or something that aggravates the problem, that would be good to know).
>>
>> Finally, assuming that there's nothing obviously wrong that we are doing:
>>
>> Is anyone aware of a planned timescale for getting more recent versions of Torque through the repos? Or whether it is sensible to consider compiling it ourselves?
>
> Hi Stuart,
>
> did you also open a GGUS ticket about this issue? If not, may I ask
> you to open one?
> It seems to be an important issue and should also be tracked there.
>
> Thanks
>
> Cheers
> Peter
>
> --
> Peter Solagna
> EGI.eu - Operations Officer
> email: [log in to unmask]
> skype: peter.solagna.egi
--
Steve Traylen