Sorry, I missed the message just before the one I answered. The RPMs built
by Steve have many limits increased compared to the defaults and are
normally suitable for large/very large configurations.
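
To see exactly which limits were changed, the RPM changelog should list the
patches applied on top of the upstream snapshot, e.g. (package names taken
from the thread below, adjust to what you have installed):

rpm -q --changelog maui-server | less
rpm -q --changelog torque-server | less
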
Michel
--On Wednesday 10 December 2008 09:27 +0100 Michel Jouvin
<[log in to unmask]> wrote:
> Jeff,
>
> I had no time to read the whole thread, but from a quick check: AFAIK,
> this is (or was?) the last snapshot released by ClusterResources, but
> Steve may answer more precisely.
>
> Michel
>
> --On Wednesday 10 December 2008 09:16 +0100 Jeff Templon
> <[log in to unmask]> wrote:
>
>> Hi,
>>
>> are these RPMs the ones I asked for in my previous message?? ;-)
>>
>> JT
>>
>> On 10 Dec 2008, at 08:07, Michel Jouvin wrote:
>>
>>> Yves,
>>>
>>> We have experienced this behaviour too (in our case it was with
>>> reservations, but I suspect it may be the same problem). You could try
>>> the latest snapshot built by Steve Traylen (not officially released as
>>> part of gLite):
>>>
>>> http://eticssoft.web.cern.ch/eticssoft/repository/torquemaui/torque/2.3.0-2-2/
>>> http://eticssoft.web.cern.ch/eticssoft/repository/torquemaui/maui/3.2.6p20-10/
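>>>
>>> A rough install sketch, in case it helps (the exact RPM file names in
>>> those directories will differ, so treat the globs as placeholders):
>>>
>>> # download the torque and maui RPMs from the two directories above,
>>> # then upgrade in place and restart the services
>>> rpm -Uvh torque-*.rpm maui-*.rpm
>>> service pbs_server restart
>>> service maui restart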
>>>
>>> We have been running it for a couple of months at GRIF and it solved
>>> all of our problems, with typically 1500-2000 concurrent jobs running.
>>>
>>> Cheers,
>>>
>>> Michel
>>>
>>> --On Wednesday 10 December 2008 07:01 +0100 Yves Kemp <[log in to unmask]> wrote:
>>>
>>>> Hi Ronald,
>>>>
>>>> thanks for the hint!
>>>> Unfortunately, it did not help: the files are recreated after a couple
>>>> of minutes and are only half the size, but the problem persists as
>>>> before, even after waiting for a longer time.
>>>>
>>>> Best
>>>>
>>>> Yves
>>>>
>>>> On 09.12.2008, at 21:20, Ronald Starink wrote:
>>>>
>>>>> Hi Yves,
>>>>>
>>>>> At Nikhef we also see this from time to time. It is caused by an
>>>>> internal Maui table filling up. Our workaround is the following:
>>>>>
>>>>> service maui stop
>>>>> rm /var/spool/maui/maui.ck*
>>>>> service maui start
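>>>>>
>>>>> If you prefer to keep a way back, you can move the checkpoint files
>>>>> aside instead of deleting them, e.g. (untested sketch, assuming the
>>>>> default /var/spool/maui spool directory):
>>>>>
>>>>> service maui stop
>>>>> mkdir -p /var/spool/maui/ck-backup
>>>>> mv /var/spool/maui/maui.ck* /var/spool/maui/ck-backup/
>>>>> service maui start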
>>>>>
>>>>> Cheers,
>>>>> Ronald
>>>>>
>>>>>
>>>>> Yves Kemp wrote:
>>>>>> Dear all,
>>>>>>
>>>>>> we currently have a problem with Maui on one of our batch servers:
>>>>>> "diagnose -f" does not report the stanzas below completely:
>>>>>>
>>>>>> GROUP: 5 unix groups are missing, although they are defined in the
>>>>>> pbs server and in maui.cfg, and are using resources.
>>>>>> QOS: no entry at all (even the "QOS" heading is missing).
>>>>>> CLASS: the same as for QOS.
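>>>>>>
>>>>>> (A quick way to cross-check which GROUPCFG/QOSCFG/CLASSCFG entries
>>>>>> are actually defined, assuming the default spool location:
>>>>>> grep -E '^(GROUPCFG|QOSCFG|CLASSCFG)' /var/spool/maui/maui.cfg )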
>>>>>>
>>>>>> The problem appeared roughly one week ago. A few events might be
>>>>>> correlated: we introduced new hardware shortly before, the system was
>>>>>> under very heavy load (~5000 jobs queued and running), and we had
>>>>>> configured a second CE for some time on a testing basis (now removed,
>>>>>> PBS configuration reverted).
>>>>>>
>>>>>> We run one CE (grid-ce3.desy.de) in front of this batch server. Some
>>>>>> information relevant to the batch server:
>>>>>> glite-apel-pbs-2.0.5-2.noarch
>>>>>> maui-3.2.6p20-snap.1182974819.8.slc4.i386
>>>>>> maui-client-3.2.6p20-snap.1182974819.8.slc4.i386
>>>>>> maui-server-3.2.6p20-snap.1182974819.8.slc4.i386
>>>>>> torque-2.3.0-snap.200801151629.2cri.slc4.i386
>>>>>> torque-client-2.3.0-snap.200801151629.2cri.slc4.i386
>>>>>> torque-mom-2.3.0-snap.200801151629.2cri.slc4.i386
>>>>>> torque-server-2.3.0-snap.200801151629.2cri.slc4.i386
>>>>>> root@grid-batch3: [~] uname -a
>>>>>> Linux grid-batch3.desy.de 2.6.9-78.0.1.ELsmp #1 SMP Tue Aug 5 12:59:28 CDT 2008 i686 i686 i386 GNU/Linux
>>>>>> root@grid-batch3: [~] cat /etc/issue
>>>>>> Scientific Linux SL release 4.4 (Beryllium)
>>>>>> Kernel \r on an \m
>>>>>>
>>>>>> An example output can be found here:
>>>>>> http://www.desy.de/~kemp/diagnose.txt
>>>>>>
>>>>>> I have put the config files here:
>>>>>> http://www.desy.de/~kemp/pbs_server.conf
>>>>>> http://www.desy.de/~kemp/maui.cfg
>>>>>> The fairshare information seems to be there, as shown e.g. in the
>>>>>> file /var/spool/maui/stats/FS.1228780800
>>>>>> (http://www.desy.de/~kemp/FS.1228780800), so we assume that
>>>>>> scheduling is not affected (but we do not really know...).
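>>>>>>
>>>>>> (The number in the FS file name should be the epoch start of the
>>>>>> fairshare window; to confirm the file covers the current interval:
>>>>>> date -u -d @1228780800
>>>>>> which is Tue Dec 9 00:00:00 UTC 2008, i.e. the day before this mail.)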
>>>>>>
>>>>>>
>>>>>> Does anyone have an idea what is going wrong?
>>>>>>
>>>>>> Thanks for any hint!
>>>>>>
>>>>>> Best
>>>>>>
>>>>>> Yves
>>>>>>
>>>>>> # Yves Kemp: [log in to unmask]
>>>>>> # DESY IT 2b/314, Notkestr. 85, D-22607 Hamburg
>>>>>> # FON: +49-(0)-40-8998-2318, FAX: +49-(0)-40-8994-2318
>>>>
>>>
>
*************************************************************
* Michel Jouvin Email : [log in to unmask] *
* LAL / CNRS Tel : +33 1 64468932 *
* B.P. 34 Fax : +33 1 69079404 *
* 91898 Orsay Cedex *
* France *
*************************************************************