On Wed, Dec 10, 2008 at 10:45 AM, Andreas Gellrich <[log in to unmask]> wrote:
> Hi Steve,
> may we try the new Maui RPMs? Where are they?

Andreas,

Follow the names in the Savannah patch and grab them from this link:

http://eticssoft.web.cern.ch/eticssoft/repository/torquemaui

The current versions in the patch are Torque 2.3.5-1 and Maui 3.2.6p21-2.

Note that these have had very little testing, which is exactly why they are in certification. That said, if you do try them I'm happy to have feedback.

Steve

>
> Thanx
> Andreas
>
> On Wed, 10 Dec 2008, Steve Traylen wrote:
>
>> On Wed, Dec 10, 2008 at 9:29 AM, Michel Jouvin <[log in to unmask]>
>> wrote:
>>>
>>> Sorry, I missed the message just before the one I answered. The RPMs
>>> built by Steve have many limits increased compared to the default ones
>>> and are normally suitable for large/very large configurations.
>>
>> The increased limits are described here:
>>
>> https://savannah.cern.ch/bugs/?33484
>>
>> and are in this patch going through certification:
>>
>> https://savannah.cern.ch/patch/?2517
>>
>> Steve
>>>
>>> Michel
>>>
>>> --On Wednesday 10 December 2008 09:27 +0100 Michel Jouvin
>>> <[log in to unmask]> wrote:
>>>
>>>> Jeff,
>>>>
>>>> I had no time to read the whole thread, but just checked! AFAIK, this
>>>> is (this was?) the last snapshot released by ClusterResources, but
>>>> Steve may answer more precisely.
>>>>
>>>> Michel
>>>>
>>>> --On Wednesday 10 December 2008 09:16 +0100 Jeff Templon
>>>> <[log in to unmask]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> are these RPMs the ones I asked for in my previous message?? ;-)
>>>>>
>>>>> JT
>>>>>
>>>>> On 10 Dec 2008, at 08:07, Michel Jouvin wrote:
>>>>>
>>>>>> Yves,
>>>>>>
>>>>>> We experienced such a behaviour (it was with reservations, but I
>>>>>> suspect it may be the same problem).
>>>>>> You may give the latest
>>>>>> snapshot built by Steve Traylen a try (it is not officially released
>>>>>> as part of gLite):
>>>>>>
>>>>>> http://eticssoft.web.cern.ch/eticssoft/repository/torquemaui/torque/2.3.0-2-2/
>>>>>> http://eticssoft.web.cern.ch/eticssoft/repository/torquemaui/maui/3.2.6p20-10/
>>>>>>
>>>>>> We have run it for a couple of months at GRIF and it solved all of
>>>>>> our problems, with typically 1500-2000 concurrent jobs running.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Michel
>>>>>>
>>>>>> --On Wednesday 10 December 2008 07:01 +0100 Yves Kemp
>>>>>> <[log in to unmask]> wrote:
>>>>>>
>>>>>>> Hi Ronald,
>>>>>>>
>>>>>>> thanks for the hint!
>>>>>>> Unfortunately, it did not help: the files are recreated after a
>>>>>>> couple of minutes and have only half the size, but the problem
>>>>>>> still persists as before, even after waiting for a longer time.
>>>>>>>
>>>>>>> Best
>>>>>>>
>>>>>>> Yves
>>>>>>>
>>>>>>> On 09.12.2008, at 21:20, Ronald Starink wrote:
>>>>>>>
>>>>>>>> Hi Yves,
>>>>>>>>
>>>>>>>> At Nikhef we also see this from time to time. It is caused by an
>>>>>>>> internal Maui table getting full. Our workaround is the following:
>>>>>>>>
>>>>>>>> service maui stop
>>>>>>>> rm /var/spool/maui/maui.ck*
>>>>>>>> service maui start
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Ronald
>>>>>>>>
>>>>>>>>
>>>>>>>> Yves Kemp wrote:
>>>>>>>>>
>>>>>>>>> Dear all,
>>>>>>>>>
>>>>>>>>> we currently have a problem with Maui on one of our batch
>>>>>>>>> servers: diagnose -f does not completely report the stanzas
>>>>>>>>> below.
>>>>>>>>>
>>>>>>>>> GROUP
>>>>>>>>> Five Unix groups are missing, although they are defined in the
>>>>>>>>> PBS server and maui.cfg and are using resources.
>>>>>>>>>
>>>>>>>>> QOS
>>>>>>>>> No entry at all (even "QOS" is missing).
>>>>>>>>>
>>>>>>>>> CLASS
>>>>>>>>> The same as for QOS.
>>>>>>>>>
>>>>>>>>> The problem appeared roughly one week ago.
>>>>>>>>> Two events might be correlated: we introduced new hardware
>>>>>>>>> shortly before, the system was under very heavy load (~5000 jobs
>>>>>>>>> in queue and running), and we configured a second CE for some
>>>>>>>>> time on a testing basis (now removed, PBS configuration
>>>>>>>>> reverted).
>>>>>>>>>
>>>>>>>>> We run one CE (grid-ce3.desy.de) in front of this batch server.
>>>>>>>>> Some information relevant to the batch server:
>>>>>>>>>
>>>>>>>>> glite-apel-pbs-2.0.5-2.noarch
>>>>>>>>> maui-3.2.6p20-snap.1182974819.8.slc4.i386
>>>>>>>>> maui-client-3.2.6p20-snap.1182974819.8.slc4.i386
>>>>>>>>> maui-server-3.2.6p20-snap.1182974819.8.slc4.i386
>>>>>>>>> torque-2.3.0-snap.200801151629.2cri.slc4.i386
>>>>>>>>> torque-client-2.3.0-snap.200801151629.2cri.slc4.i386
>>>>>>>>> torque-mom-2.3.0-snap.200801151629.2cri.slc4.i386
>>>>>>>>> torque-server-2.3.0-snap.200801151629.2cri.slc4.i386
>>>>>>>>>
>>>>>>>>> root@grid-batch3: [~] uname -a
>>>>>>>>> Linux grid-batch3.desy.de 2.6.9-78.0.1.ELsmp #1 SMP Tue Aug 5
>>>>>>>>> 12:59:28 CDT 2008 i686 i686 i386 GNU/Linux
>>>>>>>>> root@grid-batch3: [~] cat /etc/issue
>>>>>>>>> Scientific Linux SL release 4.4 (Beryllium)
>>>>>>>>> Kernel \r on an \m
>>>>>>>>>
>>>>>>>>> An example output can be found here:
>>>>>>>>> http://www.desy.de/~kemp/diagnose.txt
>>>>>>>>>
>>>>>>>>> I have put the config files here:
>>>>>>>>> http://www.desy.de/~kemp/pbs_server.conf
>>>>>>>>> http://www.desy.de/~kemp/maui.cfg
>>>>>>>>>
>>>>>>>>> The information about the fairshare seems to be there, as shown
>>>>>>>>> e.g. in the file /var/spool/maui/stats/FS.1228780800
>>>>>>>>> http://www.desy.de/~kemp/FS.1228780800
>>>>>>>>> so we assume that scheduling is not affected (but we do not
>>>>>>>>> really know...).
>>>>>>>>>
>>>>>>>>> Does anyone have an idea what is going wrong?
>>>>>>>>>
>>>>>>>>> Thanks for any hint!
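[Editor's note: Ronald's Nikhef workaround further up in the thread can be wrapped in a small script. This is only a sketch: the spool path /var/spool/maui is the default taken from his mail, and the DRY_RUN guard is my addition so the commands can be previewed before touching a production scheduler.]

```shell
#!/bin/sh
# Sketch of the Nikhef workaround: stop Maui, clear its checkpoint
# files (maui.ck*), and restart so the internal tables are rebuilt.
# DRY_RUN=1 (the default here) only prints the commands instead of
# executing them.
MAUI_SPOOL=${MAUI_SPOOL:-/var/spool/maui}
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

run service maui stop
run rm -f "$MAUI_SPOOL"/maui.ck*   # checkpoint files; Maui recreates them
run service maui start
```

Run it once with the default DRY_RUN=1 to see what would happen, then with DRY_RUN=0 on the batch server itself. Note that removing the checkpoint files also discards any state Maui keeps in them, so this is a workaround rather than a fix.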
>>>>>>>>>
>>>>>>>>> Best
>>>>>>>>>
>>>>>>>>> Yves
>>>>>>>>>
>>>>>>>>> # Yves Kemp: [log in to unmask]
>>>>>>>>> # DESY IT 2b/314, Notkestr. 85, D-22607 Hamburg
>>>>>>>>> # FON: +49-(0)-40-8998-2318, FAX: +49-(0)-40-8994-2318
>>>>>>
>>>>>> *************************************************************
>>>>>> * Michel Jouvin      Email : [log in to unmask]             *
>>>>>> * LAL / CNRS         Tel : +33 1 64468932                   *
>>>>>> * B.P. 34            Fax : +33 1 69079404                   *
>>>>>> * 91898 Orsay Cedex                                         *
>>>>>> * France                                                    *
>>>>>> *************************************************************
>>
>> --
>> Steve Traylen
>
> ----
> Andreas Gellrich <[log in to unmask]>
> DESY IT / Grid Computing
> http://www.desy.de/~gellrich

--
Steve Traylen
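[Editor's note: for anyone comparing against Yves's symptom, a quick sanity check on saved `diagnose -f` output can report which of the GROUP, QOS and CLASS stanzas are missing. This is a sketch, not from the thread: the function name, the output file path, and the assumption that each stanza header starts a line are mine.]

```shell
#!/bin/sh
# Check saved `diagnose -f` output for the stanza headers that were
# missing in Yves's report. Save the output first, e.g.
#   diagnose -f > /tmp/diagnose.out
# then call: check_stanzas /tmp/diagnose.out
check_stanzas() {
    file=$1
    missing=""
    for stanza in GROUP QOS CLASS; do
        # Assume each stanza header appears at the start of a line.
        grep -q "^$stanza" "$file" || missing="$missing $stanza"
    done
    if [ -n "$missing" ]; then
        echo "missing stanzas:$missing"
        return 1
    fi
    echo "all stanzas present"
}
```

The non-zero return code makes it easy to drop into a cron job or Nagios-style probe so the condition is caught before users notice scheduling oddities.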