On Wed, Dec 10, 2008 at 9:29 AM, Michel Jouvin <[log in to unmask]> wrote:
> Sorry, I missed the message just before the one I answered. The RPMs built
> by Steve have many limits increased compared to the default ones and are
> normally suitable for large/very large configurations.

The increased limits are described here:

  https://savannah.cern.ch/bugs/?33484

and are in this patch going through certification:

  https://savannah.cern.ch/patch/?2517

Steve

> Michel
>
> --On Wednesday, 10 December 2008 09:27 +0100 Michel Jouvin
> <[log in to unmask]> wrote:
>
>> Jeff,
>>
>> I had no time to read the whole thread, but just check! AFAIK, this is
>> (this was?) the last snapshot released by Cluster Resources, but Steve
>> may answer more precisely.
>>
>> Michel
>>
>> --On Wednesday, 10 December 2008 09:16 +0100 Jeff Templon
>> <[log in to unmask]> wrote:
>>
>>> Hi,
>>>
>>> are these RPMs the ones I asked for in my previous message?? ;-)
>>>
>>> JT
>>>
>>> On 10 Dec 2008, at 08:07, Michel Jouvin wrote:
>>>
>>>> Yves,
>>>>
>>>> We experienced such a behaviour (it was with reservations, but I
>>>> suspect it may be the same problem). You may give a try to the last
>>>> snapshot built by Steve Traylen (but not officially released as part
>>>> of gLite):
>>>>
>>>> http://eticssoft.web.cern.ch/eticssoft/repository/torquemaui/torque/2.3.0-2-2/
>>>> http://eticssoft.web.cern.ch/eticssoft/repository/torquemaui/maui/3.2.6p20-10/
>>>>
>>>> We have run it for a couple of months at GRIF and it solved all of our
>>>> problems, with typically 1500-2000 concurrent jobs running.
>>>>
>>>> Cheers,
>>>>
>>>> Michel
>>>>
>>>> --On Wednesday, 10 December 2008 07:01 +0100 Yves Kemp
>>>> <[log in to unmask]> wrote:
>>>>
>>>>> Hi Ronald,
>>>>>
>>>>> thanks for the hint!
>>>>> Unfortunately, it did not help: the files are recreated after a
>>>>> couple of minutes and have only half the size, but the problem
>>>>> still persists as before, even after waiting for a longer time.
>>>>>
>>>>> Best
>>>>>
>>>>> Yves
>>>>>
>>>>> On 09.12.2008, at 21:20, Ronald Starink wrote:
>>>>>
>>>>>> Hi Yves,
>>>>>>
>>>>>> At Nikhef we also see this from time to time. It is caused by an
>>>>>> internal table for Maui getting full. Our workaround is the
>>>>>> following:
>>>>>>
>>>>>>   service maui stop
>>>>>>   rm /var/spool/maui/maui.ck*
>>>>>>   service maui start
>>>>>>
>>>>>> Cheers,
>>>>>> Ronald
>>>>>>
>>>>>>
>>>>>> Yves Kemp wrote:
>>>>>>>
>>>>>>> Dear all,
>>>>>>>
>>>>>>> we currently have a problem with Maui on one of our batch servers:
>>>>>>> diagnose -f does not completely report the stanzas below:
>>>>>>>
>>>>>>> GROUP
>>>>>>>   Five Unix groups are missing, although they are defined in the
>>>>>>>   PBS server and maui.cfg, and are using resources.
>>>>>>> QOS
>>>>>>>   No entry at all (even the "QOS" header is missing).
>>>>>>> CLASS
>>>>>>>   The same as for QOS.
>>>>>>>
>>>>>>> The problem appeared roughly one week ago. Two events might be
>>>>>>> correlated: we introduced new hardware shortly before, the system
>>>>>>> was under very heavy load (~5000 jobs queued and running), and we
>>>>>>> configured a second CE for some time on a testing basis (now
>>>>>>> removed, PBS configuration reverted).
>>>>>>>
>>>>>>> We run one CE (grid-ce3.desy.de) in front of this batch server.
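Ronald's three-command workaround can be wrapped in a small script. This is a sketch, not part of the thread: the `maui_reset` function name and the dry-run mode are illustrative additions, while the service name and checkpoint path are exactly those given in the mail.

```shell
# Sketch of the Nikhef workaround from this thread: Maui's internal
# table fills up, so stop the scheduler, clear its checkpoint files,
# and restart. The maui_reset name and the dry-run mode are additions
# for illustration; the three commands are those quoted in the mail.
maui_reset() {
    mode=$1   # pass "apply" to execute; anything else only prints
    run() {
        if [ "$mode" = "apply" ]; then
            "$@"
        else
            echo "would run: $*"
        fi
    }
    run service maui stop
    run rm -f /var/spool/maui/maui.ck*
    run service maui start
}

# Dry run first to see what would be done:
maui_reset dry
```

Running it once without `apply` shows the commands before anything is removed; note that, per Yves's follow-up, the checkpoint files are recreated after a few minutes and the symptom may persist.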
>>>>>>> Some information relevant to the batch server:
>>>>>>>
>>>>>>>   glite-apel-pbs-2.0.5-2.noarch
>>>>>>>   maui-3.2.6p20-snap.1182974819.8.slc4.i386
>>>>>>>   maui-client-3.2.6p20-snap.1182974819.8.slc4.i386
>>>>>>>   maui-server-3.2.6p20-snap.1182974819.8.slc4.i386
>>>>>>>   torque-2.3.0-snap.200801151629.2cri.slc4.i386
>>>>>>>   torque-client-2.3.0-snap.200801151629.2cri.slc4.i386
>>>>>>>   torque-mom-2.3.0-snap.200801151629.2cri.slc4.i386
>>>>>>>   torque-server-2.3.0-snap.200801151629.2cri.slc4.i386
>>>>>>>
>>>>>>>   root@grid-batch3: [~] uname -a
>>>>>>>   Linux grid-batch3.desy.de 2.6.9-78.0.1.ELsmp #1 SMP Tue Aug 5
>>>>>>>   12:59:28 CDT 2008 i686 i686 i386 GNU/Linux
>>>>>>>   root@grid-batch3: [~] cat /etc/issue
>>>>>>>   Scientific Linux SL release 4.4 (Beryllium)
>>>>>>>   Kernel \r on an \m
>>>>>>>
>>>>>>> An example output can be found here:
>>>>>>>   http://www.desy.de/~kemp/diagnose.txt
>>>>>>>
>>>>>>> I have put the config files here:
>>>>>>>   http://www.desy.de/~kemp/pbs_server.conf
>>>>>>>   http://www.desy.de/~kemp/maui.cfg
>>>>>>>
>>>>>>> The information about the fairshare seems to be there, as shown
>>>>>>> e.g. in the file /var/spool/maui/stats/FS.1228780800
>>>>>>>   http://www.desy.de/~kemp/FS.1228780800
>>>>>>> so we assume that scheduling is not affected (but we do not really
>>>>>>> know...).
>>>>>>>
>>>>>>> Does anyone have an idea what is going wrong?
>>>>>>>
>>>>>>> Thanks for any hint!
>>>>>>>
>>>>>>> Best
>>>>>>>
>>>>>>> Yves
>>>>>>>
>>>>>>> # Yves Kemp: [log in to unmask]
>>>>>>> # DESY IT 2b/314, Notkestr. 85, D-22607 Hamburg
>>>>>>> # FON: +49-(0)-40-8998-2318, FAX: +49-(0)-40-8994-2318
>>>>
>>>>
>>>> *************************************************************
>>>> * Michel Jouvin         Email : [log in to unmask]          *
>>>> * LAL / CNRS            Tel : +33 1 64468932                *
>>>> * B.P. 34               Fax : +33 1 69079404                *
>>>> * 91898 Orsay Cedex                                         *
>>>> * France                                                    *
>>>> *************************************************************
>

--
Steve Traylen
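The symptom Yves reports (missing GROUP entries and the QOS/CLASS stanzas absent from `diagnose -f` output) can be spotted mechanically by grepping for the stanza headers. This is a sketch, not part of the thread: the `check_stanzas` name is invented, and it assumes the stanza headers appear at the start of a line, as in Yves's sample output.

```shell
# Hedged sketch: scan `diagnose -f` output for the stanza headers that
# went missing in Yves's report. Reads the output from stdin so it can
# also be exercised against canned text.
check_stanzas() {
    input=$(cat)
    for stanza in GROUP QOS CLASS; do
        if printf '%s\n' "$input" | grep -q "^$stanza"; then
            echo "$stanza: present"
        else
            echo "$stanza: MISSING"
        fi
    done
}
```

On the batch server this would be run as `diagnose -f | check_stanzas`; any `MISSING` line reproduces the symptom described above.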