Yves,
We experienced the same behaviour (in our case it was with reservations, but
I suspect it may be the same problem). You may want to try the latest
snapshot built by Steve Traylen (not yet officially released as part of gLite):
http://eticssoft.web.cern.ch/eticssoft/repository/torquemaui/torque/2.3.0-2-2/
http://eticssoft.web.cern.ch/eticssoft/repository/torquemaui/maui/3.2.6p20-10/
We have been running it for a couple of months at GRIF, and it has solved
all of our problems, with typically 1500-2000 concurrent jobs running.
Cheers,
Michel
--On Wednesday, 10 December 2008 07:01 +0100 Yves Kemp <[log in to unmask]>
wrote:
> Hi Ronald,
>
> thanks for the hint!
> Unfortunately, it did not help: the files are recreated after a couple
> of minutes at only half their previous size, but the problem persists
> as before, even after waiting for a longer time.
>
> Best
>
> Yves
>
> On 09.12.2008, at 21:20, Ronald Starink wrote:
>
>> Hi Yves,
>>
>> At Nikhef we also see this from time to time. It is caused by an
>> internal Maui table filling up. Our workaround is the following:
>>
>> service maui stop
>> rm /var/spool/maui/maui.ck*
>> service maui start
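If you prefer not to delete the checkpoint files outright, a variant of the
workaround above can move them aside first. This is only a sketch: the
function name and the backup directory naming are illustrative, and it
assumes the default /var/spool/maui spool path.

```shell
#!/bin/sh
# Back up the Maui checkpoint files instead of deleting them, so they
# can be inspected or restored later. Default spool path is an assumption.
backup_maui_checkpoints() {
    spool="${1:-/var/spool/maui}"
    backup="$spool/ck-backup-$(date +%Y%m%d%H%M%S)"
    mkdir -p "$backup"
    # Move only checkpoint files that actually exist (the glob stays
    # literal when nothing matches).
    for f in "$spool"/maui.ck*; do
        [ -e "$f" ] && mv "$f" "$backup/"
    done
    echo "$backup"
}

# Typical use (as root):
#   service maui stop
#   backup_maui_checkpoints
#   service maui start
```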
>>
>> Cheers,
>> Ronald
>>
>>
>> Yves Kemp wrote:
>>> Dear all,
>>>
>>> we currently have a problem with Maui on one of our batch servers:
>>> the output of "diagnose -f" is incomplete starting at the stanzas below:
>>> GROUP
>>> (5 unix groups are missing, although they are defined in the pbs server
>>> and maui.cfg, and are using resources)
>>> QOS
>>> no entry at all (even the "QOS" header is missing)
>>> CLASS
>>> the same as for QOS
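To make the symptom easier to spot, missing stanza headers in a saved
"diagnose -f" output can be flagged with a small check like the one below.
This is a sketch: the function name is illustrative, and the list of
expected stanzas (USER, GROUP, ACCOUNT, QOS, CLASS) is an assumption to be
adjusted to your own configuration.

```shell
#!/bin/sh
# Report expected stanza headers that are absent from a saved
# "diagnose -f" output file; returns non-zero if any are missing.
check_diagnose_stanzas() {
    out="$1"
    missing=0
    for stanza in USER GROUP ACCOUNT QOS CLASS; do
        if ! grep -q "^${stanza}" "$out"; then
            echo "missing stanza: $stanza"
            missing=1
        fi
    done
    return $missing
}

# Typical use:
#   diagnose -f > /tmp/diag.txt
#   check_diagnose_stanzas /tmp/diag.txt
```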
>>>
>>> The problem appeared roughly one week ago. Two events might be
>>> correlated: we introduced new hardware shortly before, while the system
>>> was under very heavy load (~5000 jobs queued and running), and we had
>>> configured a second CE for some time on a testing basis (now removed,
>>> with the PBS configuration reverted).
>>>
>>> We run one CE (grid-ce3.desy.de) in front of this batch server. Some
>>> information relevant to the batch server:
>>> glite-apel-pbs-2.0.5-2.noarch
>>> maui-3.2.6p20-snap.1182974819.8.slc4.i386
>>> maui-client-3.2.6p20-snap.1182974819.8.slc4.i386
>>> maui-server-3.2.6p20-snap.1182974819.8.slc4.i386
>>> torque-2.3.0-snap.200801151629.2cri.slc4.i386
>>> torque-client-2.3.0-snap.200801151629.2cri.slc4.i386
>>> torque-mom-2.3.0-snap.200801151629.2cri.slc4.i386
>>> torque-server-2.3.0-snap.200801151629.2cri.slc4.i386
>>> root@grid-batch3: [~] uname -a
>>> Linux grid-batch3.desy.de 2.6.9-78.0.1.ELsmp #1 SMP Tue Aug 5
>>> 12:59:28
>>> CDT 2008 i686 i686 i386 GNU/Linux
>>> root@grid-batch3: [~] cat /etc/issue
>>> Scientific Linux SL release 4.4 (Beryllium)
>>> Kernel \r on an \m
>>>
>>> An example output can be found here:
>>> http://www.desy.de/~kemp/diagnose.txt
>>>
>>> I have put the config files here:
>>> http://www.desy.de/~kemp/pbs_server.conf
>>> http://www.desy.de/~kemp/maui.cfg
>>> The fairshare information seems to be there, as shown e.g. in the
>>> file /var/spool/maui/stats/FS.1228780800
>>> http://www.desy.de/~kemp/FS.1228780800
>>> so we assume that scheduling is not affected (though we cannot be
>>> sure).
>>>
>>>
>>> Does anyone have an idea what is going wrong?
>>>
>>> Thanks for any hint!
>>>
>>> Best
>>>
>>> Yves
>>>
>>> # Yves Kemp: [log in to unmask]
>>> # DESY IT 2b/314, Notkestr. 85, D-22607 Hamburg
>>> # FON: +49-(0)-40-8998-2318, FAX: +49-(0)-40-8994-2318
>
*************************************************************
* Michel Jouvin Email : [log in to unmask] *
* LAL / CNRS Tel : +33 1 64468932 *
* B.P. 34 Fax : +33 1 69079404 *
* 91898 Orsay Cedex *
* France *
*************************************************************