Hi
How many groups, users, and QoS do you have defined in the maui.cfg
file?
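For a quick count, something like this should do (the path is an
assumption; adjust it to wherever your maui.cfg actually lives):

```shell
# Count the USERCFG/GROUPCFG/QOSCFG entries in maui.cfg.
# /var/spool/maui/maui.cfg is an assumed location.
grep -c -E '^(USERCFG|GROUPCFG|QOSCFG)' /var/spool/maui/maui.cfg
```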
Judging from the behavior of this maui.ck file, it collects some sort
of information about every user/group/QoS seen on the system,
including those already defined in maui.cfg. It may be the case that
you have so many users/groups/QoS defined in maui.cfg (or running on
the system?) that you overflow the table immediately.
Just try this: make a backup of your maui.cfg file, then delete about
half of the USERCFG/GROUPCFG lines from the original. Restart maui and
see what happens.
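A sketch of that experiment (the config path is an assumption, and
commenting lines out is a safer variant of deleting them):

```shell
# Sketch of the suggested test. /var/spool/maui/maui.cfg is an
# assumed location -- adjust to your installation.
CFG=/var/spool/maui/maui.cfg

# Keep a backup so the experiment is reversible.
cp "$CFG" "$CFG.bak"

# Comment out roughly half (every second) of the USERCFG/GROUPCFG
# lines instead of deleting them outright.
awk '/^(USERCFG|GROUPCFG)/ && (++n % 2 == 0) { print "#" $0; next } { print }' \
    "$CFG" > "$CFG.tmp" && mv "$CFG.tmp" "$CFG"

# Restart maui and watch whether diagnose -f shows all stanzas again.
service maui restart
```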
There are a lot of these hard-coded limits in maui; I am all for
having the traylenator expand many of the table sizes in the standard
EGEE torque packages (we have problems here with RESERVATIONDEPTH,
which seem to appear once a farm exceeds 500 cores or so) ...
JT
On 10 Dec 2008, at 07:01, Yves Kemp wrote:
> Hi Ronald,
>
> thanks for the hint!
> Unfortunately, it did not help: the files are recreated after a
> couple of minutes and are only half the size, but the problem
> persists as before, even after waiting for a longer time.
>
> Best
>
> Yves
>
> On 09.12.2008, at 21:20, Ronald Starink wrote:
>
>> Hi Yves,
>>
>> At Nikhef we also see this from time to time. It is caused by an
>> internal table for Maui getting full. Our workaround is the
>> following:
>>
>> service maui stop
>> rm /var/spool/maui/maui.ck*
>> service maui start
>>
>> Cheers,
>> Ronald
>>
>>
>> Yves Kemp wrote:
>>> Dear all,
>>>
>>> we currently have a problem with Maui on one of our batch servers:
>>> diagnose -f does not completely report the stanzas below:
>>> GROUP: 5 unix groups are missing, although they are defined in the
>>> pbs server and in maui.cfg, and are using resources.
>>> QOS: no entries at all (even the "QOS" header is missing).
>>> CLASS: the same as for QOS.
>>>
>>> The problem appeared roughly one week ago. Two events might be
>>> correlated: we introduced new hardware shortly before, while the
>>> system was under very heavy load (~5000 jobs queued and running),
>>> and we configured a second CE for some time on a testing basis
>>> (now removed, with the PBS configuration reverted).
>>>
>>> We run one CE (grid-ce3.desy.de) in front of this batch server. Some
>>> information relevant to the batch server:
>>> glite-apel-pbs-2.0.5-2.noarch
>>> maui-3.2.6p20-snap.1182974819.8.slc4.i386
>>> maui-client-3.2.6p20-snap.1182974819.8.slc4.i386
>>> maui-server-3.2.6p20-snap.1182974819.8.slc4.i386
>>> torque-2.3.0-snap.200801151629.2cri.slc4.i386
>>> torque-client-2.3.0-snap.200801151629.2cri.slc4.i386
>>> torque-mom-2.3.0-snap.200801151629.2cri.slc4.i386
>>> torque-server-2.3.0-snap.200801151629.2cri.slc4.i386
>>> root@grid-batch3: [~] uname -a
>>> Linux grid-batch3.desy.de 2.6.9-78.0.1.ELsmp #1 SMP Tue Aug 5
>>> 12:59:28
>>> CDT 2008 i686 i686 i386 GNU/Linux
>>> root@grid-batch3: [~] cat /etc/issue
>>> Scientific Linux SL release 4.4 (Beryllium)
>>> Kernel \r on an \m
>>>
>>> An example output can be found here:
>>> http://www.desy.de/~kemp/diagnose.txt
>>>
>>> I have put the config files here:
>>> http://www.desy.de/~kemp/pbs_server.conf
>>> http://www.desy.de/~kemp/maui.cfg
>>> The information about the fairshare seems to be there, as shown
>>> e.g. in
>>> the file /var/spool/maui/stats/FS.1228780800
>>> http://www.desy.de/~kemp/FS.1228780800
>>> so we assume that scheduling is not affected (but we do not really
>>> know...).
>>>
>>>
>>> Does anyone have an idea what is going wrong?
>>>
>>> Thanks for any hint!
>>>
>>> Best
>>>
>>> Yves
>>>
>>> # Yves Kemp: [log in to unmask]
>>> # DESY IT 2b/314, Notkestr. 85, D-22607 Hamburg
>>> # FON: +49-(0)-40-8998-2318, FAX: +49-(0)-40-8994-2318
>
> # Yves Kemp: [log in to unmask]
> # DESY IT 2b/314, Notkestr. 85, D-22607 Hamburg
> # FON: +49-(0)-40-8998-2318, FAX: +49-(0)-40-8994-2318