On Wed, Dec 10, 2008 at 9:29 AM, Michel Jouvin <[log in to unmask]> wrote:
> Sorry, I missed the message just before the one I answered. The RPMs built
> by Steve have many limits increased compared to the default ones and are
> normally suitable for large/very large configurations.

The increased limits are described here:

  https://savannah.cern.ch/bugs/?33484

and are in this patch going through certification:

  https://savannah.cern.ch/patch/?2517

Steve

> Michel
>
> --On Wednesday, 10 December 2008 09:27 +0100 Michel Jouvin
> <[log in to unmask]> wrote:
>
>> Jeff,
>>
>> I had no time to read the whole thread, but just check! AFAIK, this is
>> (this was?) the last snapshot released by Cluster Resources, but Steve
>> may answer more precisely.
>>
>> Michel
>>
>> --On Wednesday, 10 December 2008 09:16 +0100 Jeff Templon
>> <[log in to unmask]> wrote:
>>
>>> Hi,
>>>
>>> are these RPMs the ones I asked for in my previous message?? ;-)
>>>
>>> JT
>>>
>>> On 10 Dec 2008, at 08:07, Michel Jouvin wrote:
>>>
>>>> Yves,
>>>>
>>>> We experienced such a behaviour (it was with reservations, but I
>>>> suspect it may be the same problem). You may give a try to the last
>>>> snapshot built by Steve Traylen (but not officially released as part
>>>> of gLite):
>>>>
>>>> http://eticssoft.web.cern.ch/eticssoft/repository/torquemaui/torque/2.3.0-2-2/
>>>> http://eticssoft.web.cern.ch/eticssoft/repository/torquemaui/maui/3.2.6p20-10/
>>>>
>>>> We have run it for a couple of months at GRIF and it solved all of our
>>>> problems, with typically 1500-2000 concurrent jobs running.
>>>>
>>>> Cheers,
>>>>
>>>> Michel
>>>>
>>>> --On Wednesday, 10 December 2008 07:01 +0100 Yves Kemp
>>>> <[log in to unmask]> wrote:
>>>>
>>>>> Hi Ronald,
>>>>>
>>>>> thanks for the hint!
>>>>> Unfortunately, it did not help: the files are recreated after a
>>>>> couple of minutes and have only half the size, but the problem
>>>>> still persists as before, even after waiting for a longer time.
>>>>>
>>>>> Best
>>>>>
>>>>> Yves
>>>>>
>>>>> On 09.12.2008, at 21:20, Ronald Starink wrote:
>>>>>
>>>>>> Hi Yves,
>>>>>>
>>>>>> At Nikhef we also see this from time to time. It is caused by an
>>>>>> internal table for Maui getting full. Our workaround is the
>>>>>> following:
>>>>>>
>>>>>>   service maui stop
>>>>>>   rm /var/spool/maui/maui.ck*
>>>>>>   service maui start
>>>>>>
>>>>>> Cheers,
>>>>>> Ronald
>>>>>>
>>>>>>
>>>>>> Yves Kemp wrote:
>>>>>>>
>>>>>>> Dear all,
>>>>>>>
>>>>>>> we currently have a problem with Maui on one of our batch servers:
>>>>>>> diagnose -f does not completely report the stanzas below:
>>>>>>>
>>>>>>> GROUP
>>>>>>>   Five Unix groups are missing, although they are defined in the
>>>>>>>   PBS server and maui.cfg, and are using resources.
>>>>>>> QOS
>>>>>>>   No entry at all (even the "QOS" header is missing).
>>>>>>> CLASS
>>>>>>>   The same as for QOS.
>>>>>>>
>>>>>>> The problem appeared roughly one week ago. Two events might be
>>>>>>> correlated: we introduced new hardware shortly before, the system
>>>>>>> was under very heavy load (~5000 jobs queued and running), and we
>>>>>>> configured a second CE for some time on a testing basis (now
>>>>>>> removed, PBS configuration reverted).
>>>>>>>
>>>>>>> We run one CE (grid-ce3.desy.de) in front of this batch server.
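Ronald's three-command workaround can be wrapped in a small script. This is a sketch, not part of the thread: the `maui_reset` function name and the dry-run mode are illustrative additions, while the service name and checkpoint path are exactly those given in the mail.

```shell
# Sketch of the Nikhef workaround from this thread: Maui's internal
# table fills up, so stop the scheduler, clear its checkpoint files,
# and restart. The maui_reset name and the dry-run mode are additions
# for illustration; the three commands are those quoted in the mail.
maui_reset() {
    mode=$1   # pass "apply" to execute; anything else only prints
    run() {
        if [ "$mode" = "apply" ]; then
            "$@"
        else
            echo "would run: $*"
        fi
    }
    run service maui stop
    run rm -f /var/spool/maui/maui.ck*
    run service maui start
}

# Dry run first to see what would be done:
maui_reset dry
```

Running it once without `apply` shows the commands before anything is removed; note that, per Yves's follow-up, the checkpoint files are recreated after a few minutes and the symptom may persist.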
>>>>>>> Some information relevant to the batch server:
>>>>>>>
>>>>>>>   glite-apel-pbs-2.0.5-2.noarch
>>>>>>>   maui-3.2.6p20-snap.1182974819.8.slc4.i386
>>>>>>>   maui-client-3.2.6p20-snap.1182974819.8.slc4.i386
>>>>>>>   maui-server-3.2.6p20-snap.1182974819.8.slc4.i386
>>>>>>>   torque-2.3.0-snap.200801151629.2cri.slc4.i386
>>>>>>>   torque-client-2.3.0-snap.200801151629.2cri.slc4.i386
>>>>>>>   torque-mom-2.3.0-snap.200801151629.2cri.slc4.i386
>>>>>>>   torque-server-2.3.0-snap.200801151629.2cri.slc4.i386
>>>>>>>
>>>>>>>   root@grid-batch3: [~] uname -a
>>>>>>>   Linux grid-batch3.desy.de 2.6.9-78.0.1.ELsmp #1 SMP Tue Aug 5
>>>>>>>   12:59:28 CDT 2008 i686 i686 i386 GNU/Linux
>>>>>>>   root@grid-batch3: [~] cat /etc/issue
>>>>>>>   Scientific Linux SL release 4.4 (Beryllium)
>>>>>>>   Kernel \r on an \m
>>>>>>>
>>>>>>> An example output can be found here:
>>>>>>>   http://www.desy.de/~kemp/diagnose.txt
>>>>>>>
>>>>>>> I have put the config files here:
>>>>>>>   http://www.desy.de/~kemp/pbs_server.conf
>>>>>>>   http://www.desy.de/~kemp/maui.cfg
>>>>>>>
>>>>>>> The information about the fairshare seems to be there, as shown
>>>>>>> e.g. in the file /var/spool/maui/stats/FS.1228780800
>>>>>>>   http://www.desy.de/~kemp/FS.1228780800
>>>>>>> so we assume that scheduling is not affected (but we do not really
>>>>>>> know...).
>>>>>>>
>>>>>>> Does anyone have an idea what is going wrong?
>>>>>>>
>>>>>>> Thanks for any hint!
>>>>>>>
>>>>>>> Best
>>>>>>>
>>>>>>> Yves
>>>>>>>
>>>>>>> # Yves Kemp: [log in to unmask]
>>>>>>> # DESY IT 2b/314, Notkestr. 85, D-22607 Hamburg
>>>>>>> # FON: +49-(0)-40-8998-2318, FAX: +49-(0)-40-8994-2318
>>>>
>>>>
>>>> *************************************************************
>>>> * Michel Jouvin         Email : [log in to unmask]          *
>>>> * LAL / CNRS            Tel : +33 1 64468932                *
>>>> * B.P. 34               Fax : +33 1 69079404                *
>>>> * 91898 Orsay Cedex                                         *
>>>> * France                                                    *
>>>> *************************************************************
>

--
Steve Traylen
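The symptom Yves reports (missing GROUP entries and the QOS/CLASS stanzas absent from `diagnose -f` output) can be spotted mechanically by grepping for the stanza headers. This is a sketch, not part of the thread: the `check_stanzas` name is invented, and it assumes the stanza headers appear at the start of a line, as in Yves's sample output.

```shell
# Hedged sketch: scan `diagnose -f` output for the stanza headers that
# went missing in Yves's report. Reads the output from stdin so it can
# also be exercised against canned text.
check_stanzas() {
    input=$(cat)
    for stanza in GROUP QOS CLASS; do
        if printf '%s\n' "$input" | grep -q "^$stanza"; then
            echo "$stanza: present"
        else
            echo "$stanza: MISSING"
        fi
    done
}
```

On the batch server this would be run as `diagnose -f | check_stanzas`; any `MISSING` line reproduces the symptom described above.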