On Wed, Dec 10, 2008 at 10:45 AM, Andreas Gellrich <[log in to unmask]> wrote:
> Hi Steve,
> may we try the new Maui RPMs? Where are they?

Andreas,

Follow the names in the Savannah patch and grab them from this link:

http://eticssoft.web.cern.ch/eticssoft/repository/torquemaui

The current versions in the patch are Torque 2.3.5-1 and Maui 3.2.6p21-2.

Note that these have had very little testing, which is exactly why they are in certification. That said, if you do try them I'm happy to have feedback.

Steve

>
> Thanx
> Andreas
>
> On Wed, 10 Dec 2008, Steve Traylen wrote:
>
>> On Wed, Dec 10, 2008 at 9:29 AM, Michel Jouvin <[log in to unmask]>
>> wrote:
>>>
>>> Sorry, I missed the message just before the one I answered. The RPMs
>>> built by Steve have many limits increased compared to the default ones
>>> and are normally suitable for large/very large configurations.
>>
>> The increased limits are described here:
>>
>> https://savannah.cern.ch/bugs/?33484
>>
>> and are in this patch going through certification:
>>
>> https://savannah.cern.ch/patch/?2517
>>
>> Steve
>>>
>>> Michel
>>>
>>> --On Wednesday 10 December 2008 09:27 +0100 Michel Jouvin
>>> <[log in to unmask]> wrote:
>>>
>>>> Jeff,
>>>>
>>>> I had no time to read the whole thread, but just checked! AFAIK, this
>>>> is (this was?) the last snapshot released by ClusterResources, but
>>>> Steve may answer more precisely.
>>>>
>>>> Michel
>>>>
>>>> --On Wednesday 10 December 2008 09:16 +0100 Jeff Templon
>>>> <[log in to unmask]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> are these RPMs the ones I asked for in my previous message?? ;-)
>>>>>
>>>>> JT
>>>>>
>>>>> On 10 Dec 2008, at 08:07, Michel Jouvin wrote:
>>>>>
>>>>>> Yves,
>>>>>>
>>>>>> We experienced such a behaviour (it was with reservations, but I
>>>>>> suspect it may be the same problem).
>>>>>> You may give the latest
>>>>>> snapshot built by Steve Traylen a try (it is not officially released
>>>>>> as part of gLite):
>>>>>>
>>>>>> http://eticssoft.web.cern.ch/eticssoft/repository/torquemaui/torque/2.3.0-2-2/
>>>>>> http://eticssoft.web.cern.ch/eticssoft/repository/torquemaui/maui/3.2.6p20-10/
>>>>>>
>>>>>> We have run it for a couple of months at GRIF and it solved all of
>>>>>> our problems, with typically 1500-2000 concurrent jobs running.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Michel
>>>>>>
>>>>>> --On Wednesday 10 December 2008 07:01 +0100 Yves Kemp
>>>>>> <[log in to unmask]> wrote:
>>>>>>
>>>>>>> Hi Ronald,
>>>>>>>
>>>>>>> thanks for the hint!
>>>>>>> Unfortunately, it did not help: the files are recreated after a
>>>>>>> couple of minutes and have only half the size, but the problem
>>>>>>> still persists as before, even after waiting for a longer time.
>>>>>>>
>>>>>>> Best
>>>>>>>
>>>>>>> Yves
>>>>>>>
>>>>>>> On 09.12.2008, at 21:20, Ronald Starink wrote:
>>>>>>>
>>>>>>>> Hi Yves,
>>>>>>>>
>>>>>>>> At Nikhef we also see this from time to time. It is caused by an
>>>>>>>> internal Maui table getting full. Our workaround is the following:
>>>>>>>>
>>>>>>>> service maui stop
>>>>>>>> rm /var/spool/maui/maui.ck*
>>>>>>>> service maui start
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Ronald
>>>>>>>>
>>>>>>>>
>>>>>>>> Yves Kemp wrote:
>>>>>>>>>
>>>>>>>>> Dear all,
>>>>>>>>>
>>>>>>>>> we currently have a problem with Maui on one of our batch
>>>>>>>>> servers: diagnose -f does not completely report the stanzas
>>>>>>>>> below.
>>>>>>>>>
>>>>>>>>> GROUP
>>>>>>>>> Five Unix groups are missing, although they are defined in the
>>>>>>>>> PBS server and maui.cfg and are using resources.
>>>>>>>>>
>>>>>>>>> QOS
>>>>>>>>> No entry at all (even "QOS" is missing).
>>>>>>>>>
>>>>>>>>> CLASS
>>>>>>>>> The same as for QOS.
>>>>>>>>>
>>>>>>>>> The problem appeared roughly one week ago.
>>>>>>>>> Two events might be correlated: we introduced new hardware
>>>>>>>>> shortly before, the system was under very heavy load (~5000 jobs
>>>>>>>>> in queue and running), and we configured a second CE for some
>>>>>>>>> time on a testing basis (now removed, PBS configuration
>>>>>>>>> reverted).
>>>>>>>>>
>>>>>>>>> We run one CE (grid-ce3.desy.de) in front of this batch server.
>>>>>>>>> Some information relevant to the batch server:
>>>>>>>>>
>>>>>>>>> glite-apel-pbs-2.0.5-2.noarch
>>>>>>>>> maui-3.2.6p20-snap.1182974819.8.slc4.i386
>>>>>>>>> maui-client-3.2.6p20-snap.1182974819.8.slc4.i386
>>>>>>>>> maui-server-3.2.6p20-snap.1182974819.8.slc4.i386
>>>>>>>>> torque-2.3.0-snap.200801151629.2cri.slc4.i386
>>>>>>>>> torque-client-2.3.0-snap.200801151629.2cri.slc4.i386
>>>>>>>>> torque-mom-2.3.0-snap.200801151629.2cri.slc4.i386
>>>>>>>>> torque-server-2.3.0-snap.200801151629.2cri.slc4.i386
>>>>>>>>>
>>>>>>>>> root@grid-batch3: [~] uname -a
>>>>>>>>> Linux grid-batch3.desy.de 2.6.9-78.0.1.ELsmp #1 SMP Tue Aug 5
>>>>>>>>> 12:59:28 CDT 2008 i686 i686 i386 GNU/Linux
>>>>>>>>> root@grid-batch3: [~] cat /etc/issue
>>>>>>>>> Scientific Linux SL release 4.4 (Beryllium)
>>>>>>>>> Kernel \r on an \m
>>>>>>>>>
>>>>>>>>> An example output can be found here:
>>>>>>>>> http://www.desy.de/~kemp/diagnose.txt
>>>>>>>>>
>>>>>>>>> I have put the config files here:
>>>>>>>>> http://www.desy.de/~kemp/pbs_server.conf
>>>>>>>>> http://www.desy.de/~kemp/maui.cfg
>>>>>>>>>
>>>>>>>>> The information about the fairshare seems to be there, as shown
>>>>>>>>> e.g. in the file /var/spool/maui/stats/FS.1228780800
>>>>>>>>> http://www.desy.de/~kemp/FS.1228780800
>>>>>>>>> so we assume that scheduling is not affected (but we do not
>>>>>>>>> really know...).
>>>>>>>>>
>>>>>>>>> Does anyone have an idea what is going wrong?
>>>>>>>>>
>>>>>>>>> Thanks for any hint!
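[Editor's note: Ronald's Nikhef workaround further up in the thread can be wrapped in a small script. This is only a sketch: the spool path /var/spool/maui is the default taken from his mail, and the DRY_RUN guard is my addition so the commands can be previewed before touching a production scheduler.]

```shell
#!/bin/sh
# Sketch of the Nikhef workaround: stop Maui, clear its checkpoint
# files (maui.ck*), and restart so the internal tables are rebuilt.
# DRY_RUN=1 (the default here) only prints the commands instead of
# executing them.
MAUI_SPOOL=${MAUI_SPOOL:-/var/spool/maui}
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

run service maui stop
run rm -f "$MAUI_SPOOL"/maui.ck*   # checkpoint files; Maui recreates them
run service maui start
```

Run it once with the default DRY_RUN=1 to see what would happen, then with DRY_RUN=0 on the batch server itself. Note that removing the checkpoint files also discards any state Maui keeps in them, so this is a workaround rather than a fix.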
>>>>>>>>>
>>>>>>>>> Best
>>>>>>>>>
>>>>>>>>> Yves
>>>>>>>>>
>>>>>>>>> # Yves Kemp: [log in to unmask]
>>>>>>>>> # DESY IT 2b/314, Notkestr. 85, D-22607 Hamburg
>>>>>>>>> # FON: +49-(0)-40-8998-2318, FAX: +49-(0)-40-8994-2318
>>>>>>
>>>>>> *************************************************************
>>>>>> * Michel Jouvin      Email : [log in to unmask]             *
>>>>>> * LAL / CNRS         Tel : +33 1 64468932                   *
>>>>>> * B.P. 34            Fax : +33 1 69079404                   *
>>>>>> * 91898 Orsay Cedex                                         *
>>>>>> * France                                                    *
>>>>>> *************************************************************
>>
>> --
>> Steve Traylen
>
> ----
> Andreas Gellrich <[log in to unmask]>
> DESY IT / Grid Computing
> http://www.desy.de/~gellrich

--
Steve Traylen
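[Editor's note: for anyone comparing against Yves's symptom, a quick sanity check on saved `diagnose -f` output can report which of the GROUP, QOS and CLASS stanzas are missing. This is a sketch, not from the thread: the function name, the output file path, and the assumption that each stanza header starts a line are mine.]

```shell
#!/bin/sh
# Check saved `diagnose -f` output for the stanza headers that were
# missing in Yves's report. Save the output first, e.g.
#   diagnose -f > /tmp/diagnose.out
# then call: check_stanzas /tmp/diagnose.out
check_stanzas() {
    file=$1
    missing=""
    for stanza in GROUP QOS CLASS; do
        # Assume each stanza header appears at the start of a line.
        grep -q "^$stanza" "$file" || missing="$missing $stanza"
    done
    if [ -n "$missing" ]; then
        echo "missing stanzas:$missing"
        return 1
    fi
    echo "all stanzas present"
}
```

The non-zero return code makes it easy to drop into a cron job or Nagios-style probe so the condition is caught before users notice scheduling oddities.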