On Fri, Jul 10, 2009 at 2:43 PM, Michel Jouvin <[log in to unmask]> wrote:
> Be aware that 2.3.5/6 have a serious flaw: restarting the Torque server is
> very likely to kill all clients... You need to run a cron job on the WNs to
> restart pbs_mom in such a situation...

We've tried to reproduce this but have failed. Is anyone else running
2.3.6 at a sizeable site?

Steve

>
> AFAIK this is a problem known to the Torque maintainers, with no fix
> available yet.
>
> Michel
>
> --On Friday 10 July 2009 14:27 +0200 Andreas Unterkircher
> <[log in to unmask]> wrote:
>
>> Hi Dima,
>>
>> if you use the PPS repository for SL4/gLite 3.1 you get Torque 2.3.6,
>> however for SL5/gLite 3.2 we still have Torque 2.3.0 (2.3.6 is currently
>> in certification for SL5). So if you mix SL4/SL5 with PPS you have to
>> sort this out yourself. I'd recommend updating manually to Torque 2.3.6
>> on SL5. The rpms are here:
>>
>> http://skoji.cern.ch/sa1/centos5-torque/
>>
>> Best regards,
>> Andreas
>>
>> On Thu, 9 Jul 2009, Dmitri Ozerov wrote:
>>
>>> Hi Andreas,
>>>
>>> for the production system we will use (and are already using) the
>>> production meta packages. But what about the PPS system? Do you advise
>>> using the production version on the server and the pps version on the
>>> client for now, and changing once the pps-lcg-ce/torque is released
>>> for pps?
>>>
>>> Cheers,
>>> Dima.
>>>
>>> On Thu, 9 Jul 2009, Andreas Unterkircher wrote:
>>>
>>> > Hi Dmitry,
>>> >
>>> > looking at the rpm lists you sent I see that you are using torque
>>> > 2.3.6 on the server and torque 2.3.0 for the client. These two are not
>>> > compatible.
>>> >
>>> > However the gLite 3.1/SL4 glite-TORQUE_server meta package uses torque
>>> > 2.3.0 and the gLite 3.2/SL5 glite-TORQUE_client meta package also uses
>>> > torque 2.3.0. So if you use the production versions of gLite 3.1/3.2
>>> > you should not see this problem.
>>> > If you really want to use 2.3.6 right now you have to
>>> > manually make sure that you use the same torque versions on all nodes.
>>> >
>>> > Best regards,
>>> > Andreas
>>> >
>>> >
>>> >
>>> > On Thu, 9 Jul 2009, Dmitry Ozerov wrote:
>>> >
>>> >> Hi,
>>> >>
>>> >> I'm out of ideas and need help on this topic:
>>> >>
>>> >> I have a problem with the PPS-glite-WN glite 3.2 installation.
>>> >> The batch server and lcg-CE are installed on two different machines.
>>> >> The problem is traced to the communication between the batch server
>>> >> and the node. As the grid user on the server I run qsub -I -q
>>> >> <queue> and see the job in the queued state with qstat; then, after a
>>> >> few minutes, the node to which the job is sent becomes "down", and on
>>> >> the node: /etc/init.d/pbs_mom status
>>> >> pbs_mom dead but subsys locked
>>> >> (without jobs the pbs_mom is "running").
>>> >>
>>> >> The message in /var/spool/pbs/server_logs on the server side is:
>>> >> 07/09/2009 16:28:27;0008;PBS_Server;Job;658.tb021.desy.de;send of job
>>> >> to tb019.desy.de failed error = 15031
>>> >> 07/09/2009 16:28:27;0001;PBS_Server;Svr;PBS_Server;Batch protocol
>>> >> error (15031) in send_job, child failed in previous commit request
>>> >> for job 658.tb021.desy.de
>>> >> 07/09/2009 16:28:27;0008;PBS_Server;Job;658.tb021.desy.de;unable to
>>> >> run job, MOM rejected/rc=1
>>> >> 07/09/2009 16:28:27;0080;PBS_Server;Req;req_reject;Reject reply
>>> >> code=15041(Execution server rejected request MSG=cannot send job to
>>> >> mom, state=PRERUN), aux=0, type=RunJob, from [log in to unmask]
>>> >>
>>> >> When I connect a PPS WN from the glite 3.1 release to this server,
>>> >> everything works fine.
>>> >>
>>> >> Details:
>>> >> server:
>>> >> Scientific Linux SL release 4.7 (Beryllium)
>>> >> PPS-glite-TORQUE_server-3.1.9-0
>>> >> PPS-glite-TORQUE_utils-3.1.12-0
>>> >> torque-devel-2.3.6-1cri.slc4
>>> >> torque-drmaa-2.3.6-1cri.slc4
>>> >> glite-yaim-torque-utils-4.0.3-1
>>> >> torque-client-2.3.6-1cri.slc4
>>> >> torque-server-2.3.6-1cri.slc4
>>> >> torque-drmaa-docs-2.3.6-1cri.slc4
>>> >> glite-yaim-torque-server-4.0.3-2
>>> >> torque-2.3.6-1cri.slc4
>>> >> torque-docs-2.3.6-1cri.slc4
>>> >> Linux tb021 2.6.18-128.1.6.el5xen #1 SMP Wed Apr 1 07:21:08 EDT 2009
>>> >> x86_64 x86_64 x86_64 GNU/Linux
>>> >>
>>> >> 5.3 client:
>>> >> Scientific Linux SL release 5.3 (Boron)
>>> >> PPS-glite-WN-version-3.2.3-0
>>> >> PPS-glite-TORQUE_client-3.2.1-0
>>> >> torque-mom-2.3.0-snap.200801151629.2cri.sl5
>>> >> torque-client-2.3.0-snap.200801151629.2cri.sl5
>>> >> torque-2.3.0-snap.200801151629.2cri.sl5
>>> >> glite-yaim-torque-client-4.0.1-1
>>> >> Linux tb019.desy.de 2.6.18-128.1.14.el5 #1 SMP Tue Jun 16 18:47:37 EDT
>>> >> 2009 x86_64 x86_64 x86_64 GNU/Linux
>>> >>
>>> >> 4.7 client:
>>> >> Scientific Linux SL release 4.7 (Beryllium)
>>> >> PPS-glite-TORQUE_client-3.1.8-0.i386
>>> >> PPS-glite-WN-3.1.35-0.i386
>>> >> torque-docs-2.3.6-1cri.slc4.i386
>>> >> torque-client-2.3.6-1cri.slc4.i386
>>> >> glite-yaim-torque-client-4.0.2-1.noarch
>>> >> torque-mom-2.3.6-1cri.slc4.i386
>>> >> torque-pam-2.3.6-1cri.slc4.i386
>>> >> torque-devel-2.3.6-1cri.slc4.i386
>>> >> torque-2.3.6-1cri.slc4.i386
>>> >> Linux tb018.desy.de 2.6.9-78.0.22.ELsmp #1 SMP Thu Apr 30 23:30:54 CDT
>>> >> 2009 i686 i686 i386 GNU/Linux
>>> >>
>>> >> Thanks for any help,
>>> >> Dima.
>>> >>
>>> >> P.S. (I can ssh from the client to the server without a password)
>>> >>
>>> >
>>> > --
>>> > Andreas Unterkircher
>>> > IT Department
>>> > Grid Deployment Group
>>> > CERN
>>> > CH-1211 Geneva 23
>>> >
>>>
>>>
>>> Cheers,
>>> Dima.
>>>
>>
>>
>> --
>> Andreas Unterkircher
>> IT Department
>> Grid Deployment Group
>> CERN
>> CH-1211 Geneva 23
>
>
>
> *************************************************************
> * Michel Jouvin                 Email : [log in to unmask]  *
> * LAL / CNRS                    Tel : +33 1 64468932        *
> * B.P. 34                       Fax : +33 1 69079404        *
> * 91898 Orsay Cedex                                         *
> * France                                                    *
> *************************************************************
>

--
Steve Traylen
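
For anyone wanting to check the "same torque versions on all nodes" advice quoted above, here is a minimal sketch in Python. It assumes you have already collected the rpm names per node (for example from `rpm -qa`); the function names, the regular expression and the "server" dictionary key are hypothetical choices for this illustration, not part of Torque or gLite. The sample data is a subset of the package lists quoted in this thread.

```python
import re

# Matches core Torque packages such as "torque-mom-2.3.6-1cri.slc4.i386".
# The pattern is an assumption based on the package names quoted in this
# thread; it deliberately ignores gLite meta packages and yaim helpers.
TORQUE_RPM = re.compile(r"^torque(?:-[a-z]+)*-(\d+\.\d+\.\d+)-")


def torque_versions(rpm_names):
    """Return the set of Torque versions found in a list of rpm names."""
    versions = set()
    for name in rpm_names:
        match = TORQUE_RPM.match(name.strip())
        if match:
            versions.add(match.group(1))
    return versions


def find_mismatches(nodes):
    """Return {node: [versions]} for nodes that differ from the server.

    `nodes` maps a node name to its list of rpm names. The key "server"
    is a hypothetical convention used only in this sketch.
    """
    expected = torque_versions(nodes["server"])
    return {
        node: sorted(torque_versions(rpms))
        for node, rpms in nodes.items()
        if node != "server" and torque_versions(rpms) != expected
    }


# Subset of the package lists quoted in the thread.
EXAMPLE_NODES = {
    "server": [
        "torque-server-2.3.6-1cri.slc4",
        "torque-2.3.6-1cri.slc4",
        "glite-yaim-torque-server-4.0.3-2",
    ],
    "tb019": [
        "torque-mom-2.3.0-snap.200801151629.2cri.sl5",
        "torque-2.3.0-snap.200801151629.2cri.sl5",
        "glite-yaim-torque-client-4.0.1-1",
    ],
    "tb018": [
        "torque-mom-2.3.6-1cri.slc4.i386",
        "torque-2.3.6-1cri.slc4.i386",
        "glite-yaim-torque-client-4.0.2-1.noarch",
    ],
}

if __name__ == "__main__":
    print(find_mismatches(EXAMPLE_NODES))
```

With the sample data only tb019 is reported, which matches the 2.3.6 server / 2.3.0 client combination described in the original report. The check compares version numbers only and says nothing about whether two different versions are in fact protocol-compatible.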