Hi Michel,
Thanks for pointing out this danger. Do you know a bit more? Is the problem in the pbs_mom of 2.3.5/6? We have been running the production system with a 2.3.5 server and 2.3.0 clients without any similar glitch.
Best regards, Christoph
On Fri, 10 Jul 2009 14:43:08 +0200
Michel Jouvin <[log in to unmask]> wrote:
> Be aware that 2.3.5/6 have a serious flaw: restarting the Torque server is
> very likely to kill all clients... You need to run a cron job on the WNs to
> restart pbs_mom in such a situation...
>
> AFAIK this is a problem known to the Torque maintainers, with no fix
> available yet.
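A minimal sketch of the cron-job workaround described above, in Python. It assumes the init script path /etc/init.d/pbs_mom quoted further down in this thread, and that its "status" output contains the word "running" when the mom is healthy and "dead" when it is not. The function names and the injectable runner argument are illustrative only, not part of Torque.

```python
import subprocess

INIT_SCRIPT = "/etc/init.d/pbs_mom"  # path as quoted in this thread


def mom_is_running(status_output):
    """Return True if the init-script status text reports a running pbs_mom.

    Assumption: a healthy mom reports "running"; a crashed one reports
    something like "pbs_mom dead but subsys locked".
    """
    text = status_output.lower()
    return "running" in text and "dead" not in text


def run_init(action, runner=subprocess.run):
    """Run the init script with the given action and return its output text."""
    result = runner(
        [INIT_SCRIPT, action],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout


def check_and_restart(runner=subprocess.run):
    """Restart pbs_mom if it is not running; return True if a restart was issued."""
    if mom_is_running(run_init("status", runner)):
        return False
    run_init("restart", runner)
    return True
```

A small wrapper that calls check_and_restart() could then be run from cron on each WN every few minutes. This only papers over the crash; it does not address the underlying flaw.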
>
> Michel
>
> --On Friday, 10 July 2009 14:27 +0200 Andreas Unterkircher
> <[log in to unmask]> wrote:
>
> > Hi Dima,
> >
> > if you use the PPS repository for SL4/gLite 3.1 you get Torque 2.3.6;
> > however, for SL5/gLite 3.2 we still have Torque 2.3.0 (2.3.6 is currently
> > in certification for SL5). So if you mix SL4/SL5 with PPS you have to
> > sort this out yourself. I'd recommend updating manually to Torque 2.3.6
> > on SL5. The rpms are here:
> >
> > http://skoji.cern.ch/sa1/centos5-torque/
> >
> > Best regards,
> > Andreas
> >
> > On Thu, 9 Jul 2009, Dmitri Ozerov wrote:
> >
> >> Hi Andreas,
> >>
> >> for the production system we will use (and are already using) the
> >> production meta packages. But what about the PPS system? Do you advise
> >> using the production version on the server and the PPS version on the
> >> client for now, and changing once the pps-lcg-ce/torque is released for PPS?
> >>
> >> Cheers,
> >> Dima.
> >>
> >> On Thu, 9 Jul 2009, Andreas Unterkircher wrote:
> >>
> >> > Hi Dmitry,
> >> >
> >> > looking at the rpm lists you sent, I see that you are using Torque
> >> > 2.3.6 on the server and Torque 2.3.0 on the client. These two are not
> >> > compatible.
> >> >
> >> > However the gLite 3.1/SL4 glite-TORQUE_server meta package uses torque
> >> > 2.3.0 and the gLite 3.2/SL5 glite-TORQUE_client meta package also uses
> >> > torque 2.3.0. So if you use the production versions of gLite 3.1/3.2
> >> > you should not see this problem.
> >> > If you really want to use 2.3.6 right now you have to
> >> > manually make sure that you use the same torque versions on all nodes.
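One way to check that all nodes carry the same Torque version is to compare the rpm names collected from each node. The sketch below is only an illustration: it assumes package names shaped like the ones listed later in this thread (torque[-subpackage]-X.Y.Z-release), and the function names and the way the lists are gathered (for example from "rpm -qa" on each node) are assumptions.

```python
import re

# Matches names such as torque-2.3.6-1cri.slc4 or
# torque-mom-2.3.0-snap.200801151629.2cri.sl5, capturing the X.Y.Z version.
# Names like glite-yaim-torque-client-... do not start with "torque" and are skipped.
TORQUE_RPM = re.compile(r"^torque(?:-[a-z]+)*-(\d+\.\d+\.\d+)-")


def torque_versions(packages):
    """Return the set of Torque versions found in a list of rpm package names."""
    versions = set()
    for name in packages:
        match = TORQUE_RPM.match(name.strip())
        if match:
            versions.add(match.group(1))
    return versions


def find_mismatches(nodes):
    """Return (versions per node, True if every node agrees on one version)."""
    per_node = {node: torque_versions(pkgs) for node, pkgs in nodes.items()}
    all_versions = set()
    for versions in per_node.values():
        all_versions |= versions
    return per_node, len(all_versions) <= 1


# Example using package names taken from the lists in this thread.
nodes = {
    "tb021": ["torque-server-2.3.6-1cri.slc4", "torque-2.3.6-1cri.slc4"],
    "tb019.desy.de": ["torque-mom-2.3.0-snap.200801151629.2cri.sl5"],
}
per_node, consistent = find_mismatches(nodes)
print(per_node, consistent)
```

With the two nodes above, consistent comes out False, because the server has 2.3.6 and the SL5 client has 2.3.0.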
> >> >
> >> > Best regards,
> >> > Andreas
> >> >
> >> >
> >> >
> >> > On Thu, 9 Jul 2009, Dmitry Ozerov wrote:
> >> >
> >> >> Hi,
> >> >>
> >> >> I'm out of ideas and need help on this topic:
> >> >>
> >> >> I have a problem with the PPS-glite-WN gLite 3.2 installation.
> >> >> The batch server and the lcg-CE are installed on two different machines.
> >> >> The problem has been traced to the communication between the batch
> >> >> server and the node. As the grid user on the server, I run qsub -I -q
> >> >> <queue> and see the job in queued state in qstat; then, after a few
> >> >> minutes, the node to which the job is sent becomes "down", and on the
> >> >> node: /etc/init.d/pbs_mom status
> >> >> pbs_mom dead but subsys locked
> >> >> (without jobs the pbs_mom is "running").
> >> >>
> >> >> The message in /var/spool/pbs/server_logs on server side is :
> >> >> 07/09/2009 16:28:27;0008;PBS_Server;Job;658.tb021.desy.de;send of job
> >> >> to tb019.desy.de failed error = 15031
> >> >> 07/09/2009 16:28:27;0001;PBS_Server;Svr;PBS_Server;Batch protocol
> >> >> error (15031) in send_job, child failed in previous commit request
> >> >> for job 658.tb021.desy.de
> >> >> 07/09/2009 16:28:27;0008;PBS_Server;Job;658.tb021.desy.de;unable to
> >> >> run job, MOM rejected/rc=1
> >> >> 07/09/2009 16:28:27;0080;PBS_Server;Req;req_reject;Reject reply
> >> >> code=15041(Execution server rejected request MSG=cannot send job to
> >> >> mom, state=PRERUN), aux=0, type=RunJob, from [log in to unmask]
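When there are many such entries, it can help to pull the job ID and error codes out of the server log automatically. The sketch below assumes each unwrapped log line has six semicolon-separated fields, as in the entries above; the field names in LogRecord are labels chosen here for illustration, and the pattern for error codes (five digits starting with 15) is an assumption based on the codes 15031 and 15041 seen in this log.

```python
import re
from collections import namedtuple

# Field names are illustrative labels for the six semicolon-separated fields.
LogRecord = namedtuple(
    "LogRecord", "timestamp event daemon obj_class obj_name message"
)

# Assumption: error codes look like 15031 or 15041 (five digits starting with 15).
PBS_ERROR = re.compile(r"\b(15\d{3})\b")


def parse_line(line):
    """Split one unwrapped server log line into a LogRecord, or return None."""
    parts = line.rstrip("\n").split(";", 5)  # the message itself may contain ';'
    if len(parts) != 6:
        return None
    return LogRecord(*parts)


def error_codes(record):
    """Return the numeric error codes mentioned in a record's message."""
    return [int(code) for code in PBS_ERROR.findall(record.message)]


line = (
    "07/09/2009 16:28:27;0008;PBS_Server;Job;658.tb021.desy.de;"
    "send of job to tb019.desy.de failed error = 15031"
)
record = parse_line(line)
print(record.obj_name, error_codes(record))
```

Lines that were wrapped by the mail client, like the ones quoted above, would need to be joined back into single lines before parsing.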
> >> >>
> >> >> When I connect a PPS WN from the gLite 3.1 release to this server,
> >> >> everything works fine.
> >> >>
> >> >> Details:
> >> >> server:
> >> >> Scientific Linux SL release 4.7 (Beryllium)
> >> >> PPS-glite-TORQUE_server-3.1.9-0
> >> >> PPS-glite-TORQUE_utils-3.1.12-0
> >> >> torque-devel-2.3.6-1cri.slc4
> >> >> torque-drmaa-2.3.6-1cri.slc4
> >> >> glite-yaim-torque-utils-4.0.3-1
> >> >> torque-client-2.3.6-1cri.slc4
> >> >> torque-server-2.3.6-1cri.slc4
> >> >> torque-drmaa-docs-2.3.6-1cri.slc4
> >> >> glite-yaim-torque-server-4.0.3-2
> >> >> torque-2.3.6-1cri.slc4
> >> >> torque-docs-2.3.6-1cri.slc4
> >> >> Linux tb021 2.6.18-128.1.6.el5xen #1 SMP Wed Apr 1 07:21:08 EDT 2009
> >> >> x86_64 x86_64 x86_64 GNU/Linux
> >> >>
> >> >> 5.3 client:
> >> >> Scientific Linux SL release 5.3 (Boron)
> >> >> PPS-glite-WN-version-3.2.3-0
> >> >> PPS-glite-TORQUE_client-3.2.1-0
> >> >> torque-mom-2.3.0-snap.200801151629.2cri.sl5
> >> >> torque-client-2.3.0-snap.200801151629.2cri.sl5
> >> >> torque-2.3.0-snap.200801151629.2cri.sl5
> >> >> glite-yaim-torque-client-4.0.1-1
> >> >> Linux tb019.desy.de 2.6.18-128.1.14.el5 #1 SMP Tue Jun 16 18:47:37 EDT
> >> >> 2009 x86_64 x86_64 x86_64 GNU/Linux
> >> >>
> >> >> 4.7 client:
> >> >> Scientific Linux SL release 4.7 (Beryllium)
> >> >> PPS-glite-TORQUE_client-3.1.8-0.i386
> >> >> PPS-glite-WN-3.1.35-0.i386
> >> >> torque-docs-2.3.6-1cri.slc4.i386
> >> >> torque-client-2.3.6-1cri.slc4.i386
> >> >> glite-yaim-torque-client-4.0.2-1.noarch
> >> >> torque-mom-2.3.6-1cri.slc4.i386
> >> >> torque-pam-2.3.6-1cri.slc4.i386
> >> >> torque-devel-2.3.6-1cri.slc4.i386
> >> >> torque-2.3.6-1cri.slc4.i386
> >> >> Linux tb018.desy.de 2.6.9-78.0.22.ELsmp #1 SMP Thu Apr 30 23:30:54 CDT
> >> >> 2009 i686 i686 i386 GNU/Linux
> >> >>
> >> >> Thanks for any help,
> >> >> Dima.
> >> >>
> >> >> P.S. I can ssh from the client to the server without a password.
> >> >>
> >> >
> >> > --
> >> > Andreas Unterkircher
> >> > IT Department
> >> > Grid Deployment Group
> >> > CERN
> >> > CH-1211 Geneva 23
> >> >
> >>
> >>
> >> Cheers,
> >> Dima.
> >>
> >
> > --
> > Andreas Unterkircher
> > IT Department
> > Grid Deployment Group
> > CERN
> > CH-1211 Geneva 23
>
>
>
> *************************************************************
> * Michel Jouvin Email : [log in to unmask] *
> * LAL / CNRS Tel : +33 1 64468932 *
> * B.P. 34 Fax : +33 1 69079404 *
> * 91898 Orsay Cedex *
> * France *
> *************************************************************
--
+-----------------------------------+
| Christoph Wissing DESY - CMS |
| E-Mail: [log in to unmask] |
| Phone: +49(0)40/8998-4122 |
+-----------------------------------+