Be aware that 2.3.5/6 have a serious flaw: restarting the Torque server is
very likely to kill all clients... You need to run a cron job on the worker
nodes (WNs) to restart pbs_mom in such a situation...
AFAIK this is a known problem for the Torque maintainers, with no fix
available yet.
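A minimal sketch of the WN cron-job workaround mentioned above; the init-script paths, script location, and schedule are assumptions, not an official fix:

```shell
#!/bin/sh
# pbs_mom watchdog sketch: if the pbs_mom status check fails (the
# "dead but subsys locked" state), restart it. The status and restart
# commands are parameterised so the logic can be exercised without
# Torque installed.

ensure_running() {
    status_cmd=$1
    restart_cmd=$2
    if $status_cmd >/dev/null 2>&1; then
        echo ok
    else
        $restart_cmd >/dev/null 2>&1
        echo restarted
    fi
}

# On a real worker node (assumed SysV init paths):
#   ensure_running "/etc/init.d/pbs_mom status" "/etc/init.d/pbs_mom restart"
#
# Example crontab entry, checking every 5 minutes (assumed install path):
#   */5 * * * * /usr/local/sbin/pbs_mom_watchdog.sh
```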
Michel
--On Friday, 10 July 2009 14:27 +0200 Andreas Unterkircher
<[log in to unmask]> wrote:
> Hi Dima,
>
> if you use the PPS repository for SL4/gLite 3.1 you get Torque 2.3.6;
> however, for SL5/gLite 3.2 we still have Torque 2.3.0 (2.3.6 is currently
> in certification for SL5). So if you mix SL4/SL5 with PPS you have to
> sort this out yourself. I'd recommend updating manually to Torque 2.3.6
> on SL5. The RPMs are here:
>
> http://skoji.cern.ch/sa1/centos5-torque/
>
> Best regards,
> Andreas
>
> On Thu, 9 Jul 2009, Dmitri Ozerov wrote:
>
>> Hi Andreas,
>>
>> for the production system we will use (and are already using) the
>> production meta packages. But what about the PPS system? Do you now
>> advise using the production version on the server and the PPS version on
>> the client, and changing once the pps-lcg-ce/torque is released for PPS?
>>
>> Cheers,
>> Dima.
>>
>> On Thu, 9 Jul 2009, Andreas Unterkircher wrote:
>>
>> > Hi Dmitry,
>> >
>> > looking at the RPM lists you sent, I see that you are using torque
>> > 2.3.6 on the server and torque 2.3.0 on the client. These two are not
>> > compatible.
>> >
>> > However, the gLite 3.1/SL4 glite-TORQUE_server meta package uses torque
>> > 2.3.0 and the gLite 3.2/SL5 glite-TORQUE_client meta package also uses
>> > torque 2.3.0. So if you use the production versions of gLite 3.1/3.2
>> > you should not see this problem.
>> > If you really want to use 2.3.6 right now, you have to manually make
>> > sure that you use the same torque version on all nodes.
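A sketch of the "same torque version on all nodes" check suggested above; the hostnames and the rpm query are illustrative assumptions, not part of any gLite tool:

```shell
#!/bin/sh
# Version-consistency sketch: compare the torque RPM version on the
# server with each worker node and flag mismatches (e.g. 2.3.6 vs 2.3.0).

same_version() {
    if [ "$1" = "$2" ]; then
        echo "match"
    else
        echo "MISMATCH: server=$1 client=$2"
    fi
}

# Assumed usage on the server (hostnames from this thread are examples):
#   server_ver=$(rpm -q --qf '%{VERSION}' torque)
#   for wn in tb018.desy.de tb019.desy.de; do
#       same_version "$server_ver" "$(ssh "$wn" rpm -q --qf '%{VERSION}' torque)"
#   done
```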
>> >
>> > Best regards,
>> > Andreas
>> >
>> >
>> >
>> > On Thu, 9 Jul 2009, Dmitry Ozerov wrote:
>> >
>> >> Hi,
>> >>
>> >> I'm out of ideas and need help on this topic:
>> >>
>> >> I have a problem with the PPS-glite-WN gLite 3.2 installation.
>> >> The batch server and the lcg-CE are installed on two different machines.
>> >> The problem has been traced to the communication between the batch
>> >> server and the node. As a grid user on the server, running qsub -I -q
>> >> <queue>, I see the job in queued state with qstat; then after a few
>> >> minutes the node to which the job is sent becomes "down", and on the
>> >> node /etc/init.d/pbs_mom status reports:
>> >> pbs_mom dead but subsys locked
>> >> (without jobs, pbs_mom is "running").
>> >>
>> >> The messages in /var/spool/pbs/server_logs on the server side are:
>> >> 07/09/2009 16:28:27;0008;PBS_Server;Job;658.tb021.desy.de;send of job to tb019.desy.de failed error = 15031
>> >> 07/09/2009 16:28:27;0001;PBS_Server;Svr;PBS_Server;Batch protocol error (15031) in send_job, child failed in previous commit request for job 658.tb021.desy.de
>> >> 07/09/2009 16:28:27;0008;PBS_Server;Job;658.tb021.desy.de;unable to run job, MOM rejected/rc=1
>> >> 07/09/2009 16:28:27;0080;PBS_Server;Req;req_reject;Reject reply code=15041(Execution server rejected request MSG=cannot send job to mom, state=PRERUN), aux=0, type=RunJob, from [log in to unmask]
>> >>
>> >> When I connect a PPS WN from the gLite 3.1 release to this server,
>> >> everything works fine.
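The numeric codes in the log excerpt above (15031, 15041) are PBS batch protocol error codes; a small sketch for pulling them out of a server_logs file for lookup in the Torque sources (the function name is ours, the log format follows the excerpt):

```shell
#!/bin/sh
# Extract PBS error codes such as "error = 15031" and "code=15041" from a
# Torque server_logs file, deduplicated.

pbs_error_codes() {
    grep -oE 'error = [0-9]+|code=[0-9]+' "$1" | grep -oE '[0-9]+' | sort -u
}

# Assumed usage on the server:
#   pbs_error_codes /var/spool/pbs/server_logs/20090709
```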
>> >>
>> >> Details:
>> >> server:
>> >> Scientific Linux SL release 4.7 (Beryllium)
>> >> PPS-glite-TORQUE_server-3.1.9-0
>> >> PPS-glite-TORQUE_utils-3.1.12-0
>> >> torque-devel-2.3.6-1cri.slc4
>> >> torque-drmaa-2.3.6-1cri.slc4
>> >> glite-yaim-torque-utils-4.0.3-1
>> >> torque-client-2.3.6-1cri.slc4
>> >> torque-server-2.3.6-1cri.slc4
>> >> torque-drmaa-docs-2.3.6-1cri.slc4
>> >> glite-yaim-torque-server-4.0.3-2
>> >> torque-2.3.6-1cri.slc4
>> >> torque-docs-2.3.6-1cri.slc4
>> >> Linux tb021 2.6.18-128.1.6.el5xen #1 SMP Wed Apr 1 07:21:08 EDT 2009
>> >> x86_64 x86_64 x86_64 GNU/Linux
>> >>
>> >> SL 5.3 client:
>> >> Scientific Linux SL release 5.3 (Boron)
>> >> PPS-glite-WN-version-3.2.3-0
>> >> PPS-glite-TORQUE_client-3.2.1-0
>> >> torque-mom-2.3.0-snap.200801151629.2cri.sl5
>> >> torque-client-2.3.0-snap.200801151629.2cri.sl5
>> >> torque-2.3.0-snap.200801151629.2cri.sl5
>> >> glite-yaim-torque-client-4.0.1-1
>> >> Linux tb019.desy.de 2.6.18-128.1.14.el5 #1 SMP Tue Jun 16 18:47:37 EDT
>> >> 2009 x86_64 x86_64 x86_64 GNU/Linux
>> >>
>> >> SL 4.7 client:
>> >> Scientific Linux SL release 4.7 (Beryllium)
>> >> PPS-glite-TORQUE_client-3.1.8-0.i386
>> >> PPS-glite-WN-3.1.35-0.i386
>> >> torque-docs-2.3.6-1cri.slc4.i386
>> >> torque-client-2.3.6-1cri.slc4.i386
>> >> glite-yaim-torque-client-4.0.2-1.noarch
>> >> torque-mom-2.3.6-1cri.slc4.i386
>> >> torque-pam-2.3.6-1cri.slc4.i386
>> >> torque-devel-2.3.6-1cri.slc4.i386
>> >> torque-2.3.6-1cri.slc4.i386
>> >> Linux tb018.desy.de 2.6.9-78.0.22.ELsmp #1 SMP Thu Apr 30 23:30:54 CDT
>> >> 2009 i686 i686 i386 GNU/Linux
>> >>
>> >> Thanks for any help,
>> >> Dima.
>> >>
>> >> P.S. (I can ssh from the client to the server without a password)
>> >>
>> >
>> > --
>> > Andreas Unterkircher
>> > IT Department
>> > Grid Deployment Group
>> > CERN
>> > CH-1211 Geneva 23
>> >
>>
>>
>> Cheers,
>> Dima.
>>
>
> --
> Andreas Unterkircher
> IT Department
> Grid Deployment Group
> CERN
> CH-1211 Geneva 23
*************************************************************
* Michel Jouvin Email : [log in to unmask] *
* LAL / CNRS Tel : +33 1 64468932 *
* B.P. 34 Fax : +33 1 69079404 *
* 91898 Orsay Cedex *
* France *
*************************************************************