On Fri, Jul 10, 2009 at 2:43 PM, Michel Jouvin<[log in to unmask]> wrote:
> Be aware that 2.3.5/6 have a serious flaw: restarting the Torque server is
> very likely to kill all clients... You need to run a cron job on the WNs to
> restart pbs_mom in such a situation...

We've tried to reproduce this but have failed. Is anyone else running
2.3.6 at a site of some size?

Steve


>
> AFAIK this is a problem known to the Torque maintainers, with no fix
> available yet.
>
> Michel
>
> --On vendredi 10 juillet 2009 14:27 +0200 Andreas Unterkircher
> <[log in to unmask]> wrote:
>
>> Hi Dima,
>>
>> if you use the PPS repository for SL4/gLite 3.1 you get Torque 2.3.6,
>> however for SL5/gLite 3.2 we still have Torque 2.3.0 (2.3.6 is currently
>> in certification for SL5). So if you mix SL4/SL5 with PPS you have to
>> sort this out yourself. I'd recommend updating manually to Torque 2.3.6
>> on SL5. The rpms are here:
>>
>> http://skoji.cern.ch/sa1/centos5-torque/
>>
>> Best regards,
>> Andreas
>>
>> On Thu, 9 Jul 2009, Dmitri Ozerov wrote:
>>
>>>   Hi Andreas,
>>>
>>>  For the production system we will use (and are already using) the
>>> production meta packages. But what about the PPS system? Do you now advise
>>> using the production version on the server and the PPS version on the
>>> client, and changing once the pps-lcg-ce/torque is released for PPS?
>>>
>>>  Cheers,
>>>  Dima.
>>>
>>> On Thu, 9 Jul 2009, Andreas Unterkircher wrote:
>>>
>>> > Hi Dmitry,
>>> >
>>> > looking at the rpm lists you sent I see that you are using torque
>>> > 2.3.6 on the server and torque 2.3.0 for the client. These two are not
>>> > compatible.
>>> >
>>> > However the gLite 3.1/SL4 glite-TORQUE_server meta package uses torque
>>> > 2.3.0 and the gLite 3.2/SL5 glite-TORQUE_client meta package also uses
>>> > torque 2.3.0. So if you use the production versions of gLite 3.1/3.2
>>> > you should not see this problem.
>>> > If you really want to use 2.3.6 right now you have to
>>> > manually make sure that you use the same torque versions on all nodes.
>>> >
>>> > Best regards,
>>> > Andreas
>>> >
>>> >
>>> >
>>> > On Thu, 9 Jul 2009, Dmitry Ozerov wrote:
>>> >
>>> >>   Hi,
>>> >>
>>> >>  I'm out of ideas and need help on this topic:
>>> >>
>>> >>  I have a problem with the PPS-glite-WN gLite 3.2 installation.
>>> >> The batch server and lcg-CE are installed on two different machines.
>>> >> The problem is traced to the communication between the batch server
>>> >> and the node. As the grid user on the server, I run qsub -I -q
>>> >> <queue> and see the job with qstat in the queued state; then, after a few
>>> >> minutes, the node to which the job is sent becomes "down" and on the
>>> >> node: /etc/init.d/pbs_mom status
>>> >> pbs_mom dead but subsys locked
>>> >> (without jobs the pbs_mom is "running").
>>> >>
>>> >> The message in /var/spool/pbs/server_logs on the server side is:
>>> >> 07/09/2009 16:28:27;0008;PBS_Server;Job;658.tb021.desy.de;send of job
>>> >> to tb019.desy.de failed error = 15031
>>> >> 07/09/2009 16:28:27;0001;PBS_Server;Svr;PBS_Server;Batch protocol
>>> >> error (15031) in send_job, child failed in previous commit request
>>> >> for job 658.tb021.desy.de
>>> >> 07/09/2009 16:28:27;0008;PBS_Server;Job;658.tb021.desy.de;unable to
>>> >> run job, MOM rejected/rc=1
>>> >> 07/09/2009 16:28:27;0080;PBS_Server;Req;req_reject;Reject reply
>>> >> code=15041(Execution server rejected request MSG=cannot send job to
>>> >> mom, state=PRERUN), aux=0, type=RunJob, from [log in to unmask]
>>> >>
>>> >>  When I connect a PPS WN from the gLite 3.1 release to this server,
>>> >> everything works fine.
>>> >>
>>> >>  Details:
>>> >> server:
>>> >> Scientific Linux SL release 4.7 (Beryllium)
>>> >> PPS-glite-TORQUE_server-3.1.9-0
>>> >> PPS-glite-TORQUE_utils-3.1.12-0
>>> >> torque-devel-2.3.6-1cri.slc4
>>> >> torque-drmaa-2.3.6-1cri.slc4
>>> >> glite-yaim-torque-utils-4.0.3-1
>>> >> torque-client-2.3.6-1cri.slc4
>>> >> torque-server-2.3.6-1cri.slc4
>>> >> torque-drmaa-docs-2.3.6-1cri.slc4
>>> >> glite-yaim-torque-server-4.0.3-2
>>> >> torque-2.3.6-1cri.slc4
>>> >> torque-docs-2.3.6-1cri.slc4
>>> >> Linux tb021 2.6.18-128.1.6.el5xen #1 SMP Wed Apr 1 07:21:08 EDT 2009
>>> >> x86_64 x86_64 x86_64 GNU/Linux
>>> >>
>>> >> 5.3 client:
>>> >> Scientific Linux SL release 5.3 (Boron)
>>> >> PPS-glite-WN-version-3.2.3-0
>>> >> PPS-glite-TORQUE_client-3.2.1-0
>>> >> torque-mom-2.3.0-snap.200801151629.2cri.sl5
>>> >> torque-client-2.3.0-snap.200801151629.2cri.sl5
>>> >> torque-2.3.0-snap.200801151629.2cri.sl5
>>> >> glite-yaim-torque-client-4.0.1-1
>>> >> Linux tb019.desy.de 2.6.18-128.1.14.el5 #1 SMP Tue Jun 16 18:47:37 EDT
>>> >> 2009 x86_64 x86_64 x86_64 GNU/Linux
>>> >>
>>> >> 4.7 client:
>>> >> Scientific Linux SL release 4.7 (Beryllium)
>>> >> PPS-glite-TORQUE_client-3.1.8-0.i386
>>> >> PPS-glite-WN-3.1.35-0.i386
>>> >> torque-docs-2.3.6-1cri.slc4.i386
>>> >> torque-client-2.3.6-1cri.slc4.i386
>>> >> glite-yaim-torque-client-4.0.2-1.noarch
>>> >> torque-mom-2.3.6-1cri.slc4.i386
>>> >> torque-pam-2.3.6-1cri.slc4.i386
>>> >> torque-devel-2.3.6-1cri.slc4.i386
>>> >> torque-2.3.6-1cri.slc4.i386
>>> >> Linux tb018.desy.de 2.6.9-78.0.22.ELsmp #1 SMP Thu Apr 30 23:30:54 CDT
>>> >> 2009 i686 i686 i386 GNU/Linux
>>> >>
>>> >>  Thanks for any help,
>>> >>  Dima.
>>> >>
>>> >> P.S. (I can ssh from the client to the server without a password)
>>> >>
>>> >
>>> > --
>>> > Andreas Unterkircher
>>> > IT Department
>>> > Grid Deployment Group
>>> > CERN
>>> > CH-1211 Geneva 23
>>> >
>>>
>>>
>>>   Cheers,
>>>   Dima.
>>>
>>
>> --
>> Andreas Unterkircher
>> IT Department
>> Grid Deployment Group
>> CERN
>> CH-1211 Geneva 23
>
>
>
>    *************************************************************
>    * Michel Jouvin                 Email : [log in to unmask] *
>    * LAL / CNRS                    Tel : +33 1 64468932        *
>    * B.P. 34                       Fax : +33 1 69079404        *
>    * 91898 Orsay Cedex                                         *
>    * France                                                    *
>    *************************************************************
>



-- 
Steve Traylen
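
For anyone hitting the same 2.3.0/2.3.6 mismatch described further down the thread, here is a minimal sketch of a check that all nodes carry the same Torque version. It assumes you have already collected the rpm names per host (for example from `rpm -qa | grep torque`). The function names `torque_versions` and `check_cluster` are made up for illustration, and the regular expression only covers the package naming seen in the lists quoted in this thread.

```python
import re

# Matches names like "torque-mom-2.3.6-1cri.slc4.i386" or "torque-2.3.0-snap...."
# and captures the upstream version ("2.3.6"). Packages such as
# "glite-yaim-torque-client-4.0.1-1" do not start with "torque" and are skipped.
TORQUE_RPM = re.compile(r"^torque(?:-[a-z]+)*-(\d+(?:\.\d+)+)-")


def torque_versions(rpm_names):
    """Return the set of Torque versions found in a list of rpm names."""
    versions = set()
    for name in rpm_names:
        match = TORQUE_RPM.match(name.strip())
        if match:
            versions.add(match.group(1))
    return versions


def check_cluster(nodes):
    """nodes maps host -> list of rpm names.

    Returns (consistent, per_node), where consistent is True only if at most
    one Torque version is present across all hosts.
    """
    per_node = {host: torque_versions(rpms) for host, rpms in nodes.items()}
    all_versions = set().union(*per_node.values())
    return len(all_versions) <= 1, per_node


if __name__ == "__main__":
    # Package names taken from the lists quoted in this thread (abridged).
    nodes = {
        "tb021": ["torque-server-2.3.6-1cri.slc4", "torque-2.3.6-1cri.slc4",
                  "glite-yaim-torque-server-4.0.3-2"],
        "tb019.desy.de": ["torque-mom-2.3.0-snap.200801151629.2cri.sl5",
                          "torque-2.3.0-snap.200801151629.2cri.sl5"],
        "tb018.desy.de": ["torque-mom-2.3.6-1cri.slc4.i386",
                          "torque-2.3.6-1cri.slc4.i386"],
    }
    consistent, per_node = check_cluster(nodes)
    for host in sorted(per_node):
        print(host, sorted(per_node[host]))
    print("consistent:", consistent)
```

With the three hosts from the thread this reports the cluster as inconsistent, since tb019 is on 2.3.0 while the server and tb018 are on 2.3.6. It only compares the upstream version string, not the release tag, so treat it as a first sanity check rather than a full compatibility test.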