Hi Steve,
We are running version 2.3.6 (client and server) and see that the
pbs_moms die with a segfault after a server restart:
Client (SL 5.3):
> rpm -qa | grep torque
torque-2.3.6-1cri.sl5
torque-mom-2.3.6-1cri.sl5
torque-client-2.3.6-1cri.sl5
> less /var/log/messages
...
Jul 15 16:59:13 grid-swm003 pbs_mom: No such process (3) in mom_get_sample, 32411: get_proc_stat
Jul 16 09:48:35 grid-swm003 kernel: pbs_mom[6410]: segfault at 0000000000000004 rip 000000000041812f rsp 00007fff99fb1040 error 4
Server (SL 5.3):
> rpm -qa | grep torque
torque-2.3.6-1cri.sl5
glite-yaim-torque-server-4.0.1-5
torque-server-2.3.6-1cri.sl5
torque-client-2.3.6-1cri.sl5
Cheers
Andreas
On Fri, 10 Jul 2009, Steve Traylen wrote:
> On Fri, Jul 10, 2009 at 2:43 PM, Michel Jouvin<[log in to unmask]> wrote:
>> Be aware that 2.3.5/6 have a serious flaw: restarting the Torque server has a
>> good chance of killing all clients... You need to run a cron job on the WNs to
>> restart pbs_mom in such a situation...
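
The cron-job workaround described above could look something like this. A minimal sketch only: the lock-file path, the use of pgrep, and the script location are assumptions, not details from the thread.

```shell
#!/bin/sh
# Sketch of a pbs_mom watchdog for worker nodes (paths are assumptions).
# Logic: if the subsys lock file is still present but no pbs_mom process
# is running, the mom died unexpectedly and should be restarted.

needs_restart() {
    # $1 = "yes" if /var/lock/subsys/pbs_mom exists, "no" otherwise
    # $2 = "yes" if a pbs_mom process is running, "no" otherwise
    if [ "$1" = "yes" ] && [ "$2" = "no" ]; then
        echo restart
    else
        echo ok
    fi
}

# In the real cron job the inputs would come from the system, e.g.:
#   [ -f /var/lock/subsys/pbs_mom ] && lock=yes || lock=no
#   pgrep -x pbs_mom >/dev/null 2>&1 && running=yes || running=no
#   [ "$(needs_restart "$lock" "$running")" = restart ] && \
#       /etc/init.d/pbs_mom restart
```

Run every few minutes from cron, e.g. a (hypothetical) crontab entry like `*/5 * * * * root /usr/local/sbin/check_pbs_mom.sh`.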
>
> We've tried to reproduce this but have failed. Is anyone else running
> 2.3.6 at a site of some size?
>
> Steve
>
>
>>
>> AFAIK this is a known problem for the Torque maintainers, with no fix
>> available yet.
>>
>> Michel
>>
>> --On Friday, 10 July 2009 14:27 +0200 Andreas Unterkircher
>> <[log in to unmask]> wrote:
>>
>>> Hi Dima,
>>>
>>> if you use the PPS repository for SL4/gLite 3.1 you get Torque 2.3.6,
>>> however for SL5/gLite 3.2 we still have Torque 2.3.0 (2.3.6 is currently
>>> in certification for SL5). So if you mix SL4/SL5 with PPS you have to
>>> sort this out yourself. I'd recommend updating manually to Torque 2.3.6
>>> on SL5. The RPMs are here:
>>>
>>> http://skoji.cern.ch/sa1/centos5-torque/
>>>
>>> Best regards,
>>> Andreas
>>>
>>> On Thu, 9 Jul 2009, Dmitri Ozerov wrote:
>>>
>>>> Hi Andreas,
>>>>
>>>> For the production system we will use (and are already using) the
>>>> production meta packages. But what about the PPS system? Do you advise
>>>> using the production version on the server and the PPS version on the
>>>> client for now, and switching once the pps-lcg-ce/torque is released
>>>> for PPS?
>>>>
>>>> Cheers,
>>>> Dima.
>>>>
>>>> On Thu, 9 Jul 2009, Andreas Unterkircher wrote:
>>>>
>>>>> Hi Dmitry,
>>>>>
>>>>> looking at the rpm lists you sent I see that you are using torque
>>>>> 2.3.6 on the server and torque 2.3.0 on the client. These two are not
>>>>> compatible.
>>>>>
>>>>> However, the gLite 3.1/SL4 glite-TORQUE_server meta package uses torque
>>>>> 2.3.0 and the gLite 3.2/SL5 glite-TORQUE_client meta package also uses
>>>>> torque 2.3.0, so if you use the production versions of gLite 3.1/3.2
>>>>> you should not see this problem. If you really want to use 2.3.6 right
>>>>> now, you have to make sure manually that you use the same torque
>>>>> version on all nodes.
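
That manual consistency check could be scripted roughly as follows. The hostnames and the ssh loop are placeholders, not taken from the thread.

```shell
#!/bin/sh
# Sketch: verify that every node runs the same torque version as the
# server, flagging mismatches like 2.3.6 (server) vs 2.3.0 (client).

check_version() {
    # $1 = server torque version, $2 = node torque version
    if [ "$1" = "$2" ]; then
        echo match
    else
        echo MISMATCH
    fi
}

# Hypothetical usage across nodes:
#   server_ver=$(rpm -q --qf '%{VERSION}-%{RELEASE}' torque)
#   for node in wn001 wn002; do
#       node_ver=$(ssh "$node" rpm -q --qf "'%{VERSION}-%{RELEASE}'" torque)
#       echo "$node: $(check_version "$server_ver" "$node_ver")"
#   done
```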
>>>>>
>>>>> Best regards,
>>>>> Andreas
>>>>>
>>>>>
>>>>>
>>>>> On Thu, 9 Jul 2009, Dmitry Ozerov wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm out of ideas and need help on this topic:
>>>>>>
>>>>>> I have a problem with the PPS-glite-WN gLite 3.2 installation.
>>>>>> The batch server and the lcg-CE are installed on two different machines.
>>>>>> The problem traces back to the communication between the batch server
>>>>>> and the node. As the grid user on the server, running qsub -I -q
>>>>>> <queue>, I see the job in qstat in the queued state; after a few
>>>>>> minutes the node to which the job was sent becomes "down", and on the
>>>>>> node: /etc/init.d/pbs_mom status
>>>>>> pbs_mom dead but subsys locked
>>>>>> (without jobs, pbs_mom is "running").
>>>>>>
>>>>>> The message in /var/spool/pbs/server_logs on the server side is:
>>>>>> 07/09/2009 16:28:27;0008;PBS_Server;Job;658.tb021.desy.de;send of job to tb019.desy.de failed error = 15031
>>>>>> 07/09/2009 16:28:27;0001;PBS_Server;Svr;PBS_Server;Batch protocol error (15031) in send_job, child failed in previous commit request for job 658.tb021.desy.de
>>>>>> 07/09/2009 16:28:27;0008;PBS_Server;Job;658.tb021.desy.de;unable to run job, MOM rejected/rc=1
>>>>>> 07/09/2009 16:28:27;0080;PBS_Server;Req;req_reject;Reject reply code=15041(Execution server rejected request MSG=cannot send job to mom, state=PRERUN), aux=0, type=RunJob, from [log in to unmask]
>>>>>>
>>>>>> When I connect a PPS WN from the gLite 3.1 release to this server,
>>>>>> everything works fine.
>>>>>>
>>>>>> Details:
>>>>>> server:
>>>>>> Scientific Linux SL release 4.7 (Beryllium)
>>>>>> PPS-glite-TORQUE_server-3.1.9-0
>>>>>> PPS-glite-TORQUE_utils-3.1.12-0
>>>>>> torque-devel-2.3.6-1cri.slc4
>>>>>> torque-drmaa-2.3.6-1cri.slc4
>>>>>> glite-yaim-torque-utils-4.0.3-1
>>>>>> torque-client-2.3.6-1cri.slc4
>>>>>> torque-server-2.3.6-1cri.slc4
>>>>>> torque-drmaa-docs-2.3.6-1cri.slc4
>>>>>> glite-yaim-torque-server-4.0.3-2
>>>>>> torque-2.3.6-1cri.slc4
>>>>>> torque-docs-2.3.6-1cri.slc4
>>>>>> Linux tb021 2.6.18-128.1.6.el5xen #1 SMP Wed Apr 1 07:21:08 EDT 2009
>>>>>> x86_64 x86_64 x86_64 GNU/Linux
>>>>>>
>>>>>> 5.3 client:
>>>>>> Scientific Linux SL release 5.3 (Boron)
>>>>>> PPS-glite-WN-version-3.2.3-0
>>>>>> PPS-glite-TORQUE_client-3.2.1-0
>>>>>> torque-mom-2.3.0-snap.200801151629.2cri.sl5
>>>>>> torque-client-2.3.0-snap.200801151629.2cri.sl5
>>>>>> torque-2.3.0-snap.200801151629.2cri.sl5
>>>>>> glite-yaim-torque-client-4.0.1-1
>>>>>> Linux tb019.desy.de 2.6.18-128.1.14.el5 #1 SMP Tue Jun 16 18:47:37 EDT
>>>>>> 2009 x86_64 x86_64 x86_64 GNU/Linux
>>>>>>
>>>>>> 4.7 client:
>>>>>> Scientific Linux SL release 4.7 (Beryllium)
>>>>>> PPS-glite-TORQUE_client-3.1.8-0.i386
>>>>>> PPS-glite-WN-3.1.35-0.i386
>>>>>> torque-docs-2.3.6-1cri.slc4.i386
>>>>>> torque-client-2.3.6-1cri.slc4.i386
>>>>>> glite-yaim-torque-client-4.0.2-1.noarch
>>>>>> torque-mom-2.3.6-1cri.slc4.i386
>>>>>> torque-pam-2.3.6-1cri.slc4.i386
>>>>>> torque-devel-2.3.6-1cri.slc4.i386
>>>>>> torque-2.3.6-1cri.slc4.i386
>>>>>> Linux tb018.desy.de 2.6.9-78.0.22.ELsmp #1 SMP Thu Apr 30 23:30:54 CDT
>>>>>> 2009 i686 i686 i386 GNU/Linux
>>>>>>
>>>>>> Thanks for any help,
>>>>>> Dima.
>>>>>>
>>>>>> P.S. (I can ssh from the client to the server without a password.)
>>>>>>
>>>>>
>>>>> --
>>>>> Andreas Unterkircher
>>>>> IT Department
>>>>> Grid Deployment Group
>>>>> CERN
>>>>> CH-1211 Geneva 23
>>>>>
>>>>
>>>>
>>>> Cheers,
>>>> Dima.
>>>>
>>>
>>> --
>>> Andreas Unterkircher
>>> IT Department
>>> Grid Deployment Group
>>> CERN
>>> CH-1211 Geneva 23
>>
>>
>>
>> *************************************************************
>> * Michel Jouvin Email : [log in to unmask] *
>> * LAL / CNRS Tel : +33 1 64468932 *
>> * B.P. 34 Fax : +33 1 69079404 *
>> * 91898 Orsay Cedex *
>> * France *
>> *************************************************************
>>
>
>
>
> --
> Steve Traylen
>
----
Andreas Gellrich <[log in to unmask]>
DESY IT / Grid Computing
http://www.desy.de/~gellrich