Hi,
I'm out of ideas and need help with this topic:
I have a problem with the PPS-glite-WN gLite 3.2 installation.
The batch server and the lcg-CE are installed on two different machines.
The problem has been traced to the communication between the batch server
and the node. As the grid user on the server, I run qsub -I -q
<queue> and see the job in the queued state with qstat; then, after a few
minutes, the node to which the job was sent becomes "down", and on the node:
/etc/init.d/pbs_mom status
pbs_mom dead but subsys locked
(without jobs, pbs_mom is "running").
The messages in /var/spool/pbs/server_logs on the server side are:
07/09/2009 16:28:27;0008;PBS_Server;Job;658.tb021.desy.de;send of job to tb019.desy.de failed error = 15031
07/09/2009 16:28:27;0001;PBS_Server;Svr;PBS_Server;Batch protocol error (15031) in send_job, child failed in previous commit request for job 658.tb021.desy.de
07/09/2009 16:28:27;0008;PBS_Server;Job;658.tb021.desy.de;unable to run job, MOM rejected/rc=1
07/09/2009 16:28:27;0080;PBS_Server;Req;req_reject;Reject reply code=15041(Execution server rejected request MSG=cannot send job to mom, state=PRERUN), aux=0, type=RunJob, from [log in to unmask]
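For anyone who wants to pull the error codes out of a longer server log, here is a minimal sketch. It assumes each record is one line of six semicolon-separated fields (timestamp, event code, daemon, object type, object name, message), as in the excerpt above, and that the error codes are five-digit numbers starting with 15. The names LogRecord, parse_record and error_codes are made up for this example and are not part of Torque.

```python
import re
from typing import List, NamedTuple, Optional


class LogRecord(NamedTuple):
    """One server log record, split into its semicolon-separated fields."""
    timestamp: str
    event: str
    daemon: str
    obj_type: str
    obj_name: str
    message: str


def parse_record(line: str) -> Optional[LogRecord]:
    """Split a log line into six fields; return None if it does not fit."""
    # Split at most 5 times so semicolons inside the message are preserved.
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    return LogRecord(*parts)


def error_codes(message: str) -> List[int]:
    """Return all 15xxx-style numbers found in a message (assumed format)."""
    return [int(code) for code in re.findall(r"\b15\d{3}\b", message)]


sample = (
    "07/09/2009 16:28:27;0008;PBS_Server;Job;658.tb021.desy.de;"
    "send of job to tb019.desy.de failed error = 15031"
)
record = parse_record(sample)
if record is not None:
    print(record.obj_name, error_codes(record.message))
```

Records that were wrapped by a mail client have to be joined back into single lines before they are passed to parse_record.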
When I connect a PPS WN from the gLite 3.1 release to this server,
everything works fine.
Details:
server:
Scientific Linux SL release 4.7 (Beryllium)
PPS-glite-TORQUE_server-3.1.9-0
PPS-glite-TORQUE_utils-3.1.12-0
torque-devel-2.3.6-1cri.slc4
torque-drmaa-2.3.6-1cri.slc4
glite-yaim-torque-utils-4.0.3-1
torque-client-2.3.6-1cri.slc4
torque-server-2.3.6-1cri.slc4
torque-drmaa-docs-2.3.6-1cri.slc4
glite-yaim-torque-server-4.0.3-2
torque-2.3.6-1cri.slc4
torque-docs-2.3.6-1cri.slc4
Linux tb021 2.6.18-128.1.6.el5xen #1 SMP Wed Apr 1 07:21:08 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
5.3 client:
Scientific Linux SL release 5.3 (Boron)
PPS-glite-WN-version-3.2.3-0
PPS-glite-TORQUE_client-3.2.1-0
torque-mom-2.3.0-snap.200801151629.2cri.sl5
torque-client-2.3.0-snap.200801151629.2cri.sl5
torque-2.3.0-snap.200801151629.2cri.sl5
glite-yaim-torque-client-4.0.1-1
Linux tb019.desy.de 2.6.18-128.1.14.el5 #1 SMP Tue Jun 16 18:47:37 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
4.7 client:
Scientific Linux SL release 4.7 (Beryllium)
PPS-glite-TORQUE_client-3.1.8-0.i386
PPS-glite-WN-3.1.35-0.i386
torque-docs-2.3.6-1cri.slc4.i386
torque-client-2.3.6-1cri.slc4.i386
glite-yaim-torque-client-4.0.2-1.noarch
torque-mom-2.3.6-1cri.slc4.i386
torque-pam-2.3.6-1cri.slc4.i386
torque-devel-2.3.6-1cri.slc4.i386
torque-2.3.6-1cri.slc4.i386
Linux tb018.desy.de 2.6.9-78.0.22.ELsmp #1 SMP Thu Apr 30 23:30:54 CDT 2009 i686 i686 i386 GNU/Linux
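To make the package lists above easier to compare, here is a small sketch that extracts the upstream Torque version from one package name per host and reports which worker nodes differ from the server. The package names are copied from the lists above; the helper torque_version and the host labels are made up for this example. Whether the version difference has anything to do with the failure has not been confirmed.

```python
import re
from typing import Dict, List, Optional

# One representative Torque package per host, taken from the lists above.
packages: Dict[str, str] = {
    "tb021 (server)": "torque-server-2.3.6-1cri.slc4",
    "tb019 (SL 5.3 WN)": "torque-mom-2.3.0-snap.200801151629.2cri.sl5",
    "tb018 (SL 4.7 WN)": "torque-mom-2.3.6-1cri.slc4.i386",
}


def torque_version(package: str) -> Optional[str]:
    """Return the x.y.z version embedded in an RPM-style package name."""
    match = re.search(r"-(\d+\.\d+\.\d+)-", package)
    return match.group(1) if match else None


server_version = torque_version(packages["tb021 (server)"])

# Hosts whose Torque version differs from the one on the server.
mismatched: List[str] = [
    host
    for host, package in packages.items()
    if torque_version(package) != server_version
]

for host, package in packages.items():
    print(host, torque_version(package))
print("differs from server:", mismatched)
```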
Thanks for any help,
Dima.
P.S. I can ssh from the client to the server without a password.