Hi Patrick,
You can also check the reverse lookup from your DNS. Your CE and WN
should be able to reverse-resolve each other; this is a known problem
with torque.
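A quick way to verify this from each node is something like the sketch below (assuming `getent` and `awk` are available; the hostnames in the comments are the ones from this thread):

```shell
# check_rdns NAME: forward-resolve NAME, then reverse-resolve the address,
# and print the round trip so a mismatch is easy to spot.
check_rdns() {
    ip=$(getent hosts "$1" | awk '{print $1; exit}')
    [ -n "$ip" ] || { echo "$1: no forward lookup"; return 1; }
    name=$(getent hosts "$ip" | awk '{print $2; exit}')
    echo "$1 -> $ip -> ${name:-no reverse lookup}"
}
# Run it on the CE with the WN's name, and on the WN with the CE's name, e.g.:
#   check_rdns compute-0-0.local
#   check_rdns grid.bccs.uib.no
```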

Regards
Sajjad Asghar
-----Original Message-----
From: LHC Computer Grid - Rollout
[mailto:[log in to unmask]] On Behalf Of Patrick Guio
Sent: Wednesday, December 07, 2005 6:53 PM
To: [log in to unmask]
Subject: [LCG-ROLLOUT] maui/torque trouble

Dear Support,

I am in the process of installing an LCG 2.6.0 CE_torque and WN_torque on
a cluster running Rocks 3.3.0 (Makalu).
The native torque is 1.0.1p6-1 and is packaged differently (a single rpm
containing everything: the pbs server, the mom, and the other clients,
both CLI and GUI), and it was working fine.

I first installed LCG 2.6.0 "manually" (using various methods: plain
rpm, yum, apt), then ran the yaim configuration (for the dteam VO). The
queue system was created properly and I could submit Globus jobs to the
LCG queue (dteam). With qmgr I manually added a "non-default" default
queue, which also worked fine.
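For reference, creating and enabling such a default queue with qmgr looks roughly like this (queue name and attributes are illustrative, not my exact commands; run as root on the CE):

```
qmgr -c "create queue default queue_type=execution"
qmgr -c "set queue default enabled = true"
qmgr -c "set queue default started = true"
qmgr -c "set server default_queue = default"
```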

Now I wanted to use the yaim install and the meta rpms for nodes, which
contain nothing but dependencies. I had to remove the native torque and
install the 1.0.1p6-11.SL30X.st version, which is packaged as several
rpms.

On CE_torque:
torque-1.0.1p6-11.SL30X.st
torque-resmom-1.0.1p6-11.SL30X.st
torque-server-1.0.1p6-11.SL30X.st
torque-clients-1.0.1p6-11.SL30X.st
torque-devel-1.0.1p6-11.SL30X.st
lcg-CE_torque-2.6.0-sl3

On WN_torque:
torque-clients-1.0.1p6-11.SL30X.st
torque-1.0.1p6-11.SL30X.st
lcg-WN_torque-2.6.0-sl3
torque-resmom-1.0.1p6-11.SL30X.st

I noticed that the configuration files are not in the same place
(/opt/torque/ for the native torque vs /var/spool/pbs/ for sl30x), so I
re-ran the yaim configuration for both CE and WN and restarted maui,
pbs_server and pbs_mom on the CE, and pbs_mom on the WN.
That's where I get into trouble! It seems that jobs get queued but for
some reason never execute; they stay in the 'Q' state forever.
% qsub -q default test-pbs.sh
1037.grid.bccs.uib.no
% qstat -a
grid.bccs.uib.no:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
1037.grid.bccs. patrickg default  test-pbs.s    --   --  --    --  48:00 Q   --

The same happens if I submit a Globus job:
% globus-job-run grid.bccs.uib.no:2119/jobmanager-lcgpbs -queue dteam /bin/hostname
% qstat -a
grid.bccs.uib.no:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
1038.grid.bccs. dteam001 dteam    STDIN         --    1  --    --  48:00 Q   --


Running pbsnodes, I get output like this:
compute-0-0.local
      state = state-unknown,down
      np = 2
      properties = lcgpro
      ntype = cluster


Looking at the pbs_mom log (in /var/spool/pbs/mom_logs), it seems the mom
first reads the config file on the WN (/var/spool/pbs/mom_priv/config):
$clienthost grid.bccs.uib.no
$clienthost grid.local
$clienthost localhost
$clienthost localhost.localdomain
$restricted grid.bccs.uib.no
$restricted grid.local
$logevent 511
$ideal_load 1.6
$max_load 2.1
$usecp grid.bccs.uib.no:/home /home
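As far as I understand the standard mom directives, they mean roughly the following (the comments are my annotations, not part of the actual file):

```
$clienthost grid.bccs.uib.no          # hosts trusted to send this mom work
$restricted grid.bccs.uib.no          # hosts allowed restricted (query-only) access
$logevent 511                         # bitmask of event classes to log
$ideal_load 1.6                       # below this load the node reports itself free
$max_load 2.1                         # above this load the node stops taking jobs
$usecp grid.bccs.uib.no:/home /home   # stage files under /home with cp, not rcp/scp
```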

pbs_mom;Svr;Log;Log opened
pbs_mom;Svr;restricted;grid.bccs.uib.no
[snip]
pbs_mom;Svr;usecp;grid.bccs.uib.no:/home /home
pbs_mom;n/a;initialize;independent
pbs_mom;Svr;pbs_mom;Is up

and then many lines with the same message:
pbs_mom;Svr;pbs_mom;im_eof, End of File from addr 129.177.120.153:15001

In the pbs_server log (in /var/spool/pbs/server_logs), there are lines:

Type disconnect request received from [log in to unmask], sock=9
Type statusqueue request received from [log in to unmask], sock=9
Type statusjob request received from [log in to unmask], sock=9

On the worker node, the pbs_mom log contains many lines (about every
three minutes):
Premature end of message from addr 129.177.120.153:15001

The maui version is 3.2.6p11-2_SL30X:
% rpm -qa|grep maui
maui-server-3.2.6p11-2_SL30X
maui-3.2.6p11-2_SL30X
maui-client-3.2.6p11-2_SL30X
and in /var/log/maui.log on the CE I can see lines like:
INFO:     PBS node compute-0-0.local set to state Down (state-unknown,down)
INFO:     0 PBS resources detected on RM base
WARNING:  no resources detected

There seems to be a problem of communication between the CE and the WN.
Does torque expect to communicate through something other than ssh/scp?
The Rocks torque works with ssh, and rshd is disabled.
Could that be the problem?
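For what it's worth, a quick way to check whether the daemons can reach each other at all on torque's own TCP ports (15001-15004 by default for pbs_server, pbs_mom, the mom resource monitor and the scheduler, if I understand correctly) is a sketch using bash's built-in /dev/tcp, so no nc is needed:

```shell
# check_port HOST PORT: probe a TCP port; prints "open" or "unreachable".
check_port() {
    if (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null; then
        echo "$1:$2 open"
    else
        echo "$1:$2 unreachable"
    fi
}
# From the WN, e.g.:  check_port grid.bccs.uib.no 15001    # pbs_server
# From the CE, e.g.:  check_port compute-0-0.local 15002   # pbs_mom
```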

Also, there seems to be some problem with the packaging:

% rpm -qa | grep ^torque | xargs rpm -qV
S.5....T c /etc/sysconfig/pbs
..?.....   /usr/sbin/pbs_mom
..?.....   /usr/sbin/pbs_server
missing    /var/spool/pbs/server_priv/accounting
missing    /var/spool/pbs/server_priv/acl_groups
missing    /var/spool/pbs/server_priv/acl_hosts
missing    /var/spool/pbs/server_priv/acl_svr
missing    /var/spool/pbs/server_priv/acl_users
missing    /var/spool/pbs/server_priv/jobs
missing    /var/spool/pbs/server_priv/queues
S.5....T c /var/spool/pbs/server_name

even though
% rpm -qf /var/spool/pbs/server_priv/acl_svr
torque-server-1.0.1p6-11.SL30X.st

Do you have any idea what is going wrong?

I found on Steve Traylen's site (http://hepunx.rl.ac.uk/~traylens) newer
versions of the torque rpms for SL3 (2.0.0p1-2.sl3.st) than the ones
provided by CERN, but with the same packaging.

Are these test rpms? Can they be used for LCG production? Do you think
they will solve my problem?

Any help is appreciated.

Sincerely,

Patrick