You can try putting an entry for your server (CE) name in /etc/hosts
on the WNs, and vice versa (the WN names in /etc/hosts on the CE).
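For example (the IP addresses below are only placeholders based on the
hostnames in your logs; use your machines' real addresses), on each WN
add a line like:

129.177.120.153   grid.bccs.uib.no   grid

and on the CE, one line per WN:

10.1.1.254   compute-0-0.local   compute-0-0

You can then check that the lookup works in both directions with e.g.

% getent hosts 129.177.120.153
129.177.120.153 grid.bccs.uib.no grid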
-----Original Message-----
From: LHC Computer Grid - Rollout
[mailto:[log in to unmask]] On Behalf Of Patrick Guio
Sent: Thursday, December 08, 2005 2:58 PM
To: [log in to unmask]
Subject: Re: [LCG-ROLLOUT] maui/torque trouble
On Thu, 8 Dec 2005 11:53:26 +0500, Sajjad Asghar <[log in to unmask]>
wrote:
Hi Sajjad,
Could you give some details on how to do that?
Sincerely,
Patrick
>Hi Patrick,
>You can also check the reverse lookup from your DNS. Your CE and WN
>should be able to reverse-lookup each other; this is a known problem with
>torque.
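>A quick way to check (using the CE name and the address that appear in
>your logs; do the same for the WN's name and address, on both machines):
>
>% host grid.bccs.uib.no
>% host 129.177.120.153
>
>Each forward lookup should return an address whose reverse lookup gives
>back the same name.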
>
>Regards
>Sajjad Asghar
>-----Original Message-----
>From: LHC Computer Grid - Rollout
>[mailto:[log in to unmask]] On Behalf Of Patrick Guio
>Sent: Wednesday, December 07, 2005 6:53 PM
>To: [log in to unmask]
>Subject: [LCG-ROLLOUT] maui/torque trouble
>
>Dear Support,
>
>I am in the process of installing an LCG2.6.0 CE_torque and WN_torque on
>a cluster machine running Rocks 3.3.0 (Makalu).
>The native torque is 1.0.1p6-1 and is packaged differently: a single rpm
>containing everything (pbs_server, pbs_mom, and the CLI and GUI clients).
>It was working fine.
>
>I first installed LCG2.6.0 "manually" (using various solutions: pure rpm,
>yum, apt), then ran the yaim configuration (for the dteam VO). The queue
>system was created properly and I could submit Globus jobs to the lcg
>queue (dteam). With qmgr I manually added a "non-default" default queue,
>which worked fine.
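>(Roughly like this, quoting the commands from memory:
>% qmgr -c "create queue default"
>% qmgr -c "set queue default queue_type = Execution"
>% qmgr -c "set queue default enabled = True"
>% qmgr -c "set queue default started = True"
>% qmgr -c "set server default_queue = default"
>)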
>
>Now I wanted to use the yaim install and the meta rpms for nodes that
>contain nothing but dependencies. I had to remove the native torque and
>install the 1.0.1p6-11.SL30X.st version, which is packaged as several
>rpms.
>
>On CE_torque:
>torque-1.0.1p6-11.SL30X.st
>torque-resmom-1.0.1p6-11.SL30X.st
>torque-server-1.0.1p6-11.SL30X.st
>torque-clients-1.0.1p6-11.SL30X.st
>torque-devel-1.0.1p6-11.SL30X.st
>lcg-CE_torque-2.6.0-sl3
>
>On WN_torque:
>torque-clients-1.0.1p6-11.SL30X.st
>torque-1.0.1p6-11.SL30X.st
>lcg-WN_torque-2.6.0-sl3
>torque-resmom-1.0.1p6-11.SL30X.st
>
>I noticed that the configuration files are not in the same place
>(/opt/torque/ for the native torque vs /var/spool/pbs/ for SL30X), so I
>reran the yaim configuration for both CE and WN and restarted maui,
>pbs_server and pbs_mom on the CE, and pbs_mom on the WN.
>That's where I got into trouble! Jobs seem to get queued but for some
>reason never execute; they stay in the 'Q' state forever.
>% qsub -q default test-pbs.sh
>1037.grid.bccs.uib.no
>% qstat -a
>grid.bccs.uib.no:
>                                                            Req'd  Req'd   Elap
>Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
>--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
>1037.grid.bccs. patrickg default  test-pbs.s    --   --  --    --  48:00 Q   --
>
>The same happens if I submit a Globus job:
>% globus-job-run grid.bccs.uib.no:2119/jobmanager-lcgpbs -queue dteam
>/bin/hostname
>% qstat -a
>grid.bccs.uib.no:
>                                                            Req'd  Req'd   Elap
>Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
>--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
>1038.grid.bccs. dteam001 dteam    STDIN         --    1  --    --  48:00 Q   --
>
>
>Running pbsnodes, I get output like this:
>compute-0-0.local
> state = state-unknown,down
> np = 2
> properties = lcgpro
> ntype = cluster
>
>
>Looking at the pbs_mom log (in /var/spool/pbs/mom_logs), it seems pbs_mom
>first reads the config file on the WN (/var/spool/pbs/mom_priv/config):
>$clienthost grid.bccs.uib.no
>$clienthost grid.local
>$clienthost localhost
>$clienthost localhost.localdomain
>$restricted grid.bccs.uib.no
>$restricted grid.local
>$logevent 511
>$ideal_load 1.6
>$max_load 2.1
>$usecp grid.bccs.uib.no:/home /home
>
>pbs_mom;Svr;Log;Log opened
>pbs_mom;Svr;restricted;grid.bccs.uib.no
>[snip]
>pbs_mom;Svr;usecp;grid.bccs.uib.no:/home /home
>pbs_mom;n/a;initialize;independent
>pbs_mom;Svr;pbs_mom;Is up
>
>and then many lines with the same message:
>pbs_mom;Svr;pbs_mom;im_eof, End of File from addr 129.177.120.153:15001
>
>In the pbs_server log (in /var/spool/pbs/server_logs), there are lines:
>
>Type disconnect request received from [log in to unmask], sock=9
>Type statusqueue request received from [log in to unmask], sock=9
>Type statusjob request received from [log in to unmask], sock=9
>
>On the worker node, the pbs_mom log contains many lines like this (about
>every three minutes):
>Premature end of message from addr 129.177.120.153:15001
>
>The maui version installed is 3.2.6p11-2_SL30X:
>% rpm -qa|grep maui
>maui-server-3.2.6p11-2_SL30X
>maui-3.2.6p11-2_SL30X
>maui-client-3.2.6p11-2_SL30X
>and in /var/log/maui.log on the CE I can see lines like:
>INFO: PBS node compute-0-0.local set to state Down
>(state-unknown,down)
>INFO: 0 PBS resources detected on RM base
>WARNING: no resources detected
>
>There seems to be a communication problem between the CE and the WN.
>Does torque expect to communicate through something other than ssh/scp?
>The Rocks torque works with ssh, and rshd is disabled.
>Could that be the problem?
>
>Also, there seems to be some problem with the packaging:
>
>% rpm -qa | grep ^torque | xargs rpm -qV
>S.5....T c /etc/sysconfig/pbs
>..?..... /usr/sbin/pbs_mom
>..?..... /usr/sbin/pbs_server
>missing /var/spool/pbs/server_priv/accounting
>missing /var/spool/pbs/server_priv/acl_groups
>missing /var/spool/pbs/server_priv/acl_hosts
>missing /var/spool/pbs/server_priv/acl_svr
>missing /var/spool/pbs/server_priv/acl_users
>missing /var/spool/pbs/server_priv/jobs
>missing /var/spool/pbs/server_priv/queues
>S.5....T c /var/spool/pbs/server_name
>
>even though
>% rpm -qf /var/spool/pbs/server_priv/acl_svr
>torque-server-1.0.1p6-11.SL30X.st
>
>Do you have any idea what is going wrong?
>
>At Steve Traylen's site (http://hepunx.rl.ac.uk/~traylens) I found newer
>versions of the torque rpms for SL3 (2.0.0p1-2.sl3.st) than the ones
>provided by CERN, but with the same packaging.
>
>Are these test rpms? Can they be used for LCG production? Do you think
>they would solve my problem?
>
>Any help is appreciated
>
>Sincerely,
>
>Patrick