Hi Patrick,

I got the same error each time I tried to configure torque via yaim. As you noticed, the problem is related to the communication between the CE's torque server and the WN's torque client (and not to the ssh configuration: in that case you would be able to run jobs but not to retrieve output).

Check this on the WN side:
/var/spool/pbs/pbs_server should contain the pbs server's hostname. If the WN is on a private LAN, check that this hostname refers to the name of the internal interface. Also check/edit the /etc/hosts file.

example (my configuration):
CE side
CE hostname = ce01.domain   (==> external interface)
              master.domain (==> internal interface)
>> cat /var/spool/pbs/pbs_server
ce01.domain

WN side (NATted)
>> cat /var/spool/pbs/pbs_server
master.domain
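
The /etc/hosts on the WN should resolve the internal name as well; a sketch of what it might look like (the addresses and the wn01 name here are just an example, not my real setup):

>> cat /etc/hosts
127.0.0.1      localhost localhost.localdomain
192.168.0.1    master.domain master    # CE, internal interface
192.168.0.11   wn01.domain wn01        # this WN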

Restarting pbs_mom on the WN shows up immediately in the output of "pbsnodes -a": edit/modify the files and restart the pbs client until you get the "free" state for your WN.
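
For the restart/check cycle, something like this (init script path as shipped by the SL3 torque rpms; adjust if yours differ):

WN side:
>> /etc/init.d/pbs_mom restart
CE side:
>> pbsnodes -a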

...of course, you should also check the firewalls on the CE and the WN (pbs/torque needs at least tcp/udp ports 15001-15003 open).
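
If iptables is running, rules along these lines should be enough (just a sketch; adapt them to your site policy):

>> iptables -I INPUT -p tcp --dport 15001:15003 -j ACCEPT
>> iptables -I INPUT -p udp --dport 15001:15003 -j ACCEPT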

Cheers

Vega Forneris

+-----------------------------------------------+
ESA-ESRIN
Unix Systems Administrator
Via Galileo Galilei
00044 Frascati (Rm) - Italy
Phone +39 06 94180581
Mailto: [log in to unmask]
+-----------------------------------------------+
Vitrociset S.p.A.
Unix System Administrator
Via Tiburtina 1020
00100 Roma - Italy
Phone +39 06 8820 4297    
Mailto: [log in to unmask]
+-----------------------------------------------+



Patrick Guio <[log in to unmask]>
Sent by: LHC Computer Grid - Rollout <[log in to unmask]>

07/12/2005 14:52
Please respond to
LHC Computer Grid - Rollout <[log in to unmask]>

To
[log in to unmask]
cc
Subject
[LCG-ROLLOUT] maui/torque trouble

Dear Support,

I am in the process of installing an LCG 2.6.0 CE_torque and WN_torque on a
cluster running Rocks 3.3.0 (Makalu).
The native torque is 1.0.1p6-1 and is packaged differently (a single rpm
containing everything: the pbs server, the mom, and the other clients, both
CLI and GUI), and it was working fine.

I first installed LCG 2.6.0 "manually" (using various methods: pure rpm,
yum, apt) and ran the yaim configuration (for the dteam VO). The queue system
was created properly and I could submit globus jobs to the lcg queue (dteam).
With qmgr I manually added a "non-default" default queue, which worked fine.
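
For the record, I created it with qmgr commands roughly like these (from memory, only indicative):

% qmgr -c "create queue default queue_type=execution"
% qmgr -c "set queue default enabled = true"
% qmgr -c "set queue default started = true"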

Now I wanted to use the yaim install and the meta rpms for the nodes, which
contain nothing but dependencies. I had to remove the native torque and
install the 1.0.1p6-11.SL30X.st version, which is packaged as several rpms.

On CE_torque:
torque-1.0.1p6-11.SL30X.st
torque-resmom-1.0.1p6-11.SL30X.st
torque-server-1.0.1p6-11.SL30X.st
torque-clients-1.0.1p6-11.SL30X.st
torque-devel-1.0.1p6-11.SL30X.st
lcg-CE_torque-2.6.0-sl3

On WN_torque:
torque-clients-1.0.1p6-11.SL30X.st
torque-1.0.1p6-11.SL30X.st
lcg-WN_torque-2.6.0-sl3
torque-resmom-1.0.1p6-11.SL30X.st

I noticed that the configuration files are not in the same place (/opt/torque/
for the native torque vs /var/spool/pbs/ for SL30X), so I reran the yaim
configuration for both CE and WN and restarted maui, pbs_server and pbs_mom on
the CE, and pbs_mom on the WN.
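
The commands were roughly as follows (the site-info.def path here is just an example, and on the WN I used WN_torque instead of CE_torque):

% /opt/lcg/yaim/scripts/configure_node /root/site-info.def CE_torque
% service maui restart
% service pbs_server restart
% service pbs_mom restart
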
That's where I get into trouble! It seems that jobs get queued but for some
reason never execute. They stay in the 'Q' state forever.
% qsub -q default test-pbs.sh
1037.grid.bccs.uib.no
% qstat -a
grid.bccs.uib.no:
                                                            Req'd  Req'd Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
1037.grid.bccs. patrickg default  test-pbs.s    --   --  --    --  48:00 Q   --

The same happens if I submit a globus job:
% globus-job-run grid.bccs.uib.no:2119/jobmanager-lcgpbs -queue dteam /bin/hostname
% qstat -a
grid.bccs.uib.no:
                                                            Req'd  Req'd Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
1038.grid.bccs. dteam001 dteam    STDIN         --    1  --    --  48:00 Q   --


Running pbsnodes, I get output like this:
compute-0-0.local
     state = state-unknown,down
     np = 2
     properties = lcgpro
     ntype = cluster


Looking at the pbs_mom log (in /var/spool/pbs/mom_logs), it seems it first
reads the config file on the WN (/var/spool/pbs/mom_priv/config):
$clienthost grid.bccs.uib.no
$clienthost grid.local
$clienthost localhost
$clienthost localhost.localdomain
$restricted grid.bccs.uib.no
$restricted grid.local
$logevent 511
$ideal_load 1.6
$max_load 2.1
$usecp grid.bccs.uib.no:/home /home

pbs_mom;Svr;Log;Log opened
pbs_mom;Svr;restricted;grid.bccs.uib.no
[snip]
pbs_mom;Svr;usecp;grid.bccs.uib.no:/home /home
pbs_mom;n/a;initialize;independent
pbs_mom;Svr;pbs_mom;Is up

and then many lines with the same message:
pbs_mom;Svr;pbs_mom;im_eof, End of File from addr 129.177.120.153:15001

In the pbs_server log (in /var/spool/pbs/server_logs), there are lines:

Type disconnect request received from [log in to unmask], sock=9
Type statusqueue request received from [log in to unmask], sock=9
Type statusjob request received from [log in to unmask], sock=9

In the worker node's pbs_mom log there are many lines (about every three
minutes):
Premature end of message from addr 129.177.120.153:15001

The maui version is 3.2.6p11-2_SL30X:
% rpm -qa|grep maui
maui-server-3.2.6p11-2_SL30X
maui-3.2.6p11-2_SL30X
maui-client-3.2.6p11-2_SL30X
and in /var/log/maui.log on the CE I can see lines like:
INFO:     PBS node compute-0-0.local set to state Down (state-unknown,down)
INFO:     0 PBS resources detected on RM base
WARNING:  no resources detected

There seems to be a communication problem between the CE and the WN.
Does torque expect communication through something other than ssh/scp?
The Rocks torque works with ssh, and rshd is disabled.
Could that be the problem?

There also seems to be some problem with the packaging:

% rpm -qa | grep ^torque | xargs rpm -qV
S.5....T c /etc/sysconfig/pbs
..?.....   /usr/sbin/pbs_mom
..?.....   /usr/sbin/pbs_server
missing    /var/spool/pbs/server_priv/accounting
missing    /var/spool/pbs/server_priv/acl_groups
missing    /var/spool/pbs/server_priv/acl_hosts
missing    /var/spool/pbs/server_priv/acl_svr
missing    /var/spool/pbs/server_priv/acl_users
missing    /var/spool/pbs/server_priv/jobs
missing    /var/spool/pbs/server_priv/queues
S.5....T c /var/spool/pbs/server_name

even though
% rpm -qf /var/spool/pbs/server_priv/acl_svr
torque-server-1.0.1p6-11.SL30X.st

Do you have any idea what is going wrong?

At Steve Traylen's site (http://hepunx.rl.ac.uk/~traylens) I found newer
versions of the torque rpms for SL3 (2.0.0p1-2.sl3.st) than those provided by
CERN, but with the same packaging.

Are these test rpms? Can they be used for LCG production? Do you think they
would solve my problem?

Any help is appreciated.

Sincerely,

Patrick