I got the same error each time I tried to configure torque via yaim... As you noticed,
the problem is related to the communication between the CE's torque server and the WN's
torque client (and not to the ssh configuration: in that case you would be able to run
jobs but not to retrieve their output).
Check this on the WN side:
/var/spool/pbs/pbs_server
should contain the pbs server's hostname. If the WN is on a private LAN, check that this
hostname refers to the internal interface. Check/edit the /etc/hosts file as well.
Example (my configuration):

CE side
  CE hostnames:
    ce01.domain   (external interface)
    master.domain (internal interface)
  >> cat /var/spool/pbs/pbs_server
  ce01.domain

WN side (NATted)
  >> cat /var/spool/pbs/pbs_server
  master.domain
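As an illustration of the /etc/hosts side of this, something roughly like the following
(the IP addresses and the worker node name node01.domain are made up for this sketch):

# on the CE
192.0.2.10   ce01.domain    ce01      # external interface
10.0.0.1     master.domain  master    # internal interface
10.0.0.11    node01.domain  node01    # a worker node (hypothetical name)

# on the WN
10.0.0.1     master.domain  master    # the CE's internal interface
10.0.0.11    node01.domain  node01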
Restarting pbs_mom on the WN affects the output of "pbsnodes -a" immediately: edit the
files and restart the pbs client until you get the "free" state for your WN.
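In practice that means something like the following (the init script name may vary with
the torque packaging, and the node entry is hypothetical; it just shows the state you
want to end up with):

on the WN:
>> /etc/init.d/pbs_mom restart

on the CE:
>> pbsnodes -a
node01.domain
     state = free
     np = 2
     properties = lcgpro
     ntype = cluster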
...and of course, you should also check the firewall on the CE and the WN (pbs/torque
needs at least tcp/udp ports 15001-15003 open).
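With iptables, opening those ports could look roughly like this (a sketch only; adapt it
to your local firewall rules):

>> iptables -A INPUT -p tcp --dport 15001:15003 -j ACCEPT
>> iptables -A INPUT -p udp --dport 15001:15003 -j ACCEPT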
Cheers
Vega Forneris
+-----------------------------------------------+
ESA-ESRIN
Unix Systems Administrator
Via Galileo Galilei
00044 Frascati (Rm) - Italy
Phone +39 06 94180581
Mailto: [log in to unmask]
+-----------------------------------------------+
Vitrociset S.p.A.
Unix System Administrator
Via Tiburtina 1020
00100 Roma - Italy
Phone +39 06 8820 4297
Mailto: [log in to unmask]
+-----------------------------------------------+
I am in the process of installing an LCG 2.6.0 CE_torque and WN_torque on a cluster
machine running Rocks 3.3.0 (Makalu). The native torque is 1.0.1p6-1 and is packaged
differently (a single rpm containing everything: the pbs server, the mom, and the other
clients (CLI and GUI)), and it was working fine.
I first installed LCG 2.6.0 "manually" (using various methods: plain rpm, yum, apt) and
ran the yaim configuration (for the dteam VO). The queue system was created properly and
I could submit globus jobs to the lcg queue (dteam). With qmgr I manually added a
"non-default" default queue, which worked fine.
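For reference, such a default queue can be created with qmgr along these lines (the exact
attributes, e.g. the 48-hour walltime, are assumptions for illustration):

% qmgr -c "create queue default queue_type=execution"
% qmgr -c "set queue default resources_default.walltime = 48:00:00"
% qmgr -c "set queue default enabled = true"
% qmgr -c "set queue default started = true"
% qmgr -c "set server default_queue = default"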
Now I wanted to use the yaim install and the meta rpms for the nodes, which contain
nothing but dependencies. I had to remove the native torque and install the
1.0.1p6-11.SL30X.st version, which is split into several rpms.
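The yaim install of the meta rpms is normally done with the install_node script, roughly
like this (the script location and the site-info.def path are from memory and may
differ):

on the CE:
% /opt/lcg/yaim/scripts/install_node /root/site-info.def lcg-CE_torque

on the WN:
% /opt/lcg/yaim/scripts/install_node /root/site-info.def lcg-WN_torque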
On CE_torque:
torque-1.0.1p6-11.SL30X.st
torque-resmom-1.0.1p6-11.SL30X.st
torque-server-1.0.1p6-11.SL30X.st
torque-clients-1.0.1p6-11.SL30X.st
torque-devel-1.0.1p6-11.SL30X.st
lcg-CE_torque-2.6.0-sl3
on WN_torque:
torque-clients-1.0.1p6-11.SL30X.st
torque-1.0.1p6-11.SL30X.st
lcg-WN_torque-2.6.0-sl3
torque-resmom-1.0.1p6-11.SL30X.st
I noticed that the configuration files are not in the same place (/opt/torque/ for the
native torque vs /var/spool/pbs/ for the SL30X packages), so I reran the yaim
configuration for both the CE and the WN and restarted maui, pbs_server and pbs_mom on
the CE, and pbs_mom on the WN.
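In command form, the reconfiguration and restarts were roughly the following (script
paths and service names are approximate and may differ on your system):

on the CE:
% /opt/lcg/yaim/scripts/configure_node /root/site-info.def CE_torque
% service maui restart
% service pbs_server restart
% service pbs_mom restart

on the WN:
% /opt/lcg/yaim/scripts/configure_node /root/site-info.def WN_torque
% service pbs_mom restart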
That's where I got into trouble! It seems that jobs get queued but for some reason never
execute; they stay in the 'Q' state forever.
% qsub -q default test-pbs.sh
1037.grid.bccs.uib.no
% qstat -a

grid.bccs.uib.no:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
1037.grid.bccs. patrickg default  test-pbs.s    --   --  --     -- 48:00 Q   --
The same happens if I submit a globus job:
% globus-job-run grid.bccs.uib.no:2119/jobmanager-lcgpbs -queue dteam
/bin/hostname
% qstat -a

grid.bccs.uib.no:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
1038.grid.bccs. dteam001 dteam    STDIN          --   1  --     -- 48:00 Q   --
Running pbsnodes, I get this output:

compute-0-0.local
     state = state-unknown,down
     np = 2
     properties = lcgpro
     ntype = cluster
Looking at the pbs_mom log (in /var/spool/pbs/mom_logs), it seems it first reads the
config file on the WN (/var/spool/pbs/mom_priv/config):
$clienthost grid.bccs.uib.no
$clienthost grid.local
$clienthost localhost
$clienthost localhost.localdomain
$restricted grid.bccs.uib.no
$restricted grid.local
$logevent 511
$ideal_load 1.6
$max_load 2.1
$usecp grid.bccs.uib.no:/home /home
pbs_mom;Svr;Log;Log opened
pbs_mom;Svr;restricted;grid.bccs.uib.no
[snip]
pbs_mom;Svr;usecp;grid.bccs.uib.no:/home /home
pbs_mom;n/a;initialize;independent
pbs_mom;Svr;pbs_mom;Is up
and then many lines with the same message:
pbs_mom;Svr;pbs_mom;im_eof, End of File from addr 129.177.120.153:15001
In the pbs_server log (in /var/spool/pbs/server_logs), and in the pbs_mom log on the
worker node, there are many lines (about one every three minutes) like:
Premature end of message from addr 129.177.120.153:15001
The maui version installed is 3.2.6p11-2_SL30X:
% rpm -qa|grep maui
maui-server-3.2.6p11-2_SL30X
maui-3.2.6p11-2_SL30X
maui-client-3.2.6p11-2_SL30X
and in /var/log/maui.log on the CE I can see lines like:
INFO: PBS node compute-0-0.local set to state Down (state-unknown,down)
INFO: 0 PBS resources detected on RM base
WARNING: no resources detected
There seems to be a communication problem between the CE and the WN.
Does this torque expect communication through something other than ssh/scp?
The Rocks torque works with ssh, and rshd is disabled. Could that be the problem?
There also seems to be some problem with the packaging, even though:
% rpm -qf /var/spool/pbs/server_priv/acl_svr
torque-server-1.0.1p6-11.SL30X.st
Do you have any idea what is going wrong?
I found on Steve Traylen's site (http://hepunx.rl.ac.uk/~traylens) newer versions of the
torque rpms for sl3 (2.0.0p1-2.sl3.st) than the ones provided by CERN, but with the same
packaging.
Are these test rpms? Can they be used for LCG production? Do you think they will solve my
problem?