Hi Patrick,
Does your ssh host-based authentication work?
Can you ssh to a WN, become user atlas001 (su - atlas001) and then ssh back
to the CE and log in WITHOUT any password prompt?
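For example, with the hostnames and pool account below only as placeholders
(adapt them to your site):

ssh wn001.yourdomain            # from the CE, log on to a WN
su - atlas001                   # become a pool account
ssh ce.yourdomain hostname      # should print the CE name, no password asked
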
If not, do on CE:
for i in `cat wn-list.conf`; do
ssh-keyscan -t rsa $i.<your domain> >>/etc/ssh/ssh_known_hosts
done
where $i.<your domain> is the full name of each WN and wn-list.conf is
the list of WNs (short names)
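Afterwards you can quickly check that the keys really ended up in the file,
e.g. (WN name again only an example):

grep wn001.yourdomain /etc/ssh/ssh_known_hosts
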
It might help.
Regards,
Dan
Patrick Guio wrote:
> Dear Support,
>
> I am in the process of installing an LCG2.6.0 CE_torque and WN_torque
> on a cluster machine running Rocks 3.3.0 (Makalu).
> The native torque is 1.0.1p6-1 and is packaged differently (a single
> rpm containing everything: the pbs server, the mom daemon, and the cli
> and gui clients); it was working fine.
>
> I first installed LCG2.6.0 "manually" (using various methods: plain
> rpm, yum, apt) and ran the yaim configure (for the dteam VO). The queue
> system was created properly and I could submit globus jobs to the lcg
> queue (dteam). With qmgr I manually added a "non-default" default
> queue, which worked fine.
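>
> For reference, the queue was created with qmgr commands roughly like these
> (from memory; only the essential attributes shown):
> % qmgr -c "create queue default queue_type=execution"
> % qmgr -c "set queue default enabled = true"
> % qmgr -c "set queue default started = true"
> % qmgr -c "set server default_queue = default"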
>
> Now I wanted to use the yaim install and the meta rpms for nodes that
> contain nothing but dependencies. I had to remove the native torque and
> install the 1.0.1p6-11.SL30X.st version, which is split into several
> rpms.
>
> On CE_torque:
> torque-1.0.1p6-11.SL30X.st
> torque-resmom-1.0.1p6-11.SL30X.st
> torque-server-1.0.1p6-11.SL30X.st
> torque-clients-1.0.1p6-11.SL30X.st
> torque-devel-1.0.1p6-11.SL30X.st
> lcg-CE_torque-2.6.0-sl3
>
> on WN_torque:
> torque-clients-1.0.1p6-11.SL30X.st
> torque-1.0.1p6-11.SL30X.st
> lcg-WN_torque-2.6.0-sl3
> torque-resmom-1.0.1p6-11.SL30X.st
>
> I noticed that the configuration files are not in the same place
> (/opt/torque/ for the native torque vs /var/spool/pbs/ for sl30x), so I
> reran the yaim configuration for both CE and WN and restarted maui,
> pbs_server and pbs_mom on the CE, and pbs_mom on the WN.
> That's where I get into trouble! Jobs get queued but for some reason
> never execute; they stay in the 'Q' state forever.
> % qsub -q default test-pbs.sh
> 1037.grid.bccs.uib.no
> % qstat -a
> grid.bccs.uib.no:
>                                                             Req'd  Req'd   Elap
> Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
> --------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
> 1037.grid.bccs. patrickg default  test-pbs.s    --   --  --    --  48:00 Q   --
>
> The same happens if I submit a globus job:
> % globus-job-run grid.bccs.uib.no:2119/jobmanager-lcgpbs -queue dteam
> /bin/hostname
> % qstat -a
> grid.bccs.uib.no:
>                                                             Req'd  Req'd   Elap
> Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
> --------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
> 1038.grid.bccs. dteam001 dteam    STDIN         --    1  --    --  48:00 Q   --
>
>
> Running pbsnodes, I get output like this:
> compute-0-0.local
> state = state-unknown,down
> np = 2
> properties = lcgpro
> ntype = cluster
>
>
> Looking at the pbs_mom log (in /var/spool/pbs/mom_logs), it seems it
> first reads the config file on the WN
> (/var/spool/pbs/mom_priv/config):
> $clienthost grid.bccs.uib.no
> $clienthost grid.local
> $clienthost localhost
> $clienthost localhost.localdomain
> $restricted grid.bccs.uib.no
> $restricted grid.local
> $logevent 511
> $ideal_load 1.6
> $max_load 2.1
> $usecp grid.bccs.uib.no:/home /home
>
> pbs_mom;Svr;Log;Log opened
> pbs_mom;Svr;restricted;grid.bccs.uib.no
> [snip]
> pbs_mom;Svr;usecp;grid.bccs.uib.no:/home /home
> pbs_mom;n/a;initialize;independent
> pbs_mom;Svr;pbs_mom;Is up
>
> and then many lines with the same message:
> pbs_mom;Svr;pbs_mom;im_eof, End of File from addr 129.177.120.153:15001
>
> In the pbs_server log (in /var/spool/pbs/server_logs), there are lines:
>
> Type disconnect request received from [log in to unmask], sock=9
> Type statusqueue request received from [log in to unmask], sock=9
> Type statusjob request received from [log in to unmask], sock=9
>
> In the worker node's pbs_mom log there are many lines like this (about
> every three minutes):
> Premature end of message from addr 129.177.120.153:15001
>
> The maui version running is 3.2.6p11-2_SL30X:
> % rpm -qa|grep maui
> maui-server-3.2.6p11-2_SL30X
> maui-3.2.6p11-2_SL30X
> maui-client-3.2.6p11-2_SL30X
> and in /var/log/maui.log on the CE I can see lines like:
> INFO: PBS node compute-0-0.local set to state Down
> (state-unknown,down)
> INFO: 0 PBS resources detected on RM base
> WARNING: no resources detected
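>
> (maui's own view of the nodes can also be queried with its client tools:
> % diagnose -n
> I can send that output too if it is of any use.)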
>
> There seems to be a communication problem between the CE and the WN.
> Does this torque expect communication through something other than
> ssh/scp? The Rocks torque works with ssh, and rshd is disabled.
> Could that be the problem?
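>
> (Beyond ssh/scp, is plain TCP reachability of the pbs ports worth checking?
> I have in mind something like
> % telnet compute-0-0.local 15002   (from the CE, towards the WN's pbs_mom)
> % telnet grid.bccs.uib.no 15001    (from the WN, towards the CE's pbs_server)
> where 15001/15002 are, as far as I understand, the default pbs ports.)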
>
> Also there seems to be some problem with the packaging:
>
> % rpm -qa | grep ^torque | xargs rpm -qV
> S.5....T c /etc/sysconfig/pbs
> ..?..... /usr/sbin/pbs_mom
> ..?..... /usr/sbin/pbs_server
> missing /var/spool/pbs/server_priv/accounting
> missing /var/spool/pbs/server_priv/acl_groups
> missing /var/spool/pbs/server_priv/acl_hosts
> missing /var/spool/pbs/server_priv/acl_svr
> missing /var/spool/pbs/server_priv/acl_users
> missing /var/spool/pbs/server_priv/jobs
> missing /var/spool/pbs/server_priv/queues
> S.5....T c /var/spool/pbs/server_name
>
> even though
> % rpm -qf /var/spool/pbs/server_priv/acl_svr
> torque-server-1.0.1p6-11.SL30X.st
>
> Do you have any idea what is going wrong?
>
> I found on Steve Traylen's site (http://hepunx.rl.ac.uk/~traylens)
> newer versions of the torque rpms for sl3 (2.0.0p1-2.sl3.st) than the
> ones provided by CERN, but still with the same packaging.
>
> Are these test rpms? Can they be used for LCG production? Do you think
> they would solve my problem?
>
> Any help is appreciated
>
> Sincerely,
>
> Patrick
>