You can also use the WN-rpm and CE-rpm; they do not include torque/pbs.
- --
Laird Louis Poncet
Where: Bat28-R-003 CERN
CH-1211 Geneve 23
Mail : [log in to unmask]
Phone: +41(0)227.674.231
LAL / IN2P3 / CNRS / CERN
Problem >> RTFM then google it !
On 7 Dec 05, at 16:24, Patrick Guio wrote:
> On Wed, 7 Dec 2005 16:23:33 +0200, Dan Schrager
> <[log in to unmask]> wrote:
>
> Hi Dan,
>
> Yes, I can ssh to a WN, become user dteam001, and then ssh back to
> the CE as this user without any password prompt.
>
> Cheers, Patrick
>
>
>> Hi Patrick,
>>
>> Does your ssh host-based authentication work?
>> Can you ssh to a WN, become user (su -) atlas001, and then ssh to
>> the CE and log in WITHOUT any password prompt?
>>
>> If not, do on CE:
>>
>> for i in `cat wn-list.conf`; do
>>   ssh-keyscan -t rsa $i.<your domain> >> /etc/ssh/ssh_known_hosts
>> done
>>
>> where $i.<your domain> is the full name of each WN and
>> wn-list.conf is the list of WNs (short names).
>>
>> It might help.
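>>
>> In command form, the check described above (using the same
>> <placeholder> style as the loop; dteam001 is the pool account from
>> this thread):
>>
>> # on a WN, as root:
>> su - dteam001
>> # then, as the pool account -- should log in with no prompt:
>> ssh <your CE> /bin/hostname
>> # if it still prompts, ssh -v shows which auth methods were tried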
>>
>> Regards,
>> Dan
>>
>>
>> Patrick Guio wrote:
>>
>>> Dear Support,
>>>
>>> I am in the process of installing an LCG 2.6.0 CE_torque and
>>> WN_torque on a cluster running Rocks 3.3.0 (Makalu).
>>> The native torque is 1.0.1p6-1 and is packaged differently (a
>>> single rpm containing everything: the pbs server, the mom, and
>>> the other clients, both CLI and GUI); it was working fine.
>>>
>>> I first installed LCG 2.6.0 "manually" (using various methods:
>>> plain rpm, yum, apt) and ran the yaim configuration (for the
>>> dteam VO). The queue system was created properly and I could
>>> submit globus jobs to the lcg queue (dteam). With qmgr I manually
>>> added a "non-default" default queue, which worked fine.
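>>>
>>> For reference, creating a queue like that with qmgr might look
>>> roughly like this (the queue name and settings here are an
>>> illustration, not the exact commands I ran):
>>>
>>> qmgr -c "create queue default queue_type=execution"
>>> qmgr -c "set queue default enabled = true"
>>> qmgr -c "set queue default started = true"
>>> qmgr -c "set server default_queue = default"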
>>>
>>> Now I wanted to use the yaim install and the meta rpms for the
>>> nodes, which contain nothing but dependencies. I had to remove
>>> the native torque and install the 1.0.1p6-11.SL30X.st version,
>>> which is split into several rpms.
>>>
>>> On CE_torque:
>>> torque-1.0.1p6-11.SL30X.st
>>> torque-resmom-1.0.1p6-11.SL30X.st
>>> torque-server-1.0.1p6-11.SL30X.st
>>> torque-clients-1.0.1p6-11.SL30X.st
>>> torque-devel-1.0.1p6-11.SL30X.st
>>> lcg-CE_torque-2.6.0-sl3
>>>
>>> On WN_torque:
>>> torque-clients-1.0.1p6-11.SL30X.st
>>> torque-1.0.1p6-11.SL30X.st
>>> lcg-WN_torque-2.6.0-sl3
>>> torque-resmom-1.0.1p6-11.SL30X.st
>>>
>>> I noticed that the configuration files are not in the same place
>>> (/opt/torque/ for the native torque vs /var/spool/pbs/ for
>>> SL30X), so I reran the yaim configuration for both CE and WN and
>>> restarted maui, pbs_server and pbs_mom on the CE, and pbs_mom on
>>> the WN.
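>>>
>>> The restarts were done via the init scripts; the script names
>>> under /etc/init.d are assumed from the rpm layout:
>>>
>>> # on the CE:
>>> /etc/init.d/pbs_server restart
>>> /etc/init.d/pbs_mom restart
>>> /etc/init.d/maui restart
>>> # on each WN:
>>> /etc/init.d/pbs_mom restart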
>>> That's where I get into trouble! It seems that jobs get queued
>>> but for
>>> some reason never execute. They stay in the 'Q' state forever.
>>> % qsub -q default test-pbs.sh
>>> 1037.grid.bccs.uib.no
>>> % qstat -a
>>> grid.bccs.uib.no:
>>>                                                             Req'd  Req'd   Elap
>>> Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
>>> --------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
>>> 1037.grid.bccs. patrickg default  test-pbs.s    --   --  --    --  48:00 Q   --
>>>
>>> The same happens if I submit a globus job:
>>> % globus-job-run grid.bccs.uib.no:2119/jobmanager-lcgpbs \
>>>     -queue dteam /bin/hostname
>>> % qstat -a
>>> grid.bccs.uib.no:
>>>                                                             Req'd  Req'd   Elap
>>> Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
>>> --------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
>>> 1038.grid.bccs. dteam001 dteam    STDIN         --    1  --    --  48:00 Q   --
>>>
>>>
>>> Running pbsnodes, I get output like this:
>>> compute-0-0.local
>>> state = state-unknown,down
>>> np = 2
>>> properties = lcgpro
>>> ntype = cluster
>>>
>>>
>>> Looking at the pbs_mom log (in /var/spool/pbs/mom_logs), it seems
>>> the mom first reads its config file on the WN
>>> (/var/spool/pbs/mom_priv/config):
>>> $clienthost grid.bccs.uib.no
>>> $clienthost grid.local
>>> $clienthost localhost
>>> $clienthost localhost.localdomain
>>> $restricted grid.bccs.uib.no
>>> $restricted grid.local
>>> $logevent 511
>>> $ideal_load 1.6
>>> $max_load 2.1
>>> $usecp grid.bccs.uib.no:/home /home
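>>>
>>> For readers unfamiliar with the mom config syntax, my reading of
>>> these directives (based on the torque documentation):
>>>
>>> # $clienthost <host>  - hosts allowed to connect to the mom
>>> #                       (normally the pbs_server)
>>> # $restricted <host>  - hosts allowed query-only access
>>> # $logevent <mask>    - bitmask of event types to log
>>> # $ideal_load / $max_load - load averages below/above which the
>>> #                       mom reports the node free/busy
>>> # $usecp <host>:<src> <dst> - return job output with cp instead
>>> #                       of rcp/scp for paths shared via NFS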
>>>
>>> pbs_mom;Svr;Log;Log opened
>>> pbs_mom;Svr;restricted;grid.bccs.uib.no
>>> [snip]
>>> pbs_mom;Svr;usecp;grid.bccs.uib.no:/home /home
>>> pbs_mom;n/a;initialize;independent
>>> pbs_mom;Svr;pbs_mom;Is up
>>>
>>> and then many lines with the same message (note that 15001 is the
>>> default pbs_server port):
>>> pbs_mom;Svr;pbs_mom;im_eof, End of File from addr 129.177.120.153:15001
>>>
>>> In the pbs_server log (in /var/spool/pbs/server_logs), there are
>>> lines:
>>>
>>> Type disconnect request received from [log in to unmask], sock=9
>>> Type statusqueue request received from [log in to unmask], sock=9
>>> Type statusjob request received from [log in to unmask], sock=9
>>>
>>> In the pbs_mom log on the worker node there are many lines (about
>>> one every three minutes):
>>> Premature end of message from addr 129.177.120.153:15001
>>>
>>> The installed maui is 3.2.6p11-2_SL30X:
>>> % rpm -qa|grep maui
>>> maui-server-3.2.6p11-2_SL30X
>>> maui-3.2.6p11-2_SL30X
>>> maui-client-3.2.6p11-2_SL30X
>>> and in /var/log/maui.log on the CE I can see lines like:
>>> INFO: PBS node compute-0-0.local set to state Down
>>> (state-unknown,down)
>>> INFO: 0 PBS resources detected on RM base
>>> WARNING: no resources detected
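>>>
>>> To see what maui itself makes of the nodes and the queued jobs,
>>> the stock maui clients should help (a sketch; the job id is taken
>>> from the qstat output above):
>>>
>>> showq          # running/idle/blocked jobs
>>> diagnose -n    # node states as seen by maui
>>> checkjob 1037  # why a particular job is not starting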
>>>
>>> There seems to be a communication problem between the CE and the
>>> WN. Does torque expect communication through something other than
>>> ssh/scp? The Rocks torque works with ssh, and rshd is disabled.
>>> Could that be the problem?
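>>>
>>> One thing worth checking is raw TCP connectivity on the torque
>>> ports (15001 for pbs_server and 15002/15003 for the mom are the
>>> defaults; the hostnames are from my setup):
>>>
>>> # from the CE: can the server reach the mom?
>>> telnet compute-0-0.local 15002
>>> # from the WN: can the mom reach the server?
>>> telnet grid.bccs.uib.no 15001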
>>>
>>> Also, there seems to be some problem with the packaging:
>>>
>>> % rpm -qa | grep ^torque | xargs rpm -qV
>>> S.5....T c /etc/sysconfig/pbs
>>> ..?..... /usr/sbin/pbs_mom
>>> ..?..... /usr/sbin/pbs_server
>>> missing /var/spool/pbs/server_priv/accounting
>>> missing /var/spool/pbs/server_priv/acl_groups
>>> missing /var/spool/pbs/server_priv/acl_hosts
>>> missing /var/spool/pbs/server_priv/acl_svr
>>> missing /var/spool/pbs/server_priv/acl_users
>>> missing /var/spool/pbs/server_priv/jobs
>>> missing /var/spool/pbs/server_priv/queues
>>> S.5....T c /var/spool/pbs/server_name
>>>
>>> even though
>>> % rpm -qf /var/spool/pbs/server_priv/acl_svr
>>> torque-server-1.0.1p6-11.SL30X.st
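>>>
>>> If the server_priv tree really is missing (rather than just
>>> created elsewhere), my understanding is that torque recreates it
>>> when the server database is initialized; note this wipes any
>>> existing server configuration:
>>>
>>> /usr/sbin/pbs_server -t create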
>>>
>>> Do you have any idea what is going wrong?
>>>
>>> I found on Steve Traylen's site (http://hepunx.rl.ac.uk/~traylens)
>>> newer torque rpms for SL3 (2.0.0p1-2.sl3.st) than the ones
>>> provided by CERN, but with the same packaging.
>>>
>>> Are these test rpms? Can they be used for LCG production? Do you
>>> think they would solve my problem?
>>>
>>> Any help is appreciated.
>>>
>>> Sincerely,
>>>
>>> Patrick
>>>