Hi Joe.
This is strange: the "any WN must be able to ssh to any other WN"
requirement has been there at least since release 1 of the EDG software
last year, and we (meaning LCG) did not touch that part of the code,
especially not between LCG1-1_1_0 and LCG1-1_1_1.
A few changes to the pbs jobmanager were made between LCG1-1_0_1 and
LCG1-1_1_0, but in a different part of the code. In any case, I will
investigate with David Smith, who made the modifications to the jobmanager.
Cheers
Emanuele
Joe Kaiser wrote:
> Hi,
>
> Thanks very much for your help. I have been trying to figure out what I
> had done wrong in my PBS configuration. This proved enlightening. My
> comments are below.
>
> On Wed, 2003-11-05 at 02:53, Emanuele LEONARDI wrote:
>
>>Hi Joe.
>>
>>Let me see if I understood your problem:
>>
>>1) you log on to the CE, su to a virtual account, submit a job with qsub and
>>this works fine (i.e. pbs is correctly configured)
>>2) you run a globus-job-run command from your UI to the CE using the
>>default fork jobmanager and this also works (i.e. the gatekeeper is
>>correctly configured)
>>3) you run a jdl with edg-job-submit to your CE and it works fine (i.e.
>>the whole system seems to work fine)
>>4) you run a globus-job-run command from your UI to the CE using the pbs
>>jobmanager and this fails
>>
>
>
> These are all correct.
>
>
>>If this description is correct (my tests show this, anyway) then I know
>>what the problem is. The pbs jobmanager from globus has an interesting
>>feature: when a job is submitted without specifying whether it is a simple
>>job or an MPI one, the jobmanager defaults to MPI. MPI jobs are handled
>>fine by most batch systems but not by PBS, so the jobmanager uses a
>>dirty trick: it wraps the job into a special script and then submits it
>>as a non-MPI job. When this script is started it spawns all the sub-jobs
>>composing the MPI job, sending one job per known WN. Of course it
>>assumes that each and every WN can make a passwordless ssh connection to
>>any other WN. Now, if the MPI job is in fact a simple job which appears
>>as MPI only because the jobmanager defaults to it, the wrapper is
>>added anyway, and when the job starts on a WN it tries to use ssh to
>>run a job on... the WN itself.
>>
>>If passwordless ssh connections are completely disabled at your site, then
>>the wrapper will not be able to run the job with ssh and you get the
>>message you are seeing.
>>
>
> This is true for our site. We do not have the option for ssh
> communication unless kerberos authentication becomes a part of PBS.
>
>
>>If you did not explicitly change anything on the WNs to prevent
>>passwordless ssh login, then the problem might be that the
>>/etc/ssh/ssh_known_hosts file contains some old public keys which are
>>no longer valid (this may happen if you reinstalled some of the WNs or
>>the CE). In this case, you may try deleting all
>>/etc/ssh/ssh_known_hosts files on your WNs and your CE and then running
>>/opt/edg/sbin/edg-pbs-knownhosts on each node (it is run every 6
>>hours anyway). Please let me know if you try this.
>>
>>As for the general weirdness of the way globus-job-run handles jobs on
>>pbs, Steve Traylen (I think) posted a couple of days ago a way to force
>>jobs to be tagged as non-MPI:
>>
>>globus-job-run hotdog46.fnal.gov/jobmanager-pbs -x '&(jobType=single)' /bin/pwd
>>
>
>
> This worked, though, without having to specify '&(jobType=single)' before
> I upgraded to LCG1-1_1_1. Was there a pbs upgrade that included the MPI
> "feature"?
>
> Thanks much,
>
> Joe
>
>
>>I tested it on your system and it works fine.
>>
>>Cheers
>>
>> Emanuele
>>Joe Kaiser wrote:
>>
>>>Hi,
>>>
>>>I am stuck on PBS again. I have followed the setup in the
>>>lcg1-notes.txt for not using ssh, i.e. using PBS with shared home
>>>areas. Still, when I submit a job I get a:
>>>
>>> Permission denied (external-keyx,gssapi,keyboard-interactive).
>>>
>>>Which is an ssh error. I can run jobs on the internal PBS system, i.e.
>>>if I submit from my head node the job runs and returns correctly.
>>>
>>>NO_SHARE_HOME and CE_JM_TYPE are undefined per the instructions.
>>>
>>>Any ideas of things I should check?
>>>
>>>Thanks,
>>>
>>>Joe
>>>
>>>--
>>>===================================================================
>>>Joe Kaiser - Systems Administrator
>>>
>>>Fermi Lab
>>>CD/OSS-SCS Never laugh at live dragons.
>>>630-840-6444
>>>[log in to unmask]
>>>===================================================================
>>
>>
>>--
>>/------------------- Emanuele Leonardi -------------------\
>>| eMail: [log in to unmask] - Tel.: +41-22-7674066 |
>>| IT division - Bat.31 2-012 - CERN - CH-1211 Geneva 23 |
>>\---------------------------------------------------------/
>
> --
> ===================================================================
> Joe Kaiser - Systems Administrator
>
> Fermi Lab
> CD/OSS-SCS Never laugh at live dragons.
> 630-840-6444
> [log in to unmask]
> ===================================================================
--
/------------------- Emanuele Leonardi -------------------\
| eMail: [log in to unmask] - Tel.: +41-22-7674066 |
| IT division - Bat.31 2-012 - CERN - CH-1211 Geneva 23 |
\---------------------------------------------------------/
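
The two checks discussed in this thread can be sketched as a pair of small shell helpers. This is only an illustration under stated assumptions, not part of LCG, EDG or Globus: the function names check_wn_ssh and run_single and the SSH and GLOBUS_JOB_RUN override variables are hypothetical. The globus-job-run arguments follow the command quoted in the thread, and BatchMode=yes is the standard OpenSSH option that makes ssh fail instead of prompting for a password, which is roughly the situation the jobmanager wrapper is in on a WN.

```shell
#!/bin/sh
# Sketch only: check_wn_ssh, run_single, SSH and GLOBUS_JOB_RUN are
# hypothetical names introduced for illustration.

# Try a non-interactive ssh to every worker node given as an argument.
# -n keeps ssh from reading stdin; BatchMode=yes disables password prompts,
# so a node that would ask for a password is reported as a failure.
# Returns 0 only if every node accepted the connection.
check_wn_ssh() {
    failed=0
    for wn in "$@"; do
        if ${SSH:-ssh} -n -o BatchMode=yes "$wn" true >/dev/null 2>&1; then
            echo "$wn: ok"
        else
            echo "$wn: passwordless ssh FAILED"
            failed=1
        fi
    done
    return $failed
}

# Run a command through a jobmanager contact string, tagging the job as
# non-MPI with the RSL clause quoted in the thread.
run_single() {
    contact=$1
    shift
    ${GLOBUS_JOB_RUN:-globus-job-run} "$contact" -x '&(jobType=single)' "$@"
}
```

Run on a WN, check_wn_ssh "$(hostname)" would test the "ssh to the WN itself" case described above, and run_single hotdog46.fnal.gov/jobmanager-pbs /bin/pwd reproduces the workaround command.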