Hi Stephen,
Thanks for that info; it was very useful. I was able to run my code on
our site (EFDA-JET).
Thanks also for supporting the fusion VO. I tried running the same code
at your site, but it failed.
It looks like an ssh host key verification failure between Worker Nodes
(cagnode72 and cagnode76 in the stderr below).
Dave
#####################
The stderr said ...
#####################
0 bytes 0.00 KB/sec avg 0.00 KB/sec
inst@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that the RSA host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
9f:13:8c:bf:bb:f3:3d:66:44:5a:77:41:bb:74:0e:a0.
Please contact your system administrator.
Add correct host key in /home/dte015/.ssh/known_hosts to get rid of this
message.
Offending key in /etc/ssh/ssh_known_hosts:237
RSA host key for cagnode72.cs.tcd.ie has changed and you have requested
strict checking.
Host key verification failed.
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that the RSA host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
cd:cc:f2:64:2f:d9:e1:6a:74:6b:f6:f2:f3:cb:c7:e2.
Please contact your system administrator.
Add correct host key in /home/dte015/.ssh/known_hosts to get rid of this
message.
Offending key in /etc/ssh/ssh_known_hosts:242
RSA host key for cagnode76.cs.tcd.ie has changed and you have requested
strict checking.
Host key verification failed.
/opt/mpi/bin/mpirun: line 18: cpi: command not found
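In case it helps whoever looks after the known_hosts file, here is a rough Python sketch that pulls the affected host names and the offending line numbers out of a stderr dump like the one above. The function name and the sample string are my own; the two patterns are simply copied from the warning text, so they are only an assumption about how other ssh versions word the message.

```python
import re

# Patterns copied from the ssh warning text in the stderr above.
OFFENDING_RE = re.compile(r"Offending key in (?P<path>\S+):(?P<line>\d+)")
HOST_RE = re.compile(r"RSA host key for (?P<host>\S+) has changed")


def find_stale_host_keys(stderr_text):
    """Return (host, known_hosts_path, line_number) for each host key warning."""
    stale = []
    pending = None  # last "Offending key" entry, waiting for its host line
    for line in stderr_text.splitlines():
        m = OFFENDING_RE.search(line)
        if m:
            pending = (m.group("path"), int(m.group("line")))
            continue
        m = HOST_RE.search(line)
        if m and pending is not None:
            stale.append((m.group("host"), pending[0], pending[1]))
            pending = None
    return stale


# Hypothetical sample: a trimmed copy of the relevant stderr lines.
sample = """\
Offending key in /etc/ssh/ssh_known_hosts:237
RSA host key for cagnode72.cs.tcd.ie has changed and you have requested
strict checking.
Host key verification failed.
Offending key in /etc/ssh/ssh_known_hosts:242
RSA host key for cagnode76.cs.tcd.ie has changed and you have requested
strict checking.
Host key verification failed.
"""

for host, path, lineno in find_stale_host_keys(sample):
    print(f"{host}: stale entry at {path} line {lineno}")
```

An admin could then refresh just those entries, for example with `ssh-keygen -R <host> -f <file>` followed by re-adding the current key, assuming the new keys are the legitimate ones.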
######################
and the stdout said ...
######################
Modified mpirun: Executing command: ./cpi.sh
Using grid catalog type: lfc
Using grid catalog : rb-egee.bifi.unizar.es
Source URL: lfn:/grid/fusion/swiesen/cpi.tgz
File size: 137246
VO name: fusion
Source URL for copy:
gsiftp://grid001.jet.efda.org/grid001.jet.efda.org:/grid/dpm/fusion/2006-10-02/fileb2058830-fb8d-46b9-bd9a-40437e2501e2.8585.0
Destination URL:
file:/home/dte015/globus-tmp.cagnode56.18574.0/.mpi/https_3a_2f_2frb-egee.bifi.unizar.es_3a9000_2fPvMD5DbLv17rwwn-m6r8Ig/cpi.tgz
# streams: 1
# set timeout to 0 (seconds)
Transfer took 2040 ms
PBS Nodefile: /var/spool/pbs/aux//206576.gridgate.cs.tcd.ie
***********************************************************************
Node count: 4
Nodes in /var/spool/pbs/aux//206576.gridgate.cs.tcd.ie:
cagnode56.cs.tcd.ie
cagnode70.cs.tcd.ie
cagnode72.cs.tcd.ie
cagnode76.cs.tcd.ie
***********************************************************************
***********************************************************************
Checking ssh for each node:
Checking cagnode56.cs.tcd.ie...
cagnode56.cs.tcd.ie
Checking cagnode70.cs.tcd.ie...
cagnode70.cs.tcd.ie
Checking cagnode72.cs.tcd.ie...
Checking cagnode76.cs.tcd.ie...
***********************************************************************
Modified mpirun: Executing command: cpi
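The stdout above tells the same story: only cagnode56 and cagnode70 echo their host name after the "Checking ..." line. A small Python sketch of that reading follows. It assumes (my guess, not something I have confirmed in the mpirun wrapper) that a successful ssh check prints the node's host name on the very next line; the function name is hypothetical.

```python
def nodes_failing_ssh_check(stdout_text):
    """List nodes whose 'Checking <host>...' line is not followed by <host>."""
    lines = [ln.strip() for ln in stdout_text.splitlines()]
    failed = []
    for i, line in enumerate(lines):
        # "Checking ssh for each node:" ends in ':' and is skipped here.
        if line.startswith("Checking ") and line.endswith("..."):
            host = line[len("Checking "):-len("...")]
            following = lines[i + 1] if i + 1 < len(lines) else ""
            if following != host:
                failed.append(host)
    return failed


# Sample copied from the ssh check section of the stdout above.
check_output = """\
Checking ssh for each node:
Checking cagnode56.cs.tcd.ie...
cagnode56.cs.tcd.ie
Checking cagnode70.cs.tcd.ie...
cagnode70.cs.tcd.ie
Checking cagnode72.cs.tcd.ie...
Checking cagnode76.cs.tcd.ie...
"""

print(nodes_failing_ssh_check(check_output))
```

On this output the two nodes flagged are the same two that appear in the host key warnings in stderr.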
Stephen Childs wrote:
> David Robson wrote:
>
>> Hi,
>>
>> Several of our users are interested in running MPI jobs on the grid, but
>> none have managed to get a job to run. It seems that only three sites that
>> support our VO (fusion) also support mpich jobs. If I submit to
>> ce1.egee.fr.cgg.com or to our own grid002.jet.efda.org, I get the
>> following error.
>>
>> Cannot plan: JobAdapterHelper: invalid value torque for attribute
>> lrms_type (expecting lsf or pbs)
>
>
> This is due to limitations in the resource broker software, but there
> isn't an inherent problem with using torque if you have shared homes
> working. This page gives a workaround:
> http://goc.grid.sinica.edu.tw/gocwiki/MPI%2e_Cannot_plan%3a_JobAdapterHelper%3a_invalid_value_torque_for_attribute_lrms_type
>
>
> The one you want is:
> edg-job-submit --vo <VO_name> -r <CE_name> --lrms pbs myFile.jdl
>
> Also worth trying the test job referenced at the bottom of this page:
> http://goc.grid.sinica.edu.tw/gocwiki/MPI_Support_with_Torque
>
> We (csTCDie) have MPI running on our site and are keen for test users,
> so I will have a look at enabling the fusion VO if I get a minute
> later on.
>
> It's good to see that you're trying to run MPI jobs. I'm aware that
> support is currently poor but we are working on it and hopefully
> things will improve in the next few months.
>
>> Type = "job";
>> JobType = "mpich";
>> NodeNumber = 2;
>> InputSandbox = "cpi.sh";
>> Executable = "cpi.sh";
>> StdOutput = "std.out";
>> StdError = "std.err";
>> OutputSandbox = {"std.out","std.err"};
>> VirtualOrganisation="fusion";
>> RetryCount=7;
>>
>> I (rather naively) thought our site (EFDA-JET - grid002.jet.efda.org)
>> would support MPICH if we
>>
>> a) had mpich installed
>> b) had a common home directory across WNs
>> c) had MPICH in our CE_RUNTIMEENV variable.
>
>
> Yes that should be enough to run MPI jobs but until the resource
> broker is updated it will remain awkward to submit through the
> standard mechanism. The other problem is that the RB will hardcode a
> call to mpirun so if that's not available or not configured then you
> might have problems.
>
>> This is clearly not the case. Should we be supporting lam
>> or openmpi, or shouldn't we be contemplating running parallel grid
>> jobs at all?
>
> Please don't give up! If you have further queries, don't hesitate to
> mail me.
>
> Stephen
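
For anyone else following the thread, the workaround command quoted above can be assembled from a script along these lines. This is only a sketch: the helper name is hypothetical, the flags are exactly those in the quoted command, and `<CE_name>` is left as a placeholder because the right CE identifier depends on the site.

```python
import shlex


def build_submit_command(vo, ce, jdl_path, lrms="pbs"):
    """Build the edg-job-submit argument list from the quoted workaround."""
    # Passing --lrms pbs sidesteps the "invalid value torque" planning error.
    return ["edg-job-submit", "--vo", vo, "-r", ce, "--lrms", lrms, jdl_path]


# "<CE_name>" is a placeholder, not a real CE identifier.
cmd = build_submit_command("fusion", "<CE_name>", "myFile.jdl")
print(shlex.join(cmd))
```

Keeping the arguments as a list means the same value can be handed to `subprocess.run` without any shell quoting issues.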