David Robson wrote:
> Hi,
>
> Several of our users are interested in running MPI jobs on the grid, but
> none have managed to get a job to run. It seems that only three sites that
> support our VO (fusion) also support mpich jobs. If I submit to
> ce1.egee.fr.cgg.com or to our own grid002.jet.efda.org, I get the
> following error.
>
> Cannot plan: JobAdapterHelper: invalid value torque for attribute
> lrms_type (expecting lsf or pbs)
This is due to limitations in the resource broker software, but there
isn't an inherent problem with using torque if you have shared homes
working. This page gives a workaround:
http://goc.grid.sinica.edu.tw/gocwiki/MPI%2e_Cannot_plan%3a_JobAdapterHelper%3a_invalid_value_torque_for_attribute_lrms_type
The workaround you want is to name the CE and the LRMS type explicitly:
edg-job-submit --vo <VO_name> -r <CE_name> --lrms pbs myFile.jdl
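As a rough sketch of how that command might be filled in for your case (the helper function `build_submit_cmd` is purely illustrative, and I am assuming here that the bare host name is acceptable to `-r`; your setup may need the full CE identifier instead), something like this only prints the command so you can check it before running it:

```shell
#!/bin/sh
# Illustrative dry run: assemble the workaround command and print it.
# build_submit_cmd is a hypothetical helper, not part of the grid tools.
build_submit_cmd() {
    # $1 = VO name, $2 = CE name, $3 = JDL file
    echo "edg-job-submit --vo $1 -r $2 --lrms pbs $3"
}

# Values taken from this thread; the CE string is an assumption and
# may need to be the full CE identifier on your system.
build_submit_cmd "fusion" "grid002.jet.efda.org" "myFile.jdl"
```

Once the printed line looks right, you can paste it into your UI shell to do the real submission.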
It is also worth trying the test job referenced at the bottom of this page:
http://goc.grid.sinica.edu.tw/gocwiki/MPI_Support_with_Torque
We (csTCDie) have MPI running on our site and are keen for test users, so
I will have a look at enabling the fusion VO if I get a minute later on.
It's good to see that you're trying to run MPI jobs. I'm aware that
support is currently poor, but we are working on it and hopefully things
will improve in the next few months.
> Type = "job";
> JobType = "mpich";
> NodeNumber = 2;
> InputSandbox = "cpi.sh";
> Executable = "cpi.sh";
> StdOutput = "std.out";
> StdError = "std.err";
> OutputSandbox = {"std.out","std.err"};
> VirtualOrganisation="fusion";
> RetryCount=7;
>
> I (rather naively) thought our site (EFDA-JET - grid002.jet.efda.org)
> would support MPICH if we
>
> a) Had mpich installed
> b) Had a common home directory across WNs
> c) Had MPICH in our CE_RUNTIMEENV variable.
Yes, that should be enough to run MPI jobs, but until the resource broker
is updated it will remain awkward to submit through the standard mechanism.
The other problem is that the RB will hardcode a call to mpirun, so if
that's not available or not configured then you might have problems.
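One quick way to check the mpirun point is to run a small script on a worker node, for example as a plain (non-MPI) job. This is only a sketch: `check_launcher` is a name I made up, and it only tests whether the launcher is on the PATH, not whether it is configured correctly.

```shell
#!/bin/sh
# Hypothetical pre-flight check: is an MPI launcher visible on PATH?
check_launcher() {
    # $1 = launcher name to look for (e.g. mpirun)
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $(command -v "$1")"
        return 0
    else
        echo "missing: $1"
        return 1
    fi
}

# The RB-generated wrapper calls mpirun, so that is the name to test.
check_launcher mpirun || echo "mpirun is not on PATH on $(hostname)"
```

If it reports mpirun as missing, adding the MPICH bin directory to the default PATH on the WNs would be the first thing to try.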
> This is clearly not the case. Should we be supporting lam
> or openmpi or shouldn't we be contemplating running parallel grid
> jobs at all?
Please don't give up! If you have further queries, don't hesitate to mail me.
Stephen
--
Dr. Stephen Childs,
Research Fellow, EGEE Project, phone: +353-1-8961797
Computer Architecture Group, email: Stephen.Childs @ cs.tcd.ie
Trinity College Dublin, Ireland web: http://www.cs.tcd.ie/Stephen.Childs