On Tue, Mar 08, 2005 at 07:36:26pm +0100, Charles Loomis wrote:
> Hello,
>
> I've managed to get a reasonable configuration of torque, maui, mpich,
> and mpiexec working at LAL. The configuration fixes the problem you
> describe below and also adds support for mpiexec. It uses newer
> versions of torque and mpich than are standard in the LCG/EGEE
> distribution. The newer versions are running at LAL and work well.
>
Hello All,
I've been following Cal's very good instructions from
http://goc.grid.sinica.edu.tw/gocwiki/MPI_Support_with_Torque
in order to get a working Torque/mpiexec setup on our site, HG-01-GRNET,
under LCG 2.4.0. Everything seems to be working OK; however, there are
a few points which I don't think I have completely understood regarding
the way the middleware interacts with MPICH.
I've installed the latest version of Torque, 1.2.0p3, rebuilt from Cal's
original source RPMs; it seems to work without any problems. The changes
needed were trivial, and the TMPDIR patch applies cleanly to this
version. The resulting files can be found at:
http://www.cslab.ece.ntua.gr/~vkoukis/torque-mpi
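For completeness, the rebuild itself was just the usual source-RPM
procedure; the package file names below are illustrative, not
necessarily the exact ones published at the URL above:

    # Rebuild the patched source RPM and install the resulting packages
    # (file names are placeholders; use the ones from the URL above).
    rpmbuild --rebuild torque-1.2.0p3-1.src.rpm
    rpm -Uvh /usr/src/redhat/RPMS/i386/torque-*.rpm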
I've also set up the submit filter and changed the LRMS type to 'pbs', as
suggested in the Wiki, in order to work around the problem with the RB
not liking the 'torque' LRMSType.
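For anyone who hasn't dealt with it before, the submit filter is simply
a script that qsub runs on every submission: it receives the job script
on stdin and must write the (possibly rewritten) script to stdout. A
stripped-down sketch, assuming the default /var/spool/pbs location (the
real filter from the Wiki does the actual rewriting, of course):

    # /var/spool/pbs/torque.cfg on the CE points at the filter:
    #   SUBMITFILTER /var/spool/pbs/submit_filter

    #!/bin/sh
    # /var/spool/pbs/submit_filter: the submitted job script arrives on
    # stdin; whatever is printed on stdout is what actually gets queued.
    # Site-specific rewriting of the resource requests would go here;
    # this sketch just passes the script through unchanged.
    cat
    exit 0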
MPICH job submission works without problems when using a JDL like:
Type = "job";
JobType = "MPICH";
NodeNumber = 16;
Executable = "hello";
#Arguments ="a1";
StdOutput = "hello.out";
StdError = "hello.err";
InputSandbox = {"hello"};
OutputSandbox = {"hello.out","hello.err"};
Requirements = other.GlueCEUniqueID == "ce01.isabella.grnet.gr:2119/jobmanager-pbs-dteam";
and submitting the job with just "edg-job-submit --vo dteam mpi.jdl".
"hello" is the actual executable of the application, produced by
"mpicc -o hello hello.c".
The point is, I am not sure exactly *who* is responsible for calling
mpirun in order to spawn the processes of the parallel application.
The GOC Wiki suggests (and many other sources of documentation suggest
something similar) using a test job which declares a shell script as the
executable; the shell script then calls mpirun/mpiexec at its own
discretion, whenever it decides it is necessary. However, from what I
have seen with both LCG 2.3.1 and LCG 2.4.0, mpirun is not meant to be
called by such a shell script; it is invoked directly from the job
script that the Globus job manager submits to Torque.
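For reference, the kind of self-contained script that approach has in
mind would look roughly like the one below; the executable name, the
process count and the use of $PBS_NODEFILE are my own guesses, not a
copy of Cal's MPItest.sh:

    #!/bin/sh
    # Wiki-style approach: the JDL Executable is this script, and the
    # script itself starts the parallel job on the nodes that Torque
    # allocated (listed in $PBS_NODEFILE).
    chmod +x ./hello
    mpirun -np 16 -machinefile "$PBS_NODEFILE" ./hello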
So, to provide support for mpiexec, I had to replace the MPICH-provided
/usr/bin/mpirun with a very small, very simple wrapper which discards
all mpirun-specific arguments, and then calls mpiexec, as Fokke Dijkstra
suggests in:
http://www.listserv.rl.ac.uk/cgi-bin/webadmin?A2=ind0503&L=LCG-ROLLOUT&P=R45745&I=-3
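Roughly, the wrapper boils down to something like this (a sketch; the
version actually installed may handle a few more options):

    #!/bin/sh
    # Replacement /usr/bin/mpirun: throw away the mpirun-specific options
    # that pbs.pm inserts (-np N, -machinefile FILE) and hand the rest of
    # the command line to mpiexec, which takes the node list and process
    # count from Torque itself.
    while [ $# -gt 0 ]; do
        case "$1" in
            -np|-machinefile) shift 2 ;;  # drop the option and its value
            *) break ;;                   # first non-option word is the executable
        esac
    done
    exec /usr/bin/mpiexec "$@"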
The way mpirun is inserted in the job description submitted to Torque
can be seen in /opt/globus/lib/perl/Globus/GRAM/JobManager/pbs.pm on
the CE, which constructs the submission script for Torque/PBS:
    if($description->jobtype() eq 'mpi')
    {
        print JOB "$mpirun -np ", $description->count(), ' ';
        if($cluster)
        {
            print JOB " -machinefile \$PBS_NODEFILE ";
        }
        print JOB $description->executable(), " $args < ",
                  $description->stdin(), "\n";
    }
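For the JDL above, this fragment ends up writing a line roughly like the
following into the generated Torque job script (the executable path is a
placeholder for the job's working directory):

    mpirun -np 16 -machinefile $PBS_NODEFILE /path/to/jobdir/hello < /dev/null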
It seems that the user has no control over whether mpirun is called
or not. In fact, when I tried submitting a JDL with Cal's MPItest.sh as
the executable, the small /usr/bin/mpirun wrapper was still invoked.
This means that MPItest.sh itself gets used as the parallel application,
which is wrong... Depending on the platform-specific way MPICH spawns
its processes, this could lead to multiple instances of the script
running around the cluster.
So, the question is: as our installation stands now, a user can submit
an MPICH executable directly, without doing anything special, and
get back its results. Is this the usual way in which MPICH jobs are
submitted, or do the script-based methods also need to be supported?
And if so, exactly how can this be done?
Sorry for the length of my e-mail,
Best Regards,
Vangelis.
--
Vangelis Koukis, PhD candidate
Computing Systems Laboratory,
National Technical University of Athens.
Institute of Communication and Computer Systems (ICCS)
[log in to unmask]