On Wed, Jan 19, 2005 at 02:09:28am +0100, [log in to unmask] wrote:
> On Tue, 18 Jan 2005, Vangelis Koukis wrote:
>
> > Hello all,
> >
> > I have been experimenting with support for MPICH type jobs under LCG
> > 2.3.0. We have been publishing the MPICH tag for HG-01-GRNET, and
> > job-list-match lists our CE in the candidates for execution, when a JDL
> > with JobType="MPICH" is provided.
> >
> > However, job submission fails with:
> >
> > *************************************************************
> > BOOKKEEPING INFORMATION:
> >
> > Status info for the Job : https://lxn1188.cern.ch:9000/3XP1kH4KzLDepHEbkgWhxg
> > Current Status: Aborted
> > Status Reason: Cannot plan: JobAdapterHelper: invalid value torque for
> > attribute lrms_type (expecting lsf or pbs)
> > reached on: Tue Jan 18 13:56:47 2005
> > *************************************************************
>
> You have set up the "lcgpbs" job manager, which does *not* support MPI
> (it is on the to-do list, non-trivial).
>
After quite a lot of experimentation, we seem to have MPICH support, on
a YAIM and SL3-based LCG 2.3.0 install, with Torque.
1) Passwordless ssh from WN to WN has been set up:
HostBasedAuthentication must be enabled in /etc/ssh/sshd_config of every
WN, the host keys of all WNs must be added in /etc/ssh/ssh_known_hosts
and /etc/ssh/shosts.equiv must contain the names of the WNs. No password
should be needed for a pool account user to ssh from one WN to another,
not even a prompt to confirm the host key, provided that the fully
qualified host names have been used in all the files above.
Also, it seems that it is not necessary to allow passwordless login from
the CE to the WNs, since mpirun is executed on one of the WNs
originally by pbs_mom, and then spawns all of the other processes.
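The relevant fragments look roughly like this (the WN names below are examples; use your own WNs' full names consistently in all three files; depending on the OpenSSH version, the client side may also need HostbasedAuthentication yes and EnableSSHKeysign yes in /etc/ssh/ssh_config):

```
# /etc/ssh/sshd_config on every WN (fragment)
HostbasedAuthentication yes

# /etc/ssh/shosts.equiv on every WN -- one fully qualified WN name per line
wn01.example.grnet.gr
wn02.example.grnet.gr

# /etc/ssh/ssh_known_hosts on every WN -- host keys of all WNs,
# e.g. collected with: ssh-keyscan wn01.example.grnet.gr wn02.example.grnet.gr
```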
2) The version of Torque included with LCG 2.3.0 and installed by yaim
seems to have native support for MPICH. When a JDL defining a job of type
"MPICH" is submitted, pbs_mom on one of the WNs will exec the appropriate
mpirun command, with the correct "-np" argument depending on the number
of processors, and with a "-machinefile" argument which is compiled on
the fly, based on the WNs that have been selected to execute the job. So,
no extra wrapper scripts that execute mpirun and look at the value of
$PBS_NODEFILE are necessary. (This assumes that /home is shared by all
WNs, otherwise a script may need to be used to copy necessary data to
other WNs). Also, a symbolic link /usr/bin/mpirun -> /opt/mpich/bin/mpirun
must be used, for pbs_mom to find mpirun.
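What Torque does here natively is roughly equivalent to the classic wrapper-script approach it makes unnecessary; the sketch below shows the idea, with made-up node names and application path, using a scratch file in place of the $PBS_NODEFILE that pbs_mom would supply:

```shell
# Fake the node file that pbs_mom would normally provide via $PBS_NODEFILE
# (node names are hypothetical).
PBS_NODEFILE=/tmp/nodefile.demo
printf 'wn01.example.grnet.gr\nwn02.example.grnet.gr\n' > "$PBS_NODEFILE"

# Derive -np from the node file and print the mpirun command that would
# be run (echoed instead of executed, as there is no MPI binary here).
NP=$(wc -l < "$PBS_NODEFILE")
echo mpirun -np "$NP" -machinefile "$PBS_NODEFILE" ./my_mpi_app
```

With the Torque shipped in LCG 2.3.0 this assembly happens inside pbs_mom itself, which is why no such wrapper is needed when /home is shared.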
3) SL3 by default installs MPICH in /opt/mpich, but does not provide the
necessary development tools that come with it (mpicc, mpif77 etc.). If an
mpicc exists in /usr/bin, it probably belongs to the LAM implementation
of MPI. These compiler wrapper scripts are not needed on the WNs, but
must be installed on the UI, to facilitate application development. In
our case, we installed MPICH on the UI from source, in /opt/mpich.
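With that in place, building an MPI application on the UI is just (hello.c stands in for the real source; the paths assume the from-source install in /opt/mpich):

```
/opt/mpich/bin/mpicc -o hello hello.c
/opt/mpich/bin/mpif77 -o fhello fhello.f
```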
4) /opt/lcg/var/gip/lcg-info-generic.conf was modified, to show support
for MPICH, by adding:
GlueHostApplicationSoftwareRunTimeEnvironment: MPICH
5) MPI jobs cannot be submitted through the lxn1188.cern.ch RB, since it
rejects the value of lrms_type (see the error message above). However,
submission works properly if the job is submitted directly to a queue
on our CE using '-r', along with the '--lrms pbs' parameter of
edg-job-submit:
edg-job-submit --vo dteam -r ce01.isabella.grnet.gr:2119/jobmanager-torque-short --lrms pbs mpi.jdl
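The mpi.jdl itself is not quoted above; a minimal one for this kind of setup would look roughly as follows (executable, node count and sandbox names are illustrative; the Requirements line matches the tag published in step 4):

```
Type          = "Job";
JobType       = "MPICH";
NodeNumber    = 2;
Executable    = "hello";
StdOutput     = "hello.out";
StdError      = "hello.err";
InputSandbox  = {"hello"};
OutputSandbox = {"hello.out", "hello.err"};
Requirements  = Member("MPICH", other.GlueHostApplicationSoftwareRunTimeEnvironment);
```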
6) A final touch: to suppress ssh's messages about a host key being
added to the pool account's ~/.ssh/known_hosts, so that they do not
appear in the stderr output of MPICH jobs, the line RSHCOMMAND="ssh"
can be changed to RSHCOMMAND="ssh -q" in the file /opt/mpich/bin/mpirun
of every WN.
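The edit is easily scripted; the sketch below demonstrates the substitution on a scratch file (on a real WN, point sed at /opt/mpich/bin/mpirun itself):

```shell
# Demonstrate the RSHCOMMAND change on a scratch file; on a WN the
# target would be /opt/mpich/bin/mpirun.
printf 'RSHCOMMAND="ssh"\n' > /tmp/mpirun.frag
sed -i 's/^RSHCOMMAND="ssh"$/RSHCOMMAND="ssh -q"/' /tmp/mpirun.frag
grep RSHCOMMAND /tmp/mpirun.frag   # -> RSHCOMMAND="ssh -q"
```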
--
Vangelis Koukis
[log in to unmask]
OpenPGP public key ID:
pub 1024D/1D038E97 2003-07-13 Vangelis Koukis <[log in to unmask]>
Key fingerprint = C5CD E02E 2C78 7C10 8A00 53D8 FBFC 3799 1D03 8E97