On Tue, 18 Jan 2005, Vangelis Koukis wrote:
> Hello all,
>
> I have been experimenting with support for MPICH type jobs under LCG
> 2.3.0. We have been publishing the MPICH tag for HG-01-GRNET, and
> job-list-match lists our CE in the candidates for execution, when a JDL
> with JobType="MPICH" is provided.
>
> However, job submission fails with:
>
> *************************************************************
> BOOKKEEPING INFORMATION:
>
> Status info for the Job : https://lxn1188.cern.ch:9000/3XP1kH4KzLDepHEbkgWhxg
> Current Status: Aborted
> Status Reason: Cannot plan: JobAdapterHelper: invalid value torque for
> attribute lrms_type (expecting lsf or pbs)
> reached on: Tue Jan 18 13:56:47 2005
> *************************************************************
You have set up the "lcgpbs" job manager, which does *not* support MPI
(adding that support is on the to-do list, but it is non-trivial).
You can only use MPI with the standard "pbs" job manager, which you could
set up in parallel with a bit of manual work (note that it requires the CE
and the WNs to share home directories).
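For reference, once a CE with the standard "pbs" job manager is in place, an MPICH-type job would be described by a JDL along these lines (the executable name and node count here are purely illustrative):

```
Type        = "Job";
JobType     = "MPICH";
NodeNumber  = 4;
Executable  = "my_mpi_app";
StdOutput   = "std.out";
StdError    = "std.err";
InputSandbox  = {"my_mpi_app"};
OutputSandbox = {"std.out", "std.err"};
Requirements  = Member("MPICH",
    other.GlueHostApplicationSoftwareRunTimeEnvironment);
```

The Requirements expression matches only CEs that publish the MPICH tag, which is what makes them show up in edg-job-list-match.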
> which seems to be a result of LCG 2.3.0 using Torque instead of PBS.
> The error message stays the same, when trying to execute the same job on
> other CEs advertising MPICH execution capability (by specifying them
> explicitly in the JDL).
>
> Also, trying to compare the available options for integrating MPICH
> support with PBS/Torque, I came across the following link:
>
> http://www.beowulf.org/archive/2005-January/011535.html
>
> which essentially describes mpiexec as a much better alternative compared
> to mpirun for spawning application instances across worker nodes managed
> by PBS. It uses PBS directly to start them, instead of rsh/ssh, thus
> allowing for better monitoring and resource accounting. Does anyone have
> experience with that kind of configuration?
Try to experiment with it and let us know the results; we might use it in
future releases.
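For anyone experimenting with this: with the OSC mpiexec (the one discussed in the link above), the batch script becomes quite simple, since mpiexec starts the ranks through the PBS TM interface and picks up the node allocation itself. A minimal sketch, assuming mpiexec was built against the same MPICH as the application (script and application names are hypothetical):

```
#PBS -l nodes=4
cd $PBS_O_WORKDIR

# mpiexec talks to PBS directly via the TM interface, so no
# machinefile and no rsh/ssh setup is needed; PBS also sees and
# accounts for every spawned process:
mpiexec ./my_mpi_app

# Contrast with the classic mpirun, which spawns via rsh/ssh and
# escapes PBS monitoring:
#   mpirun -np 4 -machinefile $PBS_NODEFILE ./my_mpi_app
```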