Hi Dimitris, Emanouil and others,
LHC Computer Grid - Rollout wrote:
> Hi,
>
> I haven't tried your program yet, but adding to Emanouil's
> post about using MPI with non-shared home directories, you can point
> your user at:
>
> http://grid-it.cnaf.infn.it/fileadmin/sysadm/mpi-support/MPInotes.txt
>
> In my opinion, an implementation with non-shared home directories is
> preferable, if possible, since large sites will have
> performance/scaling problems if lots of worker nodes
> read/write a shared medium, depending on the type of jobs of
> course. Perhaps admins of large sites can share their experiences on
> this issue.
>
> Best regards,
I would like to raise the following points with respect to the INFN solution, and especially the user-submitted script:
1. It assumes that a site runs PBS or some derivative.
2. The wrapper script created by the LCG2 middleware already calls mpirun to execute the job script. This may be a problem, because mpirun is meant to start MPI executables, not scripts, and it may break with new releases of MPICH. Furthermore, some sites might prefer to use mpiexec instead. (We have installed a wrapper script that replaces mpirun and calls mpiexec.)
By the way, this points to a more general problem with the way MPI support is implemented. At the moment the middleware inserts the mpirun command into the job script it creates. A user who wants to do some extra work before or after the executable runs can only do so by submitting a script as the executable. This currently works with MPICH's mpirun; I am not sure whether it works with other MPI implementations or whether it will keep working in the future.
3. Problems like this should probably be solved in the middleware and not by user scripts. As far as I can see, the MPICH support in LCG2 is implemented in a simple fashion and could be improved, for example by implementing the described rsync operation for non-shared home directories.
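To make point 3 a bit more concrete, here is a rough sketch of what such a staging step might look like if it were done once, centrally, before mpirun is called. It is only an illustration: the function names, the example host names and the example directory are made up, and it assumes a PBS-style node file with one host name per line and passwordless ssh between the worker nodes. It only builds the rsync command lines and does not run them.

```python
def unique_hosts(nodefile_text, local_host):
    """Return the worker hosts listed in a PBS-style node file.

    Assumption: one host name per line, repeated once per allocated CPU.
    The local host is skipped because the files are already there.
    """
    hosts = []
    for line in nodefile_text.splitlines():
        host = line.strip()
        if host and host != local_host and host not in hosts:
            hosts.append(host)
    return hosts


def rsync_commands(nodefile_text, local_host, workdir):
    """Build one rsync command per remote host to mirror workdir there."""
    # Trailing slash: copy the directory contents, not the directory itself.
    src = workdir.rstrip("/") + "/"
    return [
        ["rsync", "-a", "-e", "ssh", src, f"{host}:{src}"]
        for host in unique_hosts(nodefile_text, local_host)
    ]


if __name__ == "__main__":
    # Hypothetical node file contents and job directory, for illustration only.
    example_nodes = "wn1\nwn1\nwn2\nwn3\n"
    for cmd in rsync_commands(example_nodes, "wn1", "/home/user/job"):
        print(" ".join(cmd))
```

The same commands with source and destination swapped could be used to collect output files after the job has finished.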
With respect to the problem of I/O load on a shared home directory: we move all single-CPU jobs to a local /scratch filesystem using the transient TMPDIR patch for Torque by David Groep. This works quite well at our site.
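The decision itself is simple and could be sketched as follows. This is not our actual setup, just an illustration of the idea: the function name and the example paths are hypothetical, and it assumes the batch system has set TMPDIR to a per-job directory on the local disk.

```python
def choose_workdir(env, ncpus, fallback):
    """Pick the working directory for a job.

    Assumption: env["TMPDIR"] points to a per-job directory on local
    scratch when the batch system provides one. Single-CPU jobs run
    there; everything else stays in the fallback directory (for
    example the shared home directory).
    """
    tmpdir = env.get("TMPDIR", "")
    if ncpus == 1 and tmpdir:
        return tmpdir
    return fallback


if __name__ == "__main__":
    import os

    # Example values are hypothetical.
    print(choose_workdir(dict(os.environ), 1, os.path.expanduser("~")))
```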
Regards,
Fokke Dijkstra
--------
Fokke Dijkstra
High Performance Computing
SARA - Reken- en Netwerkdiensten http://www.sara.nl
Tel. +31 20 592 8004 Fax. +31 20 668 3167