

Marco Verlato wrote:

> Kostas Georgakopoulos wrote:
>> We installed the middleware with yaim, specifically the 
>> lcg-CE-torque and lcg-WN-torque packages on the CE and WNs 
>> respectively. However, 
>> in the site configuration file we have:
>> JOB_MANAGER=lcgpbs
>> CE_BATCH_SYS=torque
>> The equivalent of what you suggest would be to change JOB_MANAGER 
>> and CE_BATCH_SYS to pbs and reconfigure the CE and WNs, right?
> No, change only CE_BATCH_SYS to pbs and reconfigure only the CE.
> Done this way, the INFN solution described at 
> also works with 
> non-shared home directories,

And will the normal (non-MPI) jobs still work? Charles Loomis made 
exactly that point: if you change the configuration to pbs 
and you *don't* have shared home directories, 
then all jobs will fail.

> once the following has been done on all WNs and the CE:
> 1. The file /etc/ssh/sshd_config must contain at least the following 
> lines:
> HostbasedAuthentication yes
> IgnoreUserKnownHosts yes
> IgnoreRhosts yes
> 2. The file /etc/ssh/ssh_config must contain, in the section Host *, the 
> following line:
> HostbasedAuthentication yes
> 3. The file /etc/ssh_known_hosts2 must contain the public keys of all 
> the WNs and the
>     CE in the site and must be replicated on every machine.
> 4. The file /etc/ssh/shosts.equiv must contain the list of the 
> hostnames of the WNs and the CE.
> 5. The ssh daemon has to be restarted:     /sbin/service sshd restart
> The script will then copy the whole job subdirectory from the WN where the 
> job is executed to all the other WNs in the set chosen for the job.
> best regards,
> Marco
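
A minimal sketch of how steps 1 and 2 above could be verified on a node before restarting sshd. The function names has_setting and check_hostbased are made up for illustration, and the check only recognises the plain "Keyword value" form, not "Keyword=value".

```sh
#!/bin/sh
# has_setting FILE KEY VALUE
# Succeeds if FILE has an uncommented line "KEY VALUE" (keyword match is
# case-insensitive, leading whitespace is allowed).
has_setting() {
    grep -Eiq "^[[:space:]]*$2[[:space:]]+$3[[:space:]]*$" "$1"
}

# check_hostbased SSHD_CONFIG SSH_CONFIG
# Reports every required line that is missing; returns non-zero if any is.
check_hostbased() {
    sshd_conf=$1
    ssh_conf=$2
    status=0
    for key in HostbasedAuthentication IgnoreUserKnownHosts IgnoreRhosts; do
        if ! has_setting "$sshd_conf" "$key" yes; then
            echo "missing in $sshd_conf: $key yes"
            status=1
        fi
    done
    if ! has_setting "$ssh_conf" HostbasedAuthentication yes; then
        echo "missing in $ssh_conf: HostbasedAuthentication yes"
        status=1
    fi
    return $status
}

# Example call on a WN or the CE:
#   check_hostbased /etc/ssh/sshd_config /etc/ssh/ssh_config
```

It does not check that the line in ssh_config really sits inside the Host * section, nor the contents of the known hosts and shosts.equiv files from steps 3 and 4.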
>> best regards,
>> Kostas Georgakopoulos - University of Macedonia
>> Marco Verlato wrote:
>>> Hi Kostas,
>>> If it helps: in the Italian Grid we found that MPI didn't work with 
>>> torque if the CE GRIS published GlueCEInfoLRMSType=torque, as it does in 
>>> your case for the CE. After setting 
>>> GlueCEInfoLRMSType=pbs our MPI implementation 
>>> (  worked.
>>> best regards,
>>> Marco
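
A small sketch for checking which GlueCEInfoLRMSType value a CE actually publishes, as discussed above. The helper name published_lrms_types is invented here, and the host name, port and base DN in the usage comment are assumptions that have to be adapted to the site.

```sh
#!/bin/sh
# published_lrms_types
# Reads LDIF on stdin and prints each distinct GlueCEInfoLRMSType value
# on its own line.
published_lrms_types() {
    sed -n 's/^GlueCEInfoLRMSType:[[:space:]]*//p' | sort -u
}

# Hypothetical usage (ce.example.org, port 2135 and the base DN are
# placeholders, not values taken from this thread):
#   ldapsearch -x -LLL -H ldap://ce.example.org:2135 \
#       -b "mds-vo-name=local,o=grid" GlueCEInfoLRMSType | published_lrms_types
```

If the output is torque rather than pbs, the CE is in the situation described in the message above.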
>>> Kostas Georgakopoulos wrote:
>>>>  Hi all,
>>>>  I configured our site (GR-02-UoM) for MPI support following the 
>>>> instructions in 
>>>> (torque is the job manager for us) and everything seems to be 
>>>> OK. However, I tried executing the test job from 
>>>> and the job gets stuck on one of the workers until the proxy 
>>>> certificate expires. The command used to submit the job was:
>>>> edg-job-submit --vo dteam --lrms pbs -r 
>>>> MPItest.jdl
>>>> Does anyone have any idea what the problem might be? (I include the 
>>>> files below.)
>>>> Best regards
>>>> Kostas Georgakopoulos
>>>> University of Macedonia
>>>> MPItest.jdl:
>>>> Type = "Job";
>>>> JobType = "MPICH";
>>>> NodeNumber = 8;
>>>> Executable = "";
>>>> Arguments = "MPItest";
>>>> StdOutput = "test.out";
>>>> StdError = "test.err";
>>>> InputSandbox = {"","MPItest.c"};
>>>> OutputSandbox = {"test.err","test.out","mpiexec.out"};
>>>> #!/bin/sh -x
>>>> # the binary to execute
>>>> EXE=$1
>>>> echo "***********************************************************************"
>>>> echo "Running on: $HOSTNAME"
>>>> echo "As:       " `whoami`
>>>> echo "***********************************************************************"
>>>> echo "***********************************************************************"
>>>> echo "Compiling binary: $EXE"
>>>> echo mpicc -o ${EXE} ${EXE}.c
>>>> mpicc -o ${EXE} ${EXE}.c
>>>> echo "*************************************"
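>>>> if [ "x$PBS_NODEFILE" != "x" ] ; then
>>>> echo "PBS Nodefile: $PBS_NODEFILE"
>>>> # under PBS/torque, use the batch system's node file as the host list
>>>> HOST_NODEFILE=$PBS_NODEFILE
>>>> fi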
>>>> if [ "x$LSB_HOSTS" != "x" ] ; then
>>>> echo "LSF Hosts: $LSB_HOSTS"
>>>> HOST_NODEFILE=`pwd`/lsf_nodefile.$$
>>>> for host in ${LSB_HOSTS}
>>>> do
>>>>   echo $host >> ${HOST_NODEFILE}
>>>> done
>>>> fi
>>>> if [ "x$HOST_NODEFILE" = "x" ]; then
>>>> echo "No hosts file defined.  Exiting..."
>>>> # report failure to the caller instead of exiting with status 0
>>>> exit 1
>>>> fi
>>>> echo "***********************************************************************"
>>>> CPU_NEEDED=`cat $HOST_NODEFILE | wc -l`
>>>> echo "Node count: $CPU_NEEDED"
>>>> echo "Nodes in $HOST_NODEFILE: "
>>>> cat $HOST_NODEFILE
>>>> echo "***********************************************************************"
>>>> echo "***********************************************************************"
>>>> # hosts to test, one entry per line of the node file
>>>> NODES=`cat $HOST_NODEFILE`
>>>> echo "Checking ssh for each node:"
>>>> for host in ${NODES}
>>>> do
>>>> echo "Checking $host..."
>>>> ssh $host hostname
>>>> done
>>>> echo "***********************************************************************"
>>>> echo "***********************************************************************"
>>>> echo "Executing $EXE with mpiexec"
>>>> chmod 755 $EXE
>>>> mpiexec `pwd`/$EXE > mpiexec.out 2>&1
>>>> echo "***********************************************************************"
>>>> echo "***********************************************************************"
>>>> echo "Executing $EXE with mpirun"
>>>> chmod 755 $EXE
>>>> mpirun -np $CPU_NEEDED -machinefile $HOST_NODEFILE `pwd`/$EXE
>>>> echo "***********************************************************************"
>>>> MPItest.c:
>>>> /*  hello.c
>>>> *
>>>> *  Simple "Hello World" program in MPI.
>>>> *
>>>> */
>>>> #include "mpi.h"
>>>> #include <stdio.h>
>>>> int main(int argc, char *argv[])
>>>> {
>>>> int numprocs;  /* Number of processors */
>>>> int procnum;   /* Processor number */
>>>> /* Initialize MPI */
>>>> MPI_Init(&argc, &argv);
>>>> /* Find this processor number */
>>>> MPI_Comm_rank(MPI_COMM_WORLD, &procnum);
>>>> /* Find the number of processors */
>>>> MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
>>>> printf ("Hello world! from processor %d out of %d\n", procnum, 
>>>> numprocs);
>>>> /* Shut down MPI */
>>>> MPI_Finalize();
>>>> return 0;
>>>> }
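
The "Checking ssh for each node" loop in the quoted test script uses a plain ssh call, which can wait for a password prompt if host-based authentication is not in place. Whether that is what makes the job hang here is not established in this thread; the sketch below only shows one way to make such a check fail quickly instead of blocking. The function name check_nodes and the SSH override variable are invented for illustration; BatchMode and ConnectTimeout are standard OpenSSH client options.

```sh
#!/bin/sh
# check_nodes NODEFILE
# Tries a non-interactive ssh to every distinct host listed in NODEFILE.
# BatchMode=yes disables password prompts, ConnectTimeout bounds the wait.
# The SSH variable (hypothetical) lets the ssh binary be replaced for testing.
check_nodes() {
    nodefile=$1
    failed=0
    for host in $(sort -u "$nodefile"); do
        if ${SSH:-ssh} -n -o BatchMode=yes -o ConnectTimeout=10 \
                "$host" hostname >/dev/null 2>&1; then
            echo "ok: $host"
        else
            echo "FAILED: $host"
            failed=1
        fi
    done
    return $failed
}

# Example call inside a PBS/torque job:
#   check_nodes "$PBS_NODEFILE" || exit 1
```

Run from the first WN of a job, this should report FAILED for any node that cannot be reached without a password, rather than waiting until the proxy expires.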