Hello Cal,
LHC Computer Grid - Rollout wrote:
> However, I've run into another snag. When submitting via the
> grid, the standard error and standard output are lost. The
> reason is that the mpiexec streams always go to the PBS
> output streams. For a grid job, only the wrapper output goes
> there; the real job output is redirected to another file.
> The mpiexec man page implies that the output streams should
> go to the real streams unless the -nostdout flag is set. I
> don't see any difference in behavior with or without this flag.
In order to let mpiexec handle the standard error and output streams you need to patch PBS or Torque. The patch is already included in the latest version of Torque.
When I installed mpiexec no matching patch was available for the version of Torque we use, so I wrote a wrapper script that dumps the contents of the PBS output and error files to stdout and stderr. I can send you this script if you like.
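
The idea, sketched here as a small C program rather than the script itself (where the PBS output and error files end up depends on your local Torque configuration, so their paths are taken as arguments here), is simply to copy those files onto the job's own streams:

/* dump_pbs_output.c - rough illustration only, not the actual wrapper.
 * Copies the PBS output and error files, whose paths are given on the
 * command line, onto the job's own stdout and stderr.
 */
#include <stdio.h>

/* Copy the contents of one file onto an already open stream. */
static void dump_file(const char *path, FILE *out)
{
    FILE *in = fopen(path, "r");
    char buf[4096];
    size_t n;

    if (in == NULL)
        return;                      /* nothing spooled, skip quietly */
    while ((n = fread(buf, 1, sizeof buf, in)) > 0)
        fwrite(buf, 1, n, out);
    fclose(in);
}

int main(int argc, char *argv[])
{
    if (argc > 1)
        dump_file(argv[1], stdout);  /* PBS output file -> stdout */
    if (argc > 2)
        dump_file(argv[2], stderr);  /* PBS error file  -> stderr */
    return 0;
}

In the wrapper the copying would be done after mpiexec has finished, so whatever mpiexec wrote to the PBS streams still reaches the real stdout and stderr of the grid job.
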
Regards,
Fokke
--------
Fokke Dijkstra
High Performance Computing
SARA - Reken- en Netwerkdiensten http://www.sara.nl
Tel. +31 20 592 8004 Fax. +31 20 668 3167
> Fokke Dijkstra wrote:
>> Hello Cal,
>>
>> mpiexec is trying to use the Myrinet network, which you probably
>> don't have at your site. You can either call it with the option
>> "-comm p4", or recompile it with the configuration option
>> "--with-default-comm=mpich-p4".
>>
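To spell out the two alternatives from the quoted paragraph above (a sketch only; the remaining mpiexec arguments depend on how your job script calls it):

  mpiexec -comm p4 ./MPItest

or rebuild mpiexec after configuring it with

  ./configure --with-default-comm=mpich-p4

so that the p4 device becomes the default and no extra flag is needed at run time.
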
>> Also take care that you disable shared memory support (as you seem
>> to have done), because otherwise it won't work either (at least not
>> on SMP nodes). Enabling shared memory support is not very useful
>> within LCG anyway: the default MPICH comes without shared memory
>> support and as a static library, so most MPI programs running at
>> your site will not use shared memory communication.
>>
>> Regards,
>>
>> Fokke Dijkstra
>>
>>
>> LHC Computer Grid - Rollout wrote:
>> > Hello,
>> >
>> > I'm trying to get mpiexec running on my site (LAL). I've managed
>> > to get it compiled and running, but the simplest hello world job
>> > returns incorrect output. The process just prints the process'
>> > rank and the total number of processes.
>> >
>> > For mpirun I get:
>> >
>> > Hello world! from processor 4 out of 5
>> > Hello world! from processor 2 out of 5
>> > Hello world! from processor 3 out of 5
>> > Hello world! from processor 1 out of 5
>> > Hello world! from processor 0 out of 5
>> >
>> > Whereas for mpiexec I get:
>> >
>> > Hello world! from processor 0 out of 1
>> > Hello world! from processor 0 out of 1
>> > Hello world! from processor 0 out of 1
>> > Hello world! from processor 0 out of 1
>> > Hello world! from processor 0 out of 1
>> >
>> > The correct number of lines, but not the correct information. If I
>> > use mpiexec with the verbose flag, it does seem to be connecting to
>> > and starting the processes on the correct machines. Any help with
>> > this would be appreciated.
>> >
>> > Cal
>> >
>> >
>> > P.S.
>> >
>> > Both MPICH and mpiexec were compiled without shared memory
>> > communications on SMP nodes.
>> > The code for the job is:
>> >
>> > /* hello.c
>> >  *
>> >  * Simple "Hello World" program in MPI.
>> >  *
>> >  */
>> >
>> > #include "mpi.h"
>> > #include <stdio.h>
>> >
>> > int main(int argc, char *argv[])
>> > {
>> >     int numprocs;   /* Number of processors */
>> >     int procnum;    /* Processor number */
>> >
>> >     /* Initialize MPI */
>> >     MPI_Init(&argc, &argv);
>> >     /* Find this processor number */
>> >     MPI_Comm_rank(MPI_COMM_WORLD, &procnum);
>> >     /* Find the number of processors */
>> >     MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
>> >     printf("Hello world! from processor %d out of %d\n",
>> >            procnum, numprocs);
>> >     /* Shut down MPI */
>> >     MPI_Finalize();
>> >     return 0;
>> > }
>> >
>> >
>> > The mpiexec verbose output is:
>> >
>> > resolve_exe: prefixing dot to executable: "./MPItest"
>> > node 0: name = grid18.lal.in2p3.fr, mpname = grid18.lal.in2p3.fr, cpu = 1
>> > node 1: name = grid18.lal.in2p3.fr, mpname = grid18.lal.in2p3.fr, cpu = 0
>> > node 2: name = grid17.lal.in2p3.fr, mpname = grid17.lal.in2p3.fr, cpu = 1
>> > node 3: name = grid17.lal.in2p3.fr, mpname = grid17.lal.in2p3.fr, cpu = 0
>> > node 4: name = grid16.lal.in2p3.fr, mpname = grid16.lal.in2p3.fr, cpu = 0
>> > Hello world! from processor 0 out of 1
>> > Hello world! from processor 0 out of 1
>> > Hello world! from processor 0 out of 1
>> > Hello world! from processor 0 out of 1
>> > Hello world! from processor 0 out of 1
>> > wait_one_task_start: evt = 2, task 0 host grid18.lal.in2p3.fr
>> > wait_one_task_start: evt = 3, task 1 host grid18.lal.in2p3.fr
>> > wait_one_task_start: evt = 4, task 2 host grid17.lal.in2p3.fr
>> > wait_one_task_start: evt = 5, task 3 host grid17.lal.in2p3.fr
>> > wait_one_task_start: evt = 6, task 4 host grid16.lal.in2p3.fr
>> > All 5 tasks started.
>> > read_gm_startup_ports: waiting for info
>> > wait_tasks: waiting for grid18.lal.in2p3.fr/1 grid18.lal.in2p3.fr/0 grid17.lal.in2p3.fr/1 grid17.lal.in2p3.fr/0 grid16.lal.in2p3.fr/0
>> > wait_tasks: numspawned = 5, got evt 7 for tid 2 host grid18.lal.in2p3.fr status 0
>> > wait_tasks: waiting for grid18.lal.in2p3.fr/0 grid17.lal.in2p3.fr/1 grid17.lal.in2p3.fr/0 grid16.lal.in2p3.fr/0
>> > wait_tasks: numspawned = 4, got evt 8 for tid 3 host grid18.lal.in2p3.fr status 0
>> > wait_tasks: waiting for grid17.lal.in2p3.fr/1 grid17.lal.in2p3.fr/0 grid16.lal.in2p3.fr/0
>> > wait_tasks: numspawned = 3, got evt 9 for tid 4 host grid17.lal.in2p3.fr status 0
>> > wait_tasks: waiting for grid17.lal.in2p3.fr/0 grid16.lal.in2p3.fr/0
>> > wait_tasks: numspawned = 2, got evt 10 for tid 5 host grid17.lal.in2p3.fr status 0
>> > wait_tasks: waiting for grid16.lal.in2p3.fr/0
>> > wait_tasks: numspawned = 1, got evt 11 for tid 6 host grid16.lal.in2p3.fr status 0
>>
>>
>>