Hi,
I work with Valérie.
I don’t think the problem comes from the “-n” argument: by default, mpirun launches one copy of the program on each available process slot (so on this 15-slot node, “mpirun ./prog” behaves like “mpirun -n 15 ./prog”).
This is confirmed by running this simple hello_mpi program [1]:
$ mpirun ./hello_mpi.exe
Hello world from processor node063.cluster.lbt, rank 1 out of 15 processors
Hello world from processor node063.cluster.lbt, rank 9 out of 15 processors
Hello world from processor node063.cluster.lbt, rank 12 out of 15 processors
Hello world from processor node063.cluster.lbt, rank 6 out of 15 processors
Hello world from processor node063.cluster.lbt, rank 3 out of 15 processors
Hello world from processor node063.cluster.lbt, rank 11 out of 15 processors
Hello world from processor node063.cluster.lbt, rank 10 out of 15 processors
Hello world from processor node063.cluster.lbt, rank 14 out of 15 processors
Hello world from processor node063.cluster.lbt, rank 0 out of 15 processors
Hello world from processor node063.cluster.lbt, rank 13 out of 15 processors
Hello world from processor node063.cluster.lbt, rank 2 out of 15 processors
Hello world from processor node063.cluster.lbt, rank 4 out of 15 processors
Hello world from processor node063.cluster.lbt, rank 5 out of 15 processors
Hello world from processor node063.cluster.lbt, rank 7 out of 15 processors
Hello world from processor node063.cluster.lbt, rank 8 out of 15 processors
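For reference, hello_mpi is essentially the code from the tutorial in [1]; the build line below assumes the usual mpicc compiler wrapper and a source file named hello_mpi.c:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    /* Initialize the MPI environment */
    MPI_Init(NULL, NULL);

    /* Get the number of processes and this process's rank */
    int world_size, world_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Get the processor name (the host name on most clusters) */
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    printf("Hello world from processor %s, rank %d out of %d processors\n",
           processor_name, world_rank, world_size);

    /* Every rank that called MPI_Init must also call MPI_Finalize */
    MPI_Finalize();
    return 0;
}

$ mpicc hello_mpi.c -o hello_mpi.exe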
RELION’s output confirms this, by the way:
=== RELION MPI setup ===
+ Number of MPI processes = 15
+ Number of threads per MPI process = 15
+ Total number of threads therefore = 225
+ Master (0) runs on host = node063.cluster.lbt
+ Slave 1 runs on host = node063.cluster.lbt
+ Slave 2 runs on host = node063.cluster.lbt
+ Slave 3 runs on host = node063.cluster.lbt
+ Slave 4 runs on host = node063.cluster.lbt
+ Slave 5 runs on host = node063.cluster.lbt
+ Slave 6 runs on host = node063.cluster.lbt
+ Slave 7 runs on host = node063.cluster.lbt
+ Slave 8 runs on host = node063.cluster.lbt
+ Slave 9 runs on host = node063.cluster.lbt
+ Slave 10 runs on host = node063.cluster.lbt
+ Slave 11 runs on host = node063.cluster.lbt
+ Slave 12 runs on host = node063.cluster.lbt
+ Slave 13 runs on host = node063.cluster.lbt
+ Slave 14 runs on host = node063.cluster.lbt
=================
uniqueHost node063.cluster.lbt has 14 ranks.
Any ideas?
Cheers,
Ben
[1] http://mpitutorial.com/tutorials/mpi-hello-world/
> On 27 Nov 2017, at 22:08, Sjors Scheres <[log in to unmask]> wrote:
>
> Hi Valerie,
> Your submission script does not generate the -n argument of mpirun. You
> can run autopick with the default single process, but you need at least 3
> MPI processes for auto-refine. If you used a template qsub script, modify that.
> HTH,
> Sjors
>
>
>> Hello
>>
>> I am new to GPUs and RELION and I have a strange problem that occurs when
>> I submit jobs to the GPU cluster.
>> We have RELION 2.1b1 with Open MPI version 1.6 and I am going through the
>> RELION 2.1 tutorial dataset.
>> 1) Autopick runs OK with
>> mpirun relion_autopick_mpi --i ./Select/job019/micrographs_selected.star \
>> --ref Select/job015/class_averages.star \
>> --odir AutoPick/job021/ \
>> --pickname autopick \
>> --invert --ctf --ang 5 --shrink 0 --lowpass 20 \
>> --threshold 0.5 \
>> --min_distance 100 \
>> --max_stddev_noise 1.1 \
>> --write_fom_maps \
>> --gpu ""
>>
>> 2) But Refine3D does not accept the --gpu option:
>> mpirun relion_refine_mpi --o Refine3D/job045/run \
>> --auto_refine --split_random_halves \
>> --i ./Select/job043/particles.star \
>> --ref Class3D/job041/run_it025_class001.mrc \
>> --firstiter_cc --ini_high 50 \
>> --dont_combine_weights_via_disc \
>> --preread_images --pool 3 --ctf \
>> --particle_diameter 200 --flatten_solvent \
>> --zero_mask --oversampling 1 --healpix_order 2 \
>> --auto_local_healpix_order 4 --offset_range 5 \
>> --offset_step 2 --sym D2 --low_resol_join_halves 40 --norm --scale --j 15 \
>> --gpu ""
>>
>> It produces a run.err file that contains
>> mpirun has exited due to process rank 0 with PID 8829 on
>> node node063.cluster.lbt exiting improperly. There are two reasons this
>> could occur:
>>
>> 1. this process did not call "init" before exiting, but others in the job
>> did. This can cause a job to hang indefinitely while it waits for all
>> processes to call "init". By rule, if one process calls "init", then ALL
>> processes must call "init" prior to termination.
>>
>> 2. this process called "init", but exited without calling "finalize". By
>> rule, all processes that call "init" MUST call "finalize" prior to exiting
>> or it will be considered an "abnormal termination"
>>
>> This may have caused other processes in the application to be terminated
>> by signals sent by mpirun (as reported here).
>>
>>
>> I don’t understand this message. The Refine3D job runs OK if I remove the
>> --gpu option.
>>
>> Thanks for any suggestions!
>> best regards,
>> Valerie
>>
>> Valérie Biou [log in to unmask]
>> Laboratoire de Biologie Physico-Chimique des Protéines Membranaires
>> UMR 7099 CNRS/Univ. Paris Diderot P7
>> Institut de Biologie Physico-Chimique
>> 13 rue Pierre et Marie Curie
>> 75005 Paris - France
>> Tel : +33 (0)1 5841 5099
>> Fax : +33 (0)1 5841 5024
>
>
> --
> Sjors Scheres
> MRC Laboratory of Molecular Biology
> Francis Crick Avenue, Cambridge Biomedical Campus
> Cambridge CB2 0QH, U.K.
> tel: +44 (0)1223 267061
> http://www2.mrc-lmb.cam.ac.uk/groups/scheres