Hi,
> Are odd numbers of GPUs supported? I remember encountering various
> errors (empty groups in half sets and then zero sum of weights) a few
> weeks ago when attempting to use 3 GPUs for 3D auto-refine. 2 and 4 were
> fine, so I assume it's the GPU number that was the problem.
First, the number of MPI processes must be odd.
The first rank (= process) is the master, and the remaining ranks are
split equally between half1 and half2.
When you have 3 GPUs, you have to use 3 MPI processes, not 4.
For example:
rank 0 -- master
rank 1 -- half1, using GPU 0,2
rank 2 -- half2, using GPU 1,2
Thus, one GPU (here GPU 2) has to handle both half1 and half2
(--gpu "0,2:1,2"). Because that GPU must hold data for both half-sets
in memory, this might limit the size of the box you can use.
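On the command line this would look something like the sketch below
(file names, output path and thread count are placeholders; other
required refinement options are omitted for brevity):
  mpirun -n 3 relion_refine_mpi \
    --i particles.star --ref reference.mrc --o Refine3D/job001/run \
    --auto_refine --split_random_halves --gpu "0,2:1,2" --j 4
The colon in --gpu separates the device lists for the two follower
ranks: rank 1 (half1) gets GPUs 0 and 2, rank 2 (half2) gets GPUs 1
and 2. Rank 0 (the master) does not use a GPU.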
Best regards,
Takanori Nakane
On 2019/09/14 11:08, Basil Greber wrote:
> Hi Takanori,
>
> Are odd numbers of GPUs supported? I remember encountering various
> errors (empty groups in half sets and then zero sum of weights) a few
> weeks ago when attempting to use 3 GPUs for 3D auto-refine. 2 and 4 were
> fine, so I assume it's the GPU number that was the problem.
>
> Best,
>
> Basil
>
> Takanori Nakane <[log in to unmask]> wrote on Sat, 14 Sep 2019 at 11:04:
>
> Hi,
>
> You have to use at least 3 MPI processes for Refine3D.
> One master, one for half 1, one for half 2.
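> With a single GPU, both follower ranks share it. A sketch (file names
> and thread count are placeholders; other required refinement options
> omitted):
>   mpirun -n 3 relion_refine_mpi \
>     --i particles.star --ref reference.mrc --o Refine3D/job001/run \
>     --auto_refine --split_random_halves --gpu 0 --j 4
> With --gpu 0, both half1 and half2 run on GPU 0; the master rank
> does not use a GPU.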
>
> Best regards,
>
> Takanori Nakane
>
> On 2019/09/14 9:54, Tian Li wrote:
> > Hi all,
> >
> > I was running RELION 3D refinement, but I came across an MPI issue.
> > We have 1 GPU on our workstation; we were able to run 2D and 3D
> > classification without any issues, but 3D refinement fails.
> >
> > If we used Number of MPI procs: 1, the error message was as follows:
> >
> > in: /home/dell/relion/src/ml_optimiser.cpp, line 2417
> > === Backtrace ===
> > /usr/local/bin/relion_refine(_ZN11RelionErrorC1ERKSsS1_l+0x41) [0x43d6b1]
> > /usr/local/bin/relion_refine(_ZN11MlOptimiser7iterateEv+0x92d) [0x48d74d]
> > /usr/local/bin/relion_refine(main+0xb0d) [0x42b08d]
> > /usr/lib64/libc.so.6(__libc_start_main+0xf5) [0x7fab06e22495]
> > /usr/local/bin/relion_refine() [0x42e3cf]
> > ==================
> > ERROR:
> > ERROR: Cannot split data into random halves without using MPI!
> >
> > If we used Number of MPI procs: 2, the error message was as follows:
> >
> > [localhost.localdomain:398986] PMIX ERROR: UNPACK-PAST-END in file unpack.c at line 206
> > [localhost.localdomain:398986] PMIX ERROR: UNPACK-PAST-END in file unpack.c at line 147
> > [localhost.localdomain:398986] PMIX ERROR: UNPACK-PAST-END in file client/pmix_client.c at line 225
> > [localhost.localdomain:398986] OPAL ERROR: Error in file pmix3x_client.c at line 112
> > *** An error occurred in MPI_Init
> > *** on a NULL communicator
> > *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> > *** and potentially your MPI job)
> > [localhost.localdomain:398986] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
> > [localhost.localdomain:398987] PMIX ERROR: UNPACK-PAST-END in file unpack.c at line 206
> > [localhost.localdomain:398987] PMIX ERROR: UNPACK-PAST-END in file unpack.c at line 147
> > [localhost.localdomain:398987] PMIX ERROR: UNPACK-PAST-END in file client/pmix_client.c at line 225
> > *** An error occurred in MPI_Init
> > *** on a NULL communicator
> > *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> > *** and potentially your MPI job)
> > [localhost.localdomain:398987] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
> > [localhost.localdomain:398987] OPAL ERROR: Error in file pmix3x_client.c at line 112
> >
> >
> > Our OpenMPI version is 4.0.1, and I was able to test OpenMPI itself
> > successfully. I'm wondering if anyone has come across the same issue?
> > Does the error mean something is wrong with my OpenMPI installation?
> >
> > Thanks,
> > Tian