

I haven't tested a full run, but this seems to be working. Before the change it would crash immediately after estimating the initial noise spectra; now it is successfully running GPU tasks from MPI processes.


Many thanks,


Chris



From: Bjoern Forsberg <[log in to unmask]>
Sent: 30 March 2017 13:07
To: Chris Richardson; [log in to unmask]
Subject: Re: [ccpem] MPI error
 

Hi Chris,


I believe it's a result of some missing macros/definitions for some complex data types in different MPI flavors/versions. I had a case where someone solved this type of issue by changing line 77 of src/macros.h from

#define MY_MPI_COMPLEX MPI_DOUBLE_COMPLEX

to

#define MY_MPI_COMPLEX MPI_C_DOUBLE_COMPLEX


Let us know if that works for you too.
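
For anyone who prefers to script the change above rather than edit by hand, here is a minimal sketch. The function name patch_mpi_complex is hypothetical, and it assumes the define appears in src/macros.h as quoted above; it matches on the text of the define rather than on line 77, since the line number may differ between versions.

```shell
# Sketch: switch MY_MPI_COMPLEX to MPI_C_DOUBLE_COMPLEX in a RELION source tree.
# patch_mpi_complex is a hypothetical helper; pass it the path to src/macros.h.
patch_mpi_complex() {
    macros_file="$1"

    # Keep a backup copy (<file>.orig) so the change is easy to revert.
    cp "$macros_file" "$macros_file.orig"

    # Rewrite only the MY_MPI_COMPLEX define, tolerating spaces or tabs
    # between the tokens. Other lines are copied through unchanged.
    sed 's/^\(#define[[:space:]]\{1,\}MY_MPI_COMPLEX[[:space:]]\{1,\}\)MPI_DOUBLE_COMPLEX/\1MPI_C_DOUBLE_COMPLEX/' \
        "$macros_file.orig" > "$macros_file"

    # Print the resulting definition for a quick visual check.
    grep -n 'define[[:space:]]\{1,\}MY_MPI_COMPLEX' "$macros_file"
}
```

Run it as `patch_mpi_complex src/macros.h` from the top of the source tree, then rebuild so the binaries pick up the new definition. If the define is not found, the file is left unchanged and nothing is printed.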


Cheers,


/Björn


On 03/30/2017 01:49 PM, Chris Richardson wrote:

Ernesto,


Did you find a solution to your issues?


I'm getting the same error when compiling v2.0.5 (Ubuntu 16.04; CUDA 8.0 compiled at 52; openmpi 2.0.1; 4 x Titan X Pascal).  Compiling v2.0.3 stable on the same machine in the same way works without error.


Regards,


Chris



From: Collaborative Computational Project in Electron cryo-Microscopy <[log in to unmask]> on behalf of Ernesto Arias <[log in to unmask]>
Sent: 18 March 2017 00:24
To: [log in to unmask]
Subject: [ccpem] MPI error
 
Hi,

I am having some issues with relion_refine_mpi. I am using relion v2.0.5 on a machine running Ubuntu 14.04 with CUDA 8.0 and openmpi-2.0.2. I can run gctf using MPI, but I get an error when I try to run a 2D or 3D classification.

If I run:

mpirun -n 5 `which relion_refine_mpi` --o Class2D/job022/run --i ./Extract/job008/particles.star --dont_combine_weights_via_disc --no_parallel_disc_io --preread_images  --pool 10 --ctf  --iter 25 --tau2_fudge 2 --particle_diameter 220 --K 50 --flatten_solvent  --zero_mask  --oversampling 1 --psi_step 12 --offset_range 5 --offset_step 2 --norm --scale  --j 1 --gpu ""

I get this error message:

  1: MPI_ERR_TYPE: invalid datatype
  1: MPI_ERR_TYPE: invalid datatype
  2: MPI_ERR_TYPE: invalid datatype
  2: MPI_ERR_TYPE: invalid datatype
  3: MPI_ERR_TYPE: invalid datatype
  3: MPI_ERR_TYPE: invalid datatype
  4: MPI_ERR_TYPE: invalid datatype
  4: MPI_ERR_TYPE: invalid datatype
terminate called after throwing an instance of 'RelionError'
terminate called after throwing an instance of 'RelionError'
[ubuntu:05596] *** Process received signal ***
terminate called after throwing an instance of 'RelionError'
[ubuntu:05597] *** Process received signal ***
[ubuntu:05598] *** Process received signal ***
[ubuntu:05598] Signal: Aborted (6)
[ubuntu:05598] Signal code:  (-6)
[ubuntu:05597] Signal: Aborted (6)
[ubuntu:05597] Signal code:  (-6)
[ubuntu:05596] Signal: Aborted (6)
[ubuntu:05596] Signal code:  (-6)
[ubuntu:05596] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x7fe484847330]
[ubuntu:05596] [ 1] [ubuntu:05598] [ 0] [ubuntu:05597] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x7f622e526330]
[ubuntu:05597] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7fe4844a8c37]
[ubuntu:05596] [ 2] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x7f7fd2524330]
[ubuntu:05598] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7f622e187c37]
[ubuntu:05597] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7fe4844ac028]
[ubuntu:05596] [ 3] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7f7fd2185c37]
[ubuntu:05598] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f622e18b028]
[ubuntu:05597] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f7fd2189028]
[ubuntu:05598] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x155)[0x7fe484ccb535]
[ubuntu:05596] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x155)[0x7f622e9aa535]
[ubuntu:05597] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x155)[0x7f7fd29a8535]
[ubuntu:05598] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x5e6d6)[0x7fe484cc96d6]
[ubuntu:05596] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x5e6d6)[0x7f622e9a86d6]
[ubuntu:05597] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x5e6d6)[0x7f7fd29a66d6]
[ubuntu:05598] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x5e703)[0x7fe484cc9703]
[ubuntu:05596] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x5e703)[0x7f7fd29a6703]
[ubuntu:05598] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x5e703)[0x7f622e9a8703]
[ubuntu:05597] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x5e922)[0x7fe484cc9922]
[ubuntu:05596] [ 7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x5e922)[0x7f622e9a8922]
[ubuntu:05597] [ 7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x5e922)[0x7f7fd29a6922]
[ubuntu:05598] [ 7] /home/ernesto/programs/relion/build/lib/librelion_lib.so(_ZN7MpiNode16report_MPI_ERROREi+0x136)[0x7fe4854c6656]
[ubuntu:05596] *** End of error message ***
/home/ernesto/programs/relion/build/lib/librelion_lib.so(_ZN7MpiNode16report_MPI_ERROREi+0x136)[0x7f622f1a5656]
[ubuntu:05597] *** End of error message ***
/home/ernesto/programs/relion/build/lib/librelion_lib.so(_ZN7MpiNode16report_MPI_ERROREi+0x136)[0x7f7fd31a3656]
[ubuntu:05598] *** End of error message ***
terminate called after throwing an instance of 'RelionError'
[ubuntu:05595] *** Process received signal ***
[ubuntu:05595] Signal: Aborted (6)
[ubuntu:05595] Signal code:  (-6)
[ubuntu:05595] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x7f94519ab330]
[ubuntu:05595] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7f945160cc37]
[ubuntu:05595] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f9451610028]
[ubuntu:05595] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x155)[0x7f9451e2f535]
[ubuntu:05595] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x5e6d6)[0x7f9451e2d6d6]
[ubuntu:05595] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x5e703)[0x7f9451e2d703]
[ubuntu:05595] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x5e922)[0x7f9451e2d922]
[ubuntu:05595] [ 7] /home/ernesto/programs/relion/build/lib/librelion_lib.so(_ZN7MpiNode16report_MPI_ERROREi+0x136)[0x7f945262a656]
[ubuntu:05595] *** End of error message ***


Does anybody know what could be the issue?

Thank you in advance for the help,
Ernesto.


The Institute of Cancer Research: Royal Cancer Hospital, a charitable Company Limited by Guarantee, Registered in England under Company No. 534147 with its Registered Office at 123 Old Brompton Road, London SW7 3RP.

This e-mail message is confidential and for use by the addressee only. If the message is received by anyone other than the addressee, please return the message to the sender by replying to it and then delete the message from your computer and network.

