Great, unless I hear from you again I'll take that to mean we should 
establish what definition we can use consistently. We should anyway, but 
it's nice to have a handle on things. Thanks for the quick feedback!
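
One way such a consistent definition could look, as a minimal sketch rather than the actual RELION fix: pick the datatype from the MPI standard version that mpi.h reports through MPI_VERSION and MPI_SUBVERSION. This assumes MPI_C_DOUBLE_COMPLEX is usable whenever the library reports MPI 2.2 or later (the version of the standard that introduced it). HAS_MPI_C_COMPLEX is a hypothetical helper name, and the __has_include guard is only there so the fragment also compiles on a machine without MPI.

```cpp
// Hypothetical replacement for the single #define in src/macros.h.
// Include mpi.h only if the compiler can find it.
#if defined(__has_include)
#  if __has_include(<mpi.h>)
#    include <mpi.h>
#  endif
#endif

// Assumption: MPI_C_DOUBLE_COMPLEX exists from MPI 2.2 onwards.
// Written as a function-like macro so it can be used inside #if.
#define HAS_MPI_C_COMPLEX(major, minor) \
    ((major) > 2 || ((major) == 2 && (minor) >= 2))

// MPI_VERSION and MPI_SUBVERSION are defined by mpi.h itself.
// If mpi.h was not found, MY_MPI_COMPLEX is left undefined.
#if defined(MPI_VERSION) && defined(MPI_SUBVERSION)
#  if HAS_MPI_C_COMPLEX(MPI_VERSION, MPI_SUBVERSION)
#    define MY_MPI_COMPLEX MPI_C_DOUBLE_COMPLEX
#  else
#    define MY_MPI_COMPLEX MPI_DOUBLE_COMPLEX
#  endif
#endif
```

Whether a version check alone is enough across MPI flavors is untested here, so treat it as a starting point for discussion.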


/Björn


On 03/30/2017 02:26 PM, Chris Richardson wrote:
>
> I haven't tested a full run, but this seems to be working - before it 
> would crash immediately after estimating initial noise spectra, now it 
> is successfully running GPU tasks from MPI processes.
>
>
> Many thanks,
>
>
> Chris
>
>
> ------------------------------------------------------------------------
> *From:* Bjoern Forsberg <[log in to unmask]>
> *Sent:* 30 March 2017 13:07
> *To:* Chris Richardson; [log in to unmask]
> *Subject:* Re: [ccpem] MPI error
>
> Hi Chris,
>
>
> I believe it's a result of some missing macros/definitions for some 
> complex data types in different MPI flavors/versions. I had a case 
> where someone solved this type of issue by changing line 77 of 
> src/macros.h from
>
> #define MY_MPI_COMPLEX MPI_DOUBLE_COMPLEX
>
> to
>
> #define MY_MPI_COMPLEX MPI_C_DOUBLE_COMPLEX
>
>
> Let us know if that works for you too.
>
>
> Cheers,
>
>
> /Björn
>
>
> On 03/30/2017 01:49 PM, Chris Richardson wrote:
>>
>> Ernesto,
>>
>>
>> Did you find a solution to your issues?
>>
>>
>> I'm getting the same error when compiling v2.0.5 (Ubuntu 16.04; CUDA 
>> 8.0 compiled at 52; openmpi 2.0.1; 4 x Titan X Pascal).  Compiling 
>> v2.0.3 stable on the same machine in the same way works without error.
>>
>>
>> Regards,
>>
>>
>> Chris
>>
>>
>> ------------------------------------------------------------------------
>> *From:* Collaborative Computational Project in Electron 
>> cryo-Microscopy <[log in to unmask]> on behalf of Ernesto Arias 
>> <[log in to unmask]>
>> *Sent:* 18 March 2017 00:24
>> *To:* [log in to unmask]
>> *Subject:* [ccpem] MPI error
>> Hi,
>>
>> I am having some issues with relion_refine_mpi. I am using relion 
>> v2.0.5 on a machine running Ubuntu 14.04 with CUDA 8.0 and 
>> openmpi-2.0.2. I can run gctf using MPI, but I get an error when I 
>> try to run a 2D or 3D classification.
>>
>> If I run:
>>
>> mpirun -n 5 `which relion_refine_mpi` --o Class2D/job022/run --i 
>> ./Extract/job008/particles.star --dont_combine_weights_via_disc 
>> --no_parallel_disc_io --preread_images  --pool 10 --ctf  --iter 25 
>> --tau2_fudge 2 --particle_diameter 220 --K 50 --flatten_solvent 
>> --zero_mask  --oversampling 1 --psi_step 12 --offset_range 5 
>> --offset_step 2 --norm --scale --j 1 --gpu ""
>>
>> I get this error message:
>>
>>   1: MPI_ERR_TYPE: invalid datatype
>>   1: MPI_ERR_TYPE: invalid datatype
>>   2: MPI_ERR_TYPE: invalid datatype
>>   2: MPI_ERR_TYPE: invalid datatype
>>   3: MPI_ERR_TYPE: invalid datatype
>>   3: MPI_ERR_TYPE: invalid datatype
>>   4: MPI_ERR_TYPE: invalid datatype
>>   4: MPI_ERR_TYPE: invalid datatype
>> terminate called after throwing an instance of 'RelionError'
>> terminate called after throwing an instance of 'RelionError'
>> [ubuntu:05596] *** Process received signal ***
>> terminate called after throwing an instance of 'RelionError'
>> [ubuntu:05597] *** Process received signal ***
>> [ubuntu:05598] *** Process received signal ***
>> [ubuntu:05598] Signal: Aborted (6)
>> [ubuntu:05598] Signal code:  (-6)
>> [ubuntu:05597] Signal: Aborted (6)
>> [ubuntu:05597] Signal code:  (-6)
>> [ubuntu:05596] Signal: Aborted (6)
>> [ubuntu:05596] Signal code:  (-6)
>> [ubuntu:05596] [ 0] 
>> /lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x7fe484847330]
>> [ubuntu:05596] [ 1] [ubuntu:05598] [ 0] [ubuntu:05597] [ 0] 
>> /lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x7f622e526330]
>> [ubuntu:05597] [ 1] 
>> /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7fe4844a8c37]
>> [ubuntu:05596] [ 2] 
>> /lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x7f7fd2524330]
>> [ubuntu:05598] [ 1] 
>> /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7f622e187c37]
>> [ubuntu:05597] [ 2] 
>> /lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7fe4844ac028]
>> [ubuntu:05596] [ 3] 
>> /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7f7fd2185c37]
>> [ubuntu:05598] [ 2] 
>> /lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f622e18b028]
>> [ubuntu:05597] [ 3] 
>> /lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f7fd2189028]
>> [ubuntu:05598] [ 3] 
>> /usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x155)[0x7fe484ccb535]
>> [ubuntu:05596] [ 4] 
>> /usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x155)[0x7f622e9aa535]
>> [ubuntu:05597] [ 4] 
>> /usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x155)[0x7f7fd29a8535]
>> [ubuntu:05598] [ 4] 
>> /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x5e6d6)[0x7fe484cc96d6]
>> [ubuntu:05596] [ 5] 
>> /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x5e6d6)[0x7f622e9a86d6]
>> [ubuntu:05597] [ 5] 
>> /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x5e6d6)[0x7f7fd29a66d6]
>> [ubuntu:05598] [ 5] 
>> /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x5e703)[0x7fe484cc9703]
>> [ubuntu:05596] [ 6] 
>> /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x5e703)[0x7f7fd29a6703]
>> [ubuntu:05598] [ 6] 
>> /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x5e703)[0x7f622e9a8703]
>> [ubuntu:05597] [ 6] 
>> /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x5e922)[0x7fe484cc9922]
>> [ubuntu:05596] [ 7] 
>> /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x5e922)[0x7f622e9a8922]
>> [ubuntu:05597] [ 7] 
>> /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x5e922)[0x7f7fd29a6922]
>> [ubuntu:05598] [ 7] 
>> /home/ernesto/programs/relion/build/lib/librelion_lib.so(_ZN7MpiNode16report_MPI_ERROREi+0x136)[0x7fe4854c6656]
>> [ubuntu:05596] *** End of error message ***
>> /home/ernesto/programs/relion/build/lib/librelion_lib.so(_ZN7MpiNode16report_MPI_ERROREi+0x136)[0x7f622f1a5656]
>> [ubuntu:05597] *** End of error message ***
>> /home/ernesto/programs/relion/build/lib/librelion_lib.so(_ZN7MpiNode16report_MPI_ERROREi+0x136)[0x7f7fd31a3656]
>> [ubuntu:05598] *** End of error message ***
>> terminate called after throwing an instance of 'RelionError'
>> [ubuntu:05595] *** Process received signal ***
>> [ubuntu:05595] Signal: Aborted (6)
>> [ubuntu:05595] Signal code:  (-6)
>> [ubuntu:05595] [ 0] 
>> /lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x7f94519ab330]
>> [ubuntu:05595] [ 1] 
>> /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7f945160cc37]
>> [ubuntu:05595] [ 2] 
>> /lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f9451610028]
>> [ubuntu:05595] [ 3] 
>> /usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x155)[0x7f9451e2f535]
>> [ubuntu:05595] [ 4] 
>> /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x5e6d6)[0x7f9451e2d6d6]
>> [ubuntu:05595] [ 5] 
>> /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x5e703)[0x7f9451e2d703]
>> [ubuntu:05595] [ 6] 
>> /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x5e922)[0x7f9451e2d922]
>> [ubuntu:05595] [ 7] 
>> /home/ernesto/programs/relion/build/lib/librelion_lib.so(_ZN7MpiNode16report_MPI_ERROREi+0x136)[0x7f945262a656]
>> [ubuntu:05595] *** End of error message ***
>>
>>
>>
>> Does anybody know what could be the issue?
>>
>> Thank you in advance for the help,
>> Ernesto.
>>
>>
>> The Institute of Cancer Research: Royal Cancer Hospital, a charitable 
>> Company Limited by Guarantee, Registered in England under Company No. 
>> 534147 with its Registered Office at 123 Old Brompton Road, London 
>> SW7 3RP.
>>
>> This e-mail message is confidential and for use by the addressee 
>> only. If the message is received by anyone other than the addressee, 
>> please return the message to the sender by replying to it and then 
>> delete the message from your computer and network.
>
>