Eugene,

 

I’ve had a similar problem with Open MPI on Ubuntu in the past. It seems to be due to an incompatibility between the MPI datatype definitions in src/macros.h in the RELION code and those in Open MPI.

 

There’s a solution that worked for me here: https://github.com/3dem/relion/issues/234
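If it helps, one way to confirm the mismatch before rebuilding is to check which Open MPI your RELION binaries actually resolve at run time versus what is first on your PATH (this assumes relion_refine_mpi and the Open MPI tools are on the PATH):

```shell
# Which MPI shared library relion_refine_mpi links against at run time
ldd "$(which relion_refine_mpi)" | grep -i mpi

# Version and build configuration of the Open MPI found on the PATH
ompi_info | head -n 20

# The compiler wrapper CMake would pick up when RELION is configured
mpicxx --version
```

If the library reported by ldd and the one reported by ompi_info come from different installs, the build and runtime environments have diverged.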

 

Maybe it will be fixed in 3.0.

 

Regards,

 

Chris

-- 

Dr Chris Richardson :: Sysadmin, structural biology, icr.ac.uk

On 30/07/2018, 11:42, "Collaborative Computational Project in Electron cryo-Microscopy on behalf of Eugene Valkov" <[log in to unmask] on behalf of [log in to unmask]> wrote:

 

Dear all,

 

We are in the process of testing a new workstation for cryoEM work.

 

Hardware:

AMD Ryzen Threadripper 1950X, 16-core 3.40GHz

4 x GeForce GTX 1080 Ti

128 GB (8 x 16 GB) DDR4 2666 MHz

250 GB Samsung SSD

 

Software:

Ubuntu 18.04.1 LTS 64-bit

Open MPI 3.1.1

CUDA 9.2

RELION 2.1 (git version)

 

Using the Plasmodium ribosome dataset, RELION reproducibly terminates after completing the first step (estimating initial noise spectra). The error output is at the end of this message.

 

The following command line arguments were used:

 

mpirun -n 5 `which relion_refine_mpi` --i Particles/shiny_2sets.star --o Class3D/ --ref emd_2660.map:mrc --firstiter_cc --ini_high 60 --ctf --ctf_corrected_ref --iter 25 --tau2_fudge 4 --particle_diameter 360 --K 6 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --offset_range 5 --offset_step 2 --sym C1 --norm --scale --gpu 0:1:2:3 --pool 100 --dont_combine_weights_via_disc --j 6
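For readability, here is the same command split across lines with the process and GPU options annotated (functionally identical to the one-liner above):

```shell
# 5 MPI processes = 1 master + 4 slaves; each slave drives one GTX 1080 Ti
# (--gpu 0:1:2:3) with 6 threads (--j 6), matching the setup report below
mpirun -n 5 "$(which relion_refine_mpi)" \
  --i Particles/shiny_2sets.star --o Class3D/ \
  --ref emd_2660.map:mrc --firstiter_cc --ini_high 60 \
  --ctf --ctf_corrected_ref \
  --iter 25 --tau2_fudge 4 --K 6 --particle_diameter 360 \
  --flatten_solvent --zero_mask \
  --oversampling 1 --healpix_order 2 --offset_range 5 --offset_step 2 \
  --sym C1 --norm --scale \
  --gpu 0:1:2:3 --pool 100 --dont_combine_weights_via_disc --j 6
```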

 

 

This seems to be something to do with our RELION/MPI configuration: if I use the relion_refine executable instead of relion_refine_mpi, processing proceeds to the next iteration step.

 

I tried downgrading to Open MPI v1.10 and recompiling RELION, but this still gives the same error. I cannot downgrade to CUDA 8.0, which is the working setup I have on an older workstation.
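In case it is useful, a sketch of how one can make sure such a rebuild really picks up the alternative Open MPI rather than the system one (the /opt/openmpi-1.10 prefix and core count are only examples, adjust to your install):

```shell
# Example prefix for the alternative Open MPI install (hypothetical path)
MPI_HOME=/opt/openmpi-1.10

# Put the matching wrappers and runtime first in the environment
export PATH="$MPI_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$MPI_HOME/lib:$LD_LIBRARY_PATH"

# Configure from a clean build directory so CMake re-detects MPI,
# pointing it explicitly at the wrappers of the chosen install
cd relion && rm -rf build && mkdir build && cd build
cmake -DMPI_C_COMPILER="$MPI_HOME/bin/mpicc" \
      -DMPI_CXX_COMPILER="$MPI_HOME/bin/mpicxx" \
      -DCUDA_ARCH=61 ..   # GTX 1080 Ti is compute capability 6.1
make -j 16
```

A stale build directory is a common way to end up with headers from one MPI and a runtime from another, which can produce exactly this kind of invalid-datatype error.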

 

Any thoughts/suggestions will be much appreciated.

 

Thanks,

Eugene

 

-------------------

Eugene Valkov, Ph.D.

Project Leader

 

Department of Biochemistry

Max Planck Institute for Developmental Biology

Max-Planck-Ring 5

72076 Tübingen

GERMANY

 

Phone:  +49 7071 601 1357

Email: [log in to unmask]

Web: https://www.eb.tuebingen.mpg.de/biochemistry/eugene-valkov/

 

 

 

 === RELION MPI setup ===

 + Number of MPI processes             = 5

 + Number of threads per MPI process  = 6

 + Total number of threads therefore  = 30

 + Master  (0) runs on host            = em-cryo-pc1

 + Slave     1 runs on host            = em-cryo-pc1

 + Slave     2 runs on host            = em-cryo-pc1

 + Slave     3 runs on host            = em-cryo-pc1

 + Slave     4 runs on host            = em-cryo-pc1

 =================

 uniqueHost em-cryo-pc1 has 4 ranks.

 Slave 1 will distribute threads over devices  0

 Thread 0 on slave 1 mapped to device 0

 Thread 1 on slave 1 mapped to device 0

 Thread 2 on slave 1 mapped to device 0

 Thread 3 on slave 1 mapped to device 0

 Thread 4 on slave 1 mapped to device 0

 Thread 5 on slave 1 mapped to device 0

 Slave 2 will distribute threads over devices  1

 Thread 0 on slave 2 mapped to device 1

 Thread 1 on slave 2 mapped to device 1

 Thread 2 on slave 2 mapped to device 1

 Thread 3 on slave 2 mapped to device 1

 Thread 4 on slave 2 mapped to device 1

 Thread 5 on slave 2 mapped to device 1

 Slave 3 will distribute threads over devices  2

 Thread 0 on slave 3 mapped to device 2

 Thread 1 on slave 3 mapped to device 2

 Thread 2 on slave 3 mapped to device 2

 Thread 3 on slave 3 mapped to device 2

 Thread 4 on slave 3 mapped to device 2

 Thread 5 on slave 3 mapped to device 2

 Slave 4 will distribute threads over devices  3

 Thread 0 on slave 4 mapped to device 3

 Thread 1 on slave 4 mapped to device 3

 Thread 2 on slave 4 mapped to device 3

 Thread 3 on slave 4 mapped to device 3

 Thread 4 on slave 4 mapped to device 3

 Thread 5 on slave 4 mapped to device 3

 Running CPU instructions in double precision.

 Estimating initial noise spectra

1.82/1.82 min ............................................................~~(,_,">

  1: MPI_ERR_TYPE: invalid datatype
  1: MPI_ERR_TYPE: invalid datatype
terminate called after throwing an instance of 'RelionError'
[em-cryo-pc1:09935] *** Process received signal ***
[em-cryo-pc1:09935] Signal: Aborted (6)
[em-cryo-pc1:09935] Signal code:  (-6)
[em-cryo-pc1:09935] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7f3f231cd890]
[em-cryo-pc1:09935] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f3f22862e97]
[em-cryo-pc1:09935] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f3f22864801]
[em-cryo-pc1:09935] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8c8fb)[0x7f3f22eb98fb]
[em-cryo-pc1:09935] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92d3a)[0x7f3f22ebfd3a]
[em-cryo-pc1:09935] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92d95)[0x7f3f22ebfd95]
[em-cryo-pc1:09935] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92fe8)[0x7f3f22ebffe8]
[em-cryo-pc1:09935] [ 7] /soft/relion/build/lib/librelion_lib.so(_ZN7MpiNode16report_MPI_ERROREi+0x1ee)[0x7f3f238f7dee]
[em-cryo-pc1:09935] *** End of error message ***

  2: MPI_ERR_TYPE: invalid datatype
  2: MPI_ERR_TYPE: invalid datatype
terminate called after throwing an instance of 'RelionError'
[em-cryo-pc1:09936] *** Process received signal ***
[em-cryo-pc1:09936] Signal: Aborted (6)
[em-cryo-pc1:09936] Signal code:  (-6)
[em-cryo-pc1:09936] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7f7cf9f44890]
[em-cryo-pc1:09936] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f7cf95d9e97]
[em-cryo-pc1:09936] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f7cf95db801]
[em-cryo-pc1:09936] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8c8fb)[0x7f7cf9c308fb]
[em-cryo-pc1:09936] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92d3a)[0x7f7cf9c36d3a]
[em-cryo-pc1:09936] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92d95)[0x7f7cf9c36d95]
[em-cryo-pc1:09936] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92fe8)[0x7f7cf9c36fe8]
[em-cryo-pc1:09936] [ 7] /soft/relion/build/lib/librelion_lib.so(_ZN7MpiNode16report_MPI_ERROREi+0x1ee)[0x7f7cfa66edee]
[em-cryo-pc1:09936] *** End of error message ***

  3: MPI_ERR_TYPE: invalid datatype
  3: MPI_ERR_TYPE: invalid datatype
terminate called after throwing an instance of 'RelionError'
[em-cryo-pc1:09937] *** Process received signal ***
[em-cryo-pc1:09937] Signal: Aborted (6)
[em-cryo-pc1:09937] Signal code:  (-6)
[em-cryo-pc1:09937] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7fdd03ddc890]
[em-cryo-pc1:09937] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7fdd03471e97]
[em-cryo-pc1:09937] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7fdd03473801]
[em-cryo-pc1:09937] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8c8fb)[0x7fdd03ac88fb]
[em-cryo-pc1:09937] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92d3a)[0x7fdd03aced3a]
[em-cryo-pc1:09937] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92d95)[0x7fdd03aced95]
[em-cryo-pc1:09937] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92fe8)[0x7fdd03acefe8]
[em-cryo-pc1:09937] [ 7] /soft/relion/build/lib/librelion_lib.so(_ZN7MpiNode16report_MPI_ERROREi+0x1ee)[0x7fdd04506dee]
[em-cryo-pc1:09937] *** End of error message ***

  4: MPI_ERR_TYPE: invalid datatype
  4: MPI_ERR_TYPE: invalid datatype
terminate called after throwing an instance of 'RelionError'
[em-cryo-pc1:09938] *** Process received signal ***
[em-cryo-pc1:09938] Signal: Aborted (6)
[em-cryo-pc1:09938] Signal code:  (-6)
[em-cryo-pc1:09938] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7fcea4fc3890]
[em-cryo-pc1:09938] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7fcea4658e97]
[em-cryo-pc1:09938] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7fcea465a801]
[em-cryo-pc1:09938] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8c8fb)[0x7fcea4caf8fb]
[em-cryo-pc1:09938] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92d3a)[0x7fcea4cb5d3a]
[em-cryo-pc1:09938] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92d95)[0x7fcea4cb5d95]
[em-cryo-pc1:09938] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92fe8)[0x7fcea4cb5fe8]
[em-cryo-pc1:09938] [ 7] /soft/relion/build/lib/librelion_lib.so(_ZN7MpiNode16report_MPI_ERROREi+0x1ee)[0x7fcea56eddee]
[em-cryo-pc1:09938] *** End of error message ***

-------------------------------------------------------

Primary job  terminated normally, but 1 process returned

a non-zero exit code. Per user-direction, the job has been aborted.

-------------------------------------------------------

--------------------------------------------------------------------------

mpirun noticed that process rank 4 with PID 0 on node em-cryo-pc1 exited on signal 6 (Aborted).

--------------------------------------------------------------------------

 


To unsubscribe from the CCPEM list, click the following link:
https://www.jiscmail.ac.uk/cgi-bin/webadmin?SUBED1=CCPEM&A=1


The Institute of Cancer Research: Royal Cancer Hospital, a charitable Company Limited by Guarantee, Registered in England under Company No. 534147 with its Registered Office at 123 Old Brompton Road, London SW7 3RP.

This e-mail message is confidential and for use by the addressee only. If the message is received by anyone other than the addressee, please return the message to the sender by replying to it and then delete the message from your computer and network.

