Dear all,

We are in the process of testing a new workstation for cryoEM work.

Hardware:
AMD Ryzen Threadripper 1950X, 16-core, 3.40 GHz
4 x GeForce GTX 1080 Ti
128 GB (8 x 16 GB) DDR4 2666 MHz
250 GB Samsung SSD

Software:
Ubuntu 18.04.1 LTS 64-bit
Open MPI 3.1.1
CUDA 9.2
RELION 2.1 (git version)

With the Plasmodium ribosome dataset, RELION reproducibly terminates right
after completing the first step (estimating initial noise spectra), with
"MPI_ERR_TYPE: invalid datatype" reported by each slave rank. The full error
output is at the end of this message.

The following command line arguments were used:

mpirun -n 5 `which relion_refine_mpi` \
  --i Particles/shiny_2sets.star --o Class3D/ \
  --ref emd_2660.map:mrc --firstiter_cc --ini_high 60 \
  --ctf --ctf_corrected_ref --iter 25 --tau2_fudge 4 \
  --particle_diameter 360 --K 6 --flatten_solvent --zero_mask \
  --oversampling 1 --healpix_order 2 --offset_range 5 --offset_step 2 \
  --sym C1 --norm --scale --gpu 0:1:2:3 --pool 100 \
  --dont_combine_weights_via_disc --j 6
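
For reference, the rank layout this command requests can be sketched as below. This is only an illustration of how the --gpu string is read here, assuming colons separate the device lists of successive slave ranks (rank 0 being the master, which gets no GPU); gpu_map is a hypothetical helper and not part of RELION.

```shell
# Hypothetical helper (not a RELION tool): print which device list each
# slave rank would receive from a colon-separated --gpu specification.
# Assumption: rank 0 is the master and takes no GPU, so numbering starts at 1.
gpu_map() {
  rank=1
  # Split the specification on ':' into one device list per line.
  printf '%s\n' "$1" | tr ':' '\n' | while read -r devs; do
    printf 'slave %d -> device(s) %s\n' "$rank" "$devs"
    rank=$((rank + 1))
  done
}

gpu_map "0:1:2:3"
```

With -n 5 this gives one master plus four slaves, one per GTX 1080 Ti, which agrees with the "mapped to device" lines in the log at the end of this message.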


This appears to be related to our RELION/MPI configuration: if I use the
relion_refine executable instead of relion_refine_mpi, processing proceeds
to the next iteration step.

I downgraded to Open MPI v1.10 and recompiled RELION, but this still gives
the same error. I cannot downgrade to CUDA 8.0, which is what I have in a
working setup on an older workstation.

Any thoughts/suggestions will be much appreciated.

Thanks,
Eugene

-------------------
Eugene Valkov, Ph.D.
Project Leader

Department of Biochemistry
Max Planck Institute for Developmental Biology
Max-Planck-Ring 5
72076 Tübingen
GERMANY

Phone:  +49 7071 601 1357
Email: [log in to unmask]
Web: https://www.eb.tuebingen.mpg.de/biochemistry/eugene-valkov/



 === RELION MPI setup ===
 + Number of MPI processes             = 5
 + Number of threads per MPI process  = 6
 + Total number of threads therefore  = 30
 + Master  (0) runs on host            = em-cryo-pc1
 + Slave     1 runs on host            = em-cryo-pc1
 + Slave     2 runs on host            = em-cryo-pc1
 + Slave     3 runs on host            = em-cryo-pc1
 + Slave     4 runs on host            = em-cryo-pc1
 =================
 uniqueHost em-cryo-pc1 has 4 ranks.
 Slave 1 will distribute threads over devices  0
 Thread 0 on slave 1 mapped to device 0
 Thread 1 on slave 1 mapped to device 0
 Thread 2 on slave 1 mapped to device 0
 Thread 3 on slave 1 mapped to device 0
 Thread 4 on slave 1 mapped to device 0
 Thread 5 on slave 1 mapped to device 0
 Slave 2 will distribute threads over devices  1
 Thread 0 on slave 2 mapped to device 1
 Thread 1 on slave 2 mapped to device 1
 Thread 2 on slave 2 mapped to device 1
 Thread 3 on slave 2 mapped to device 1
 Thread 4 on slave 2 mapped to device 1
 Thread 5 on slave 2 mapped to device 1
 Slave 3 will distribute threads over devices  2
 Thread 0 on slave 3 mapped to device 2
 Thread 1 on slave 3 mapped to device 2
 Thread 2 on slave 3 mapped to device 2
 Thread 3 on slave 3 mapped to device 2
 Thread 4 on slave 3 mapped to device 2
 Thread 5 on slave 3 mapped to device 2
 Slave 4 will distribute threads over devices  3
 Thread 0 on slave 4 mapped to device 3
 Thread 1 on slave 4 mapped to device 3
 Thread 2 on slave 4 mapped to device 3
 Thread 3 on slave 4 mapped to device 3
 Thread 4 on slave 4 mapped to device 3
 Thread 5 on slave 4 mapped to device 3
 Running CPU instructions in double precision.
 Estimating initial noise spectra
1.82/1.82 min
............................................................~~(,_,">
  1: MPI_ERR_TYPE: invalid datatype
  1: MPI_ERR_TYPE: invalid datatype
terminate called after throwing an instance of 'RelionError'
[em-cryo-pc1:09935] *** Process received signal ***
[em-cryo-pc1:09935] Signal: Aborted (6)
[em-cryo-pc1:09935] Signal code:  (-6)
  2: MPI_ERR_TYPE: invalid datatype
  2: MPI_ERR_TYPE: invalid datatype
terminate called after throwing an instance of 'RelionError'
[em-cryo-pc1:09936] *** Process received signal ***
[em-cryo-pc1:09936] Signal: Aborted (6)
[em-cryo-pc1:09936] Signal code:  (-6)
[em-cryo-pc1:09936] [ 0]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7f7cf9f44890]
[em-cryo-pc1:09936] [ 1]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f7cf95d9e97]
[em-cryo-pc1:09936] [ 2]   3: MPI_ERR_TYPE: invalid datatype
  3: MPI_ERR_TYPE: invalid datatype
terminate called after throwing an instance of 'RelionError'
[em-cryo-pc1:09937] *** Process received signal ***
[em-cryo-pc1:09937] Signal: Aborted (6)
[em-cryo-pc1:09937] Signal code:  (-6)
[em-cryo-pc1:09937] [ 0]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7fdd03ddc890]
[em-cryo-pc1:09937] [ 1]   4: MPI_ERR_TYPE: invalid datatype
  4: MPI_ERR_TYPE: invalid datatype
terminate called after throwing an instance of 'RelionError'
[em-cryo-pc1:09938] *** Process received signal ***
[em-cryo-pc1:09938] Signal: Aborted (6)
[em-cryo-pc1:09938] Signal code:  (-6)
[em-cryo-pc1:09938] [ 0]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7fcea4fc3890]
[em-cryo-pc1:09938] [ 1]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7fcea4658e97]
[em-cryo-pc1:09938] [ 2] [em-cryo-pc1:09935] [ 0]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7f3f231cd890]
[em-cryo-pc1:09935] [ 1]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f3f22862e97]
[em-cryo-pc1:09935] [ 2]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f7cf95db801]
[em-cryo-pc1:09936] [ 3]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8c8fb)[0x7f7cf9c308fb]
[em-cryo-pc1:09936] [ 4]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7fdd03471e97]
[em-cryo-pc1:09937] [ 2]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7fdd03473801]
[em-cryo-pc1:09937] [ 3]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7fcea465a801]
[em-cryo-pc1:09938] [ 3]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8c8fb)[0x7fcea4caf8fb]
[em-cryo-pc1:09938] [ 4]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92d3a)[0x7fcea4cb5d3a]
[em-cryo-pc1:09938] [ 5]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8c8fb)[0x7fdd03ac88fb]
[em-cryo-pc1:09937] [ 4]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92d3a)[0x7fdd03aced3a]
[em-cryo-pc1:09937]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f3f22864801]
[em-cryo-pc1:09935] [ 3]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8c8fb)[0x7f3f22eb98fb]
[em-cryo-pc1:09935] [ 4]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92d3a)[0x7f7cf9c36d3a]
[em-cryo-pc1:09936] [ 5]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92d95)[0x7f7cf9c36d95]
[em-cryo-pc1:09936]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92d95)[0x7fcea4cb5d95]
[em-cryo-pc1:09938] [ 6] [ 5]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92d95)[0x7fdd03aced95]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92d3a)[0x7f3f22ebfd3a]
[em-cryo-pc1:09935] [ 5] [ 6]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92d95)[0x7f3f22ebfd95]
[em-cryo-pc1:09935] [ 6] [em-cryo-pc1:09937] [ 6]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92fe8)[0x7f7cf9c36fe8]
[em-cryo-pc1:09936] [ 7]
/soft/relion/build/lib/librelion_lib.so(_ZN7MpiNode16report_MPI_ERROREi+0x1ee)[0x7f7cfa66edee]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92fe8)[0x7fdd03acefe8]
[em-cryo-pc1:09937] [ 7]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92fe8)[0x7fcea4cb5fe8]
[em-cryo-pc1:09938] [ 7]
/soft/relion/build/lib/librelion_lib.so(_ZN7MpiNode16report_MPI_ERROREi+0x1ee)[0x7fcea56eddee]
[em-cryo-pc1:09938] *** End of error message ***
[em-cryo-pc1:09936] *** End of error message ***
/soft/relion/build/lib/librelion_lib.so(_ZN7MpiNode16report_MPI_ERROREi+0x1ee)[0x7fdd04506dee]
[em-cryo-pc1:09937] *** End of error message ***
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92fe8)[0x7f3f22ebffe8]
[em-cryo-pc1:09935] [ 7]
/soft/relion/build/lib/librelion_lib.so(_ZN7MpiNode16report_MPI_ERROREi+0x1ee)[0x7f3f238f7dee]
[em-cryo-pc1:09935] *** End of error message ***
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 4 with PID 0 on node em-cryo-pc1 exited on
signal 6 (Aborted).
--------------------------------------------------------------------------
