Hi,

I was trying to run a subtomogram 3D refinement with a box size of 68^3 pixels and got the following error:

[ai-rmlwort:10089] Signal: Segmentation fault (11)
[ai-rmlwort:10089] Signal code: Address not mapped (1)
[ai-rmlwort:10089] Failing at address: 0x80
[ai-rmlwort:10089] [ 0] /lib64/libpthread.so.0(+0xf7e0)[0x7fa04c56a7e0]
[ai-rmlwort:10089] [ 1] /usr/mpi/gcc/openmpi-1.8.4/lib64/openmpi/mca_oob_tcp.so(+0xa7c4)[0x7fa04ab9e7c4]
[ai-rmlwort:10089] [ 2] /usr/mpi/gcc/openmpi-1.8.4/lib64/openmpi/mca_oob_tcp.so(mca_oob_tcp_peer_recv_connect_ack+0x7c)[0x7fa04ab9ed3c]
[ai-rmlwort:10089] [ 3] /usr/mpi/gcc/openmpi-1.8.4/lib64/openmpi/mca_oob_tcp.so(+0x6479)[0x7fa04ab9a479]
[ai-rmlwort:10089] [ 4] /usr/mpi/gcc/openmpi-1.8.4/lib64/libopen-pal.so.6(opal_libevent2021_event_base_loop+0x53c)[0x7fa04d27d03c]
[ai-rmlwort:10089] [ 5] /usr/mpi/gcc/openmpi-1.8.4/lib64/libopen-rte.so.7(orte_daemon+0xec3)[0x7fa04d5249f3]
[ai-rmlwort:10089] [ 6] /usr/mpi/gcc/openmpi-1.8.4/bin/orted(main+0x66)[0x400906]
[ai-rmlwort:10089] [ 7] /lib64/libc.so.6(__libc_start_main+0xfd)[0x7fa04c1e5d5d]
[ai-rmlwort:10089] [ 8] /usr/mpi/gcc/openmpi-1.8.4/bin/orted[0x4007d9]
[ai-rmlwort:10089] *** End of error message ***
bash: line 1: 10089 Segmentation fault      (core dumped) /usr/mpi/gcc/openmpi-1.8.4/bin/orted --hnp-topo-sig 2N:2S:2L3:16L2:16L1:16C:32H:x86_64 -mca ess "env" -mca orte_ess_jobid "3478847488" -mca orte_ess_vpid 4 -mca orte_ess_num_procs "6" -mca orte_hnp_uri "3478847488.0;tcp://192.168.110.3,192.168.111.3,192.168.112.3,192.168.113.3,128.231.15.163:54200" --tree-spawn -mca plm "rsh"
[ai-rmlwheat:17826] *** Process received signal ***
[ai-rmlwheat:17826] Signal: Segmentation fault (11)
[ai-rmlwheat:17826] Signal code: Address not mapped (1)
[ai-rmlwheat:17826] Failing at address: 0x80
[ai-rmlwheat:17826] [ 0] /lib64/libpthread.so.0(+0xf7e0)[0x7fb1f3a9f7e0]
[ai-rmlwheat:17826] [ 1] /usr/mpi/gcc/openmpi-1.8.4/lib64/openmpi/mca_oob_tcp.so(+0xa7c4)[0x7fb1f1ece7c4]
[ai-rmlwheat:17826] [ 2] /usr/mpi/gcc/openmpi-1.8.4/lib64/openmpi/mca_oob_tcp.so(mca_oob_tcp_peer_recv_connect_ack+0x7c)[0x7fb1f1eced3c]
[ai-rmlwheat:17826] [ 3] /usr/mpi/gcc/openmpi-1.8.4/lib64/openmpi/mca_oob_tcp.so(+0x6479)[0x7fb1f1eca479]
[ai-rmlwheat:17826] [ 4] /usr/mpi/gcc/openmpi-1.8.4/lib64/libopen-pal.so.6(opal_libevent2021_event_base_loop+0x53c)[0x7fb1f47b203c]
[ai-rmlwheat:17826] [ 5] mpirun(orterun+0x1238)[0x404efc]
[ai-rmlwheat:17826] [ 6] mpirun(main+0x20)[0x4038f4]
[ai-rmlwheat:17826] [ 7] /lib64/libc.so.6(__libc_start_main+0xfd)[0x7fb1f371ad5d]
[ai-rmlwheat:17826] [ 8] mpirun[0x403819]
[ai-rmlwheat:17826] *** End of error message ***
Command terminated by signal 11
        Command being timed: "mpirun -hostfile /gs0/home/hansenbry/mpihosts -n 12 /usr/bin/relion_refine_mpi --o Refine3D/run1 --auto_refine --split_random_halves --i subtomo.star --particle_diameter 150 --angpix 7.366 --ini_high 40 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 --sym C1 --low_resol_join_halves 40 --norm --scale --j 16 --memory_per_thread 6"
        User time (seconds): 7208072.86
        System time (seconds): 407.09
        Percent of CPU this job got: 1560%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 128:20:00
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 2052024
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 1409546
        Voluntary context switches: 18731729
        Involuntary context switches: 123509266
        Swaps: 0
        File system inputs: 0
        File system outputs: 8
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0


I can't see why the run crashed after 128 hours, before it had even finished the first iteration. Thanks for any tips on what I did to cause this error.

*********************************

Bryan Hansen

Electron Microscopist

NIH/NIAID/RTB/RML/EMU

Rocky Mountain Laboratories

903 S. 4th St

Hamilton, MT 59840

406.363.9202

240.669.5462