Hi Mani,
It sounds like you have filled your disk with the temporary files that are
written out at the end of each iteration.
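You can check with standard tools, for example (the Refine3D/140312_2 path
below is taken from your --o option, so adjust it if your output goes
elsewhere):

$ df -h Refine3D/140312_2
$ ls -lh Refine3D/140312_2/run1_rank*.tmp

If that filesystem is (nearly) full, free up some space or point --o to a
larger disk before continuing the run.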
HTH, S
On 04/01/2014 12:16 PM, Manikandan KARUPPASAMY wrote:
> Hi all,
>
> We are getting the following error during a relion auto-refine run. It
> happens only after the 8th iteration.
>
> The command used:
> $ mpirun -np 188 -machinefile /user/mbotte/.openmpi-farm-hostfile
> `which relion_refine_mpi` --o Refine3D/140312_2/run1 --i
> relion/140303_input/all_images_ori.star --particle_diameter 180
> --angpix 1.36 --ref relion/Refine3D/140305/run1_class001.mrc
> --flatten_solvent --sym C1 --oversampling 1 --auto_refine
> --split_random_halves --low_resol_join_halves 40 --healpix_order 3
> --offset_range 10 --offset_step 2 --auto_local_healpix_order 5 --norm
> --scale --j 1
>
> -------------------
> Expectation iteration 8
> 28.61/28.61 hrs
> ............................................................~~(,_,">
> MultidimArray::write: File Refine3D/140312_2/run1_rank000187.tmp
> cannot be opened for output
> File: ./src/multidim_array.h line: 3945
> --------------------------------------------------------------------------
>
> MPI_ABORT was invoked on rank 187 in communicator MPI_COMM_WORLD
> with errorcode 1.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
>
> [sky50][[24417,1],123][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [sky57][[24417,1],171][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [sky56][[24417,1],155][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> --------------------------------------------------------------------------
>
> mpirun has exited due to process rank 187 with PID 11357 on
> node sky58 exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
>
> ---------------------
>
> Please help us to fix this.
>
> with kind regards
>
> Mani.
--
Sjors Scheres
MRC Laboratory of Molecular Biology
Francis Crick Avenue, Cambridge Biomedical Campus
Cambridge CB2 0QH, U.K.
tel: +44 (0)1223 267061
http://www2.mrc-lmb.cam.ac.uk/groups/scheres