Dear all,
I have encountered a presumably MPI-related problem whilst testing Relion 1.4 with SGI’s version of MPI (MPT 2.09) on an SGI UV cluster. The following error message appears:
“MPT Warning: Could not allocate an internal send buffer in the last 30 seconds on rank 2 at 22:09:49. Try increasing MPI_BUFS_PER_PROC. Alternatively, destination rank 4 on host uv3 may be running slowly.”
Some details:
- The message appears during the 3D refinement of a particular test data set. The job starts well but consistently hangs at the beginning of iteration 14, regardless of how the job is set up with respect to numbers of MPI processes and threads. In all cases the amount of RAM per MPI process should be sufficient.
- Increasing MPI_BUFS_PER_PROC (default = 32) to 128 or 1024 did not solve the problem.
- 2D classification jobs using the same data set finish without error. 2D/3D classification and 3D refinement jobs using other data sets also finish without error.
- The problematic 3D refinement job finishes without error on a 32-core workstation with 64 GB of RAM, so I do not think that there is a problem with the actual Relion command, files or directories.
Is this message related to the MPISend error discussed here and on the Relion wiki? If so, am I correct in thinking that the src/ml_optimiser_mpi.cpp file needs to be edited and Relion needs to be recompiled?
Many thanks in advance for any help and suggestions,
Rob