Hi,

Please forgive a somewhat lengthy post; I'm not sure which of this information may be relevant. We recently updated several RHEL6 systems to RELION 2.1 with no issues. However, I'm having all kinds of problems trying to get 2.1, and specifically mpi, working properly on a new quad gpu system running Ubuntu 16.04.

1) Is 2.1 supported on 16.04? I've assumed so, but I notice that on http://www2.mrc-lmb.cam.ac.uk/relion/index.php/Download_%26_install the "Installation using apt-get" section still refers to an earlier release. Is this just because no 2.1 package has been created yet, or is there actually some known issue?

If I ignore the apt-get package and build from source, everything seems to go fine. I don't see any obvious errors during configuration, compilation, or installation. But when I try to run using mpi, things behave badly. E.g. if I try to run MotionCor2 with 5 mpi procs and all 4 gpus, I get multiple errors like:
ERROR in renaming: MotionCorr/job003/Micrographs/foo_1_Stk.mrc to MotionCorr/job003/Micrographs/foo_1_movie.mrcs
(Okay, the name isn't really "foo". It's long, so I shortened it for clarity.) There seem to be exactly 4 of these errors for each file. I.e. it seems like each of the 5 mpi procs is trying to process every file, rather than splitting up the work, and they're stepping on each other. If I just use 1 mpi rank but tell it to use all 4 gpus, things run normally (although nvidia-smi never seems to show any gpu other than the last being used).
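
To make the suspected failure mode concrete, here is a minimal sketch in plain Python (no MPI required). It assumes a simple round-robin split of files across ranks, which is only an illustration and not necessarily how RELION schedules its work; the function name and file names are hypothetical. It shows that if every process believes it is rank 0 of 1, which can happen when the launcher and the linked MPI library come from different implementations, each process ends up handling every file.

```python
def files_for_rank(files, rank, size):
    """Round-robin split: rank r takes every size-th file, starting at index r."""
    return [f for i, f in enumerate(files) if i % size == rank]


# Hypothetical file names, standing in for the real (much longer) ones.
files = [f"foo_{i}_Stk.mrc" for i in range(1, 9)]

# Healthy run: 4 ranks that each know their own rank and the total size.
healthy = [files_for_rank(files, rank, 4) for rank in range(4)]

# Broken run: 4 processes that each believe they are rank 0 of 1.
broken = [files_for_rank(files, 0, 1) for _ in range(4)]

if __name__ == "__main__":
    for rank, work in enumerate(healthy):
        print(f"healthy rank {rank}: {work}")
    for proc, work in enumerate(broken):
        print(f"broken proc {proc}: {len(work)} of {len(files)} files")
```

In the healthy case each file is handled exactly once. In the broken case every process works on the full list, so concurrent renames and deletions of the same output files would be expected to collide.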

Similarly, if I try to run ctffind (latest 4.1.5) with 8 mpi ranks, I get fatal file errors, e.g.
00:26:05: Error: File 'CtfFind/job003/Micrographs/foo_004.txt' couldn't be removed (error 2: No such file or directory)
and eventually a
ERROR: Unexpected number of words on data line below Columns line in CtfFind/job003/Micrographs/foo_055.txt
File: /home/heumannj/relion/src/ctffind_runner.cpp line: 803

For what it's worth, both openmpi and mpich are installed on this system. Could this all reflect mpi version incompatibilities? I've tried removing mpich, but then cmake seems unable to find the required mpi compile tools.
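
With two MPI implementations installed, one thing worth checking is which one the usual wrapper commands actually resolve to. The following is a rough sketch using only the Python standard library. It assumes the common wrapper names `mpirun`, `mpiexec`, `mpicc`, and `mpicxx`, and that they may be symlinks (as is typical when a distribution manages alternatives); the function name `resolve_mpi_tools` is made up for this example.

```python
import os
import shutil


def resolve_mpi_tools(names=("mpirun", "mpiexec", "mpicc", "mpicxx")):
    """Map each tool name to the real path it resolves to, or None if not on PATH."""
    resolved = {}
    for name in names:
        path = shutil.which(name)  # first match on PATH, or None
        # Follow symlinks so the underlying implementation's path is visible.
        resolved[name] = os.path.realpath(path) if path else None
    return resolved


if __name__ == "__main__":
    for name, target in resolve_mpi_tools().items():
        print(f"{name:8s} -> {target}")
```

If the launcher and the compiler wrappers resolve to different implementations (say, one path mentioning openmpi and another mentioning mpich), that would be consistent with a binary built against one MPI being started by the other's launcher.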

Sorry if this is an obvious or known issue. I have very little experience with mpi builds, so I'm sort of stumbling around in the dark.

Thanks in advance for any suggestions!

Regards,
-jh-