In relion-2, the default (at least in the GUI) will actually change to not
using the disk. We originally went from MPI messages to default disk-I/O
because we had bugs on the network cards of our (now older) cluster.
Recently, we haven't observed those problems anymore and MPI messages seem
to be much faster.
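
For completeness, a rough sketch of how that option is passed on the command
line (all options apart from --dont_combine_weights_via_disc are placeholders
for your own refinement settings, not taken from this thread):

  mpirun -n 32 relion_refine_mpi \
      --i particles.star --o Refine3D/run1 \
      --ref ref.mrc --auto_refine --split_random_halves \
      --dont_combine_weights_via_disc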
HTH,
Sjors
> Thanks Sjors,
>
> You once again saved the day.
>
> We have now run a bunch of tests and your
> "—dont_combine_weights_via_disc” setting did the trick.
>
> We are very much looking forward to getting our new parallel cluster file system,
> but until then we will use this setting for multi node RELION jobs.
>
> Thank you so much again.
>
> //Jesper
>
> ------------------------------------
> Jesper Lykkegaard Karlsen
> Scientific Computing
> Centre for Structural Biology
> Department of Molecular Biology and Genetics
> Aarhus University
> Gustav Wieds Vej 10C
> 8000 Aarhus C
>
> E-mail: [log in to unmask]
> Tlf: 50906203
>
> On 22 Jun 2016, at 14:53, Sjors Scheres
> <[log in to unmask]> wrote:
>
> If disk access is a problem somehow (and it seems it is), one can also
> try
> --dont_combine_weights_via_disc. That will use MPI messages instead of
> writing to disk.
> HTH,
> Sjors
> hi,
>
> could also be a quota, even if "df" shows enough free space.
>
> things i would check now (a few example commands are sketched below):
> - are all nodes & storage in time sync via ntp?
> - do all nodes have the same mount/cache options (e.g. for nfs, when the
> client closes the file and commits to the fileserver)?
> - are all nodes of the same architecture (we sometimes used relion in
> parallel on nodes with different cpu speeds and never had this error)?
> - is it always the same node and the same file that has the problem?
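>
> A rough sketch of such checks (assuming standard tools are installed on the
> nodes; adapt the names to your own setup):
>
>   # user/group quota on the shared filesystem, if quotas are enabled
>   quota -s
>   # time synchronisation status (chrony or ntp, depending on the node)
>   chronyc tracking || ntpstat
>   # NFS mount options as seen by each client
>   nfsstat -m
>   # CPU model, to spot mixed hardware
>   lscpu | grep 'Model name'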
>
> cheers,
> wolfgang
>
>
>
> On 06/22/2016 01:55 PM, Ludovic Renault wrote:
> Have you run out of space?
> I've seen this type of error when hard drives are full.
> Seems too easy but that would be my first thought.
>
> Ludo
>
> On Wed, Jun 22, 2016 at 1:41 PM, Jesper Lykkegaard Karlsen
> <[log in to unmask]> wrote:
>
> Hi all,
>
> I have been struggling with a weird RELION problem on our CPU
> cluster (CentOS 7).
>
> The cluster uses the SLURM queuing system (v.15.08.10).
>
> When running RELION 3D_refine jobs over multiple nodes I get this
> error after a few iterations:
>
> MultidimArray::read: File Class3D/run3_rank000001.tmp not found
> File: ./src/multidim_array.h line: 3936
>
> Looking in the Class3D folder, it is clear that
> "run3_rank000001.tmp" gets created 26 seconds later than the rest
> of the tmp files.
>
> $ ls -rt | grep .tmp | xargs stat -c '%n %y' | awk '{print $1,$3}' | tail -n 10
> run3_rank000025.tmp 10:47:12.095859492
> run3_rank000020.tmp 10:47:12.095859492
> run3_rank000002.tmp 10:47:12.098859492
> run3_rank000027.tmp 10:47:12.099859492
> run3_rank000022.tmp 10:47:12.099859492
> run3_rank000005.tmp 10:47:12.099859492
> run3_rank000023.tmp 10:47:12.100859492
> run3_rank000016.tmp 10:47:12.101859492
> run3_rank000019.tmp 10:47:12.102859492
> run3_rank000001.tmp 10:47:38.072860845
>
> I thought at first this could be a configuration issue, but by now
> I have tried a lot of combinations, using different versions of
> OpenMPI compiled with different versions of GCC, and I even
> disabled TSO and GSO on the network cards. Nothing has worked
> so far.
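>
> (A sketch of the offload changes, in case it helps anyone reproduce them;
> the interface name here is just a placeholder for the actual NIC:)
>
>   # show current offload settings
>   ethtool -k eth0
>   # turn off TCP segmentation offload and generic segmentation offload
>   ethtool -K eth0 tso off gso off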
>
> Since it consistently seems that "run3_rank000001.tmp" gets
> written last, and apparently only after RELION has crashed, I
> thought I might dare to ask you guys, even though this looks like a
> sysadmin-related question: is it possible that RELION could
> have a bug?
>
> Has anyone seen anything similar and if so did you figure out a
> solution?
>
> Cheers,
> Jesper
>
>
>
>
> --
> Universitätsklinikum Hamburg-Eppendorf (UKE)
> @ Centre for Structural Systems Biology (CSSB)
> @ Institute of Molecular Biotechnology (IMBA)
> Dr. Bohr-Gasse 3-7 (Room 6.14)
> 1030 Vienna, Austria
> Tel.: +43 (1) 790 44-4649
> Email: [log in to unmask]
> http://www.cssb-hamburg.de/
>
>
>
>
> --
> Sjors Scheres
> MRC Laboratory of Molecular Biology
> Francis Crick Avenue, Cambridge Biomedical Campus
> Cambridge CB2 0QH, U.K.
> tel: +44 (0)1223 267061
> http://www2.mrc-lmb.cam.ac.uk/groups/scheres
>
>
--
Sjors Scheres
MRC Laboratory of Molecular Biology
Francis Crick Avenue, Cambridge Biomedical Campus
Cambridge CB2 0QH, U.K.
tel: +44 (0)1223 267061
http://www2.mrc-lmb.cam.ac.uk/groups/scheres