Dear Sjors,
We have the possibility to limit the resolution in classification mode with --strict_highres_exp and with --coarse_size.
Is there a way for these settings to be taken into account in the auto_refine mode?
That way we could potentially limit the resolution of the last iteration, for instance in a case where we know that our reconstruction resolution will be quite far from Nyquist.
Amedee
On Jan 19, 2016, at 3:32 AM, Sjors Scheres <[log in to unmask]> wrote:
> Dear Reza,
>
> The larger the box size, the more memory the job will take (in particular
> the last iteration of auto-refine, which is out to Nyquist). The
> FFT-favorable box sizes will only affect the maximization steps in a
> (perhaps) noticeable way, but not to the extent that Leo reports. The
> expectation step doesn't do many FFTs at all, so it's negligible there.
>
> We've gone through this before, but let me re-iterate about the
> threads/nodes/MPIs here, as this is a question that arises with many new
> users. Relion is hybridly parallel: it will use threads for shared-memory
> parallelization and MPI for distributed-memory parallelization. This is to
> take advantage of modern clusters of multi-core machines. Each node (a
> separate computer, with its own memory (RAM), that is connected to all the
> other nodes of the cluster through network cables) will have multiple
> cores, i.e. processing units that can be run in parallel. For example, we
> have a bit older 8-core nodes and more modern 12-core nodes on our
> cluster. Probably your laptop will be 2-cores or 4-cores. You can run
> processes in parallel on those multiple cores. They can be MPI processes
> or threads. In RELION both types of parallelizations will process sub-sets
> of the particles. The former do not see each other's memory and send
> messages to each other instead. The latter all see the same shared memory
> within a node. If you have an 8-core node, you may run 8 threads or 8 MPIs
> on it, so they are all busy. Often it may be a bit more efficient to
> over-run them a bit, for example by running 10 threads on 8 cores. This is
> because not all processes actually use the core 100% of the time, as they
> also need to read/write from disk etc. Now, the MPI processes don't see
> each other's memory, which means that if you run 8 (or 10) MPIs on an
> 8-core node, you will need to replicate the memory of each process 8
> times. That's where the main advantage of threads lies: by sharing the
> memory you can hold the Fourier transforms of the reference (and the sums
> of all backprojected particles) in memory only once for all 8 threads.
> Thereby, you can process much larger references (boxes) than when running
> 8 MPI processes. The main advantage of the MPI implementation is
> scalability: because these processes do not (need to) see each other's
> memory, you can connect many nodes together to divide the work over many
> different nodes, and thus be done faster. Now, the MPI implementation is
> per-particle probably a bit faster than the thread-implementation:
> processing 1 particle will go faster with 8 MPIs than with 8 threads. The
> difference depends however on many things. Therefore, some jobs will go
> (somewhat) faster by not using as many threads as there are cores on each
> node. Of course, you can only do this as long as the memory required is
> small enough to replicate for example the memory of 2 or 4 MPI processes
> on each node (each running 4 or 2 threads, respectively). However, there
> is often a limit on scalability of MPI processes as well: at the end of an
> (expectation) iteration, the results from all MPI processes need to be
> combined. This is quite a bit of data that needs to go either through the
> network cable or is written and read to/from the hard disk. There is a
> command line option called (--dont_combine_weights_via_disc) to control
> this choice. We had problems with unstable network connections on our
> cluster and had more stable (less crashed) runs using the disk. But disk
> access may also become very limiting in speed. You will notice that the
> mouse has reached the cheese and it still takes a long time to go into
> maximization. Using fewer MPI processes and more threads makes this
> problem less severe. So, I'm afraid that in the end the answer to your
> question is 'it depends', it depends on your calculations (2D or 3D
> classification or auto-refine), box size, SNRs, but also your computer
> cluster setup. Hopefully the information in this message helps in finding
> a good strategy on your system.
> HTH,
> Sjors
>
>
>
>> Hi,
>> I have two ignorant and possibly incoherent questions to ask. Are certain
>> box sizes better to use for improving computation time (e.g. 512 vs 496)?
>> Also, is there any advice on what strategy of threads/node/MPIs one
>> should use to expedite structure solution, or is this an empirical
>> determination?
>> Best wishes,
>> Reza
>> Reza Khayat, PhD
>> Assistant Professor
>> City College of New York
>> Department of Chemistry
>> New York, NY 10031
>> ________________________________________
>> From: Collaborative Computational Project in Electron cryo-Microscopy
>> <[log in to unmask]> on behalf of Robert McLeod
>> <[log in to unmask]>
>> Sent: Sunday, January 17, 2016 5:37 PM
>> To: [log in to unmask]
>> Subject: Re: [ccpem] slow last iteration in autorefine
>> Evening,
>> I did some quick benchmarks on my home computer (Core i5 with 4 threads);
>> it seems 496x496x496 is not too horrible:
>> FFTPack time for 512 in (s): 5.402000
>> FFTW planning time (FFTW_MEASURE) for 512^3 in (s): 0.962000
>> FFTW execution time (FFTW_MEASURE) for 512^3 in (s): 0.386000
>> FFTW planning time (FFTW_PATIENT) for 512^3 in (s): 29.558000
>> FFTW execution time (FFTW_PATIENT) for 512^3 in (s): 0.428000
>> FFTPack time for 496^3 in (s): 11.753000
>> FFTW planning time (FFTW_MEASURE) for 496^3 in (s): 0.909000
>> FFTW execution time (FFTW_MEASURE) for 496^3 in (s): 1.131000
>> FFTW planning time (FFTW_PATIENT) for 496^3 in (s): 19.574000
>> FFTW execution time (FFTW_PATIENT) for 496^3 in (s): 1.219000
>> The 512x512x512 box is only about 250% faster than the 496x496x496 box.
>> This is too small to make a 3 minute job into a 4 hour job. Although
>> 496^3 can be subdivided into equal blocks with 4 threads, not so with 15
>> threads.
>> On the higher planning levels (MEASURE and PATIENT) actual cycle-counting
>> simulations are done on the processor to assess which arrangement of
>> blocks to execute the FFT algorithm on. Sometimes this doesn't work well.
>> I have seen, intermittently, that if the planning is run on a processor
>> that is busy with other threads, it will return a terrible plan that
>> will then slow everything to a crawl until I manually go in and re-plan /
>> delete the FFTW wisdom. Looking at the source, Relion seems to use
>> FFTW_ESTIMATE, which is generally less aggressive and hence safer (but I
>> use FFTW_MEASURE). It looks like there's some debug output if you set
>> #define DEBUG_PLANS somewhere during compilation (in fftw.h for example).
>> Robert
>> --
>> Robert McLeod, Ph.D.
>> Center for Cellular Imaging and Nano Analytics (C-CINA)
>> Biozentrum der Universität Basel
>> Mattenstrasse 26, 4058 Basel
>> Office: +41.061.387.3225
>> [log in to unmask]
>> ________________________________________
>> From: Collaborative Computational Project in Electron cryo-Microscopy
>> [[log in to unmask]] on behalf of Leo Sazanov [[log in to unmask]]
>> Sent: Sunday, January 17, 2016 6:40 PM
>> To: [log in to unmask]
>> Subject: Re: [ccpem] slow last iteration in autorefine
>> Dear Sjors,
>> Thank you - we tried various combinations before and the one with 2 MPIs
>> (15 threads each) per node gives the fastest "normal" iterations. This
>> setup also seems to be the best for the last iteration, although it is
>> difficult to be sure as it still takes 2-4 days to run (and this is with
>> up to 15 nodes in total per job).
>> But do you think Robert McLeod's suggestion about a big prime number in the
>> decomposition of 496 might be correct?
>> This would actually be consistent with the fact that in the 3D
>> classification run consecutive maximization iterations can run in a
>> pattern like this: 3 mins, 4 hours, 3 mins, 3 mins, 4 hours, etc. And
>> those taking long to run do have a big prime number in the
>> decomposition of CurrentImageSize for the iteration.
>> There seems to be no strict dependence, though, as some iterations with
>> a big prime number in the decomposition of CurrentImageSize do run fast.
>> If that is right we will try 512 box size.
>> What do you think?
>> Leo
>> Prof. Leonid Sazanov
>> IST Austria
>> Am Campus 1
>> A-3400 Klosterneuburg
>> Austria
>> Phone: +43 2243 9000 3026
>> E-mail: [log in to unmask]
>> Web: https://ist.ac.at/research/life-sciences/sazanov-group/
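[Editorial sketch] The "big prime number in the decomposition" point can be checked directly. The snippet below is a minimal illustration, not anything taken from the RELION source: the helper names `prime_factors` and `next_smooth_size` are hypothetical, the cut-off of 7 for a "small" prime is an assumed rule of thumb for FFT-friendly sizes, and the restriction to even sizes is likewise an assumption made for box sizes.

```python
def prime_factors(n):
    """Return the prime factorization of n (n >= 2) as a sorted list."""
    factors = []
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors.append(p)
            n //= p
        p += 1
    if n > 1:
        factors.append(n)  # whatever remains is itself prime
    return factors


def next_smooth_size(n, max_prime=7):
    """Smallest even size >= n whose prime factors are all <= max_prime.

    max_prime=7 is an illustrative threshold, not a RELION or FFTW setting.
    """
    size = n + (n % 2)  # round up to an even number
    while max(prime_factors(size)) > max_prime:
        size += 2
    return size


for box in (496, 512):
    print(box, prime_factors(box), "next smooth size:", next_smooth_size(box))
```

For the two box sizes discussed here, 496 factors as 2^4 x 31 while 512 is 2^9, so 496 carries the large prime 31 and 512 does not. The same check could be applied to each iteration's CurrentImageSize to see whether the slow iterations line up with large prime factors.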
>> On 17/01/2016 16:59, Sjors Scheres wrote:
>>> Dear Leo,
>>> If each MPI process takes 30 GB, you could run multiple MPI processes per
>>> node. Having 32 hyper-threaded cores, you could for example run 2 MPIs
>>> per node, each launching 16 threads. Perhaps 4 MPIs, each running 8
>>> threads may run a bit faster. Then, you could scale up by using as many
>>> nodes as you have in your cluster. If you have say 10 of those nodes, then
>>> it shouldn't take 3 days for a single iteration.
>>> HTH,
>>> Sjors
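[Editorial sketch] The trade-off between MPI processes and threads per node comes down to simple arithmetic: every MPI process holds its own copy of the memory, while threads within a process share it. The snippet below enumerates the per-node layouts that use all cores and still fit in RAM, using the figures quoted in this thread (about 30 GB per MPI process, 128 GB and 32 hyper-threaded cores per node). The function name `layouts` is hypothetical, and the model assumes that memory per process does not depend on the thread count and ignores the master process and operating-system overhead.

```python
def layouts(cores_per_node, ram_gb, gb_per_mpi):
    """List (mpi_per_node, threads_per_mpi, memory_gb) layouts that fit in RAM.

    Assumes each MPI process needs gb_per_mpi regardless of its thread count
    (a simplification) and that all cores on the node should be occupied.
    """
    result = []
    for n_mpi in range(1, cores_per_node + 1):
        if cores_per_node % n_mpi:
            continue  # only consider layouts that divide the cores evenly
        mem = n_mpi * gb_per_mpi  # memory is replicated once per MPI process
        if mem <= ram_gb:
            result.append((n_mpi, cores_per_node // n_mpi, mem))
    return result


for n_mpi, n_threads, mem in layouts(32, 128, 30):
    print(f"{n_mpi} MPI x {n_threads} threads per node: {mem} GB")
```

Under these assumptions the layouts that fit are 1 x 32, 2 x 16 and 4 x 8; 8 MPIs per node would need 240 GB and no longer fit in 128 GB. Which of the fitting layouts is fastest still has to be measured on the cluster in question.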
>>>> Dear all,
>>>> We are still struggling with this - it is very frustrating that with a 496
>>>> pixel box the last maximization iteration in autorefine takes 2-3-4 days
>>>> (and apparently nothing happens during this time, no progress output,
>>>> though CPUs are used).
>>>> We have plenty of CPUs (usually we use ~17 MPIs with 15 threads = 255
>>>> threads per job) and memory (128 GB per node with 32 hyper-threaded
>>>> cores), so there is no swapping to disk. Memory requested by Relion in the
>>>> last iteration is about 30GB.
>>>> I wonder if people could share their examples of how long this iteration
>>>> takes on their set-up, especially with a large box of about 500 pixels?
>>>> And whether anybody resolved a similar problem?
>>>> Many thanks!
>>>> Hi Leo,
>>>> It also puts pixels until Nyquist back into the 3D transform, so will cost
>>>> more CPU than the other iterations.
>>>> HTH
>>>> Sjors
>>>>> Hi, still an important question for us -
>>>>> It does not look like overall I/O cluster load is a big issue and memory
>>>>> also is not an issue.
>>>>> What else can be done to speed up the last iteration in 3D autorefine (496
>>>>> box, 128 GB memory per node)?
>>>>> Now it takes up to several days so we really want to do something about
>>>>> it.
>>>>> Apart from using more memory per image, what else is different about the
>>>>> last 3D autorefine operation so that it is so slow?
>>>>> Many thanks!
>>>>> On our cluster we started to get exceedingly long times for the last
>>>>> iteration in 3D autorefine (with large box). There is definitely enough
>>>>> RAM so there is no swapping. Previously the same jobs ran about 10X faster
>>>>> on our cluster, so I wonder if the problem is in general I/O bottlenecks
>>>>> in the cluster.
>>>>> Is there a lot of particle image reading in the final maximisation step
>>>>> (takes up to a day now)?
>>>>> Thanks!
>
>
> --
> Sjors Scheres
> MRC Laboratory of Molecular Biology
> Francis Crick Avenue, Cambridge Biomedical Campus
> Cambridge CB2 0QH, U.K.
> tel: +44 (0)1223 267061
> http://www2.mrc-lmb.cam.ac.uk/groups/scheres