As this is a very common question, let me answer (again) in a bit more
detail. I'm afraid there is no single answer that fits everyone, but
hopefully this will help you understand your system better. You can
also find more information in the RELION tutorial and WIKI, as well as
in a review I wrote a while ago on ML classification (in XMIPP, but
most statements apart from the ART algorithm still hold for RELION).
There are two complementary ways of parallelising your calculations in
RELION: distributed-memory parallelisation through MPI and shared-memory
parallelisation through (POSIX) threads. On a 12-core machine you could
run 12 MPI processes with 1 thread each, or 1 sequential job with 12
threads, or 4 MPI processes with 3 threads each, or 6 MPI processes
with 2 threads each, etc. Some people have found that using slightly
more threads than there are cores may actually yield better speed,
probably because the threads do not run at 100% CPU most of the time.
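To make this concrete: on a 12-core machine those options could look
something like the following (relion_refine/relion_refine_mpi and the
--j thread flag are RELION's actual names; all other arguments are
omitted for brevity):

    relion_refine --j 12 ...                  # 1 process, 12 threads
    mpirun -n 12 relion_refine_mpi --j 1 ...  # 12 MPI processes, 1 thread each
    mpirun -n 4 relion_refine_mpi --j 3 ...   # 4 MPI processes, 3 threads each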
The hybrid parallelisation scheme lets you get the most out of a
multi-core cluster. You'll have to run at least 1 MPI process per
cluster node (because pthreads cannot see the memory of a distinct
node), but you could run more than one on each node. Which is most
efficient depends on the type of job you're running: 2D classifications
typically take less RAM than 3D runs. If you use little RAM, then MPI
processes may be more efficient, because they each run at 100% CPU most
of the time, whereas, as mentioned above, the threads in RELION do not.
However, there will always be a turning point where the communication
between many MPI processes starts to take longer than the speed gained
by using more of them. Again, 3D runs take longer to communicate
because there is more data to share. If you use a lot of RAM, e.g. in
3D runs with large volumes, then running many MPI processes on each
cluster node becomes a problem, as together they take so much RAM that
you run out of it. That's where the threads come in handy: you can
still perform your calculations in parallel, yet without replicating
the memory in distinct MPI processes.
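As an illustration, a submission script for the hybrid scheme on
12-core nodes might look something like this (a minimal sketch assuming
a PBS/Torque-style scheduler and OpenMPI's -npernode option; adapt the
directives to your own queueing system):

    #!/bin/bash
    #PBS -l nodes=4:ppn=12
    # 2 MPI processes per node, 6 threads each: 2 x 6 = 12 cores per node,
    # so only 2 copies of the data have to fit in each node's RAM
    mpirun -np 8 -npernode 2 relion_refine_mpi --j 6 ...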
With respect to scalability: there are as many potential bottlenecks as
there are possible setups. Combining all the information from each MPI
process at the end of every iteration is certainly one of them.
Previous versions of RELION did this by default over the network, which
on our cluster led to instabilities, possibly due to bugs in some of
our network cards. We then moved to combining all information through
the writing of large (~GB) temporary files. This may also quickly
become limiting if you have relatively slow access to disc. You can
revert to the previous behaviour with the
--dont_combine_weights_via_disc argument.
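For example (an illustrative fragment; all other arguments omitted):

    # combine the weights over the network rather than via large files on disc
    mpirun -n 8 relion_refine_mpi --dont_combine_weights_via_disc ...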
Another bottleneck is reading the images from disc (all done by the
master MPI process) and sending the information to all of the other MPI
nodes. This will depend on the speed of the network connections between
your computing nodes. Again, earlier versions of RELION had all MPI
processes reading in parallel, but we had some serious scalability
issues there: our NFS got into big trouble when too many processes read
simultaneously. We then moved to the master-reads-all setting (and
removed the functionality for parallel reading). Running very many
threads on jobs with relatively few calculations per particle (e.g. 2D
classification) may also bring efficiency down, as the thread overhead
starts to outweigh the gain from the parallel calculations. This can be
monitored with top: n-threaded jobs would ideally run at n*100% CPU at
all times.
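For example (assuming the jobs run under your own user name):

    top -u $USER   # a job started with --j 4 should ideally sit near 400% CPU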
Finally: on our setup (12-core nodes, each with 64 GB RAM) we typically
use up to 200-300 cores per job. We mostly use many MPI processes with
only a few threads each for 2D classifications, and increase the number
of threads (bringing down the number of MPI processes correspondingly)
for the more RAM-hungry 3D refinements and classifications. We have
observed that using 1,000 cores does _not_ run faster than using ~300
cores on our system (!!): so there certainly is a limit to scalability
you should take into account. (This may vary wildly from cluster to
cluster though, depending on all the things I mentioned above.) Still,
when using data sets of initially 100-200k particles (at ~1.3 A/pixel)
and classifying several tens of thousands out of those, using 200-300
CPUs in parallel to do everything from auto-picking to final movie
refinement can give you a better than 3.5A resolution map in less than 2
weeks wall-clock time (for favourable samples like ribosomes).
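Purely as an illustration of that split on 12-core/64 GB nodes (the
numbers are made up, not a recommendation):

    # 2D classification: many MPI processes, few threads (300 cores in total)
    mpirun -n 150 relion_refine_mpi --j 2 ...
    # 3D refinement/classification: fewer MPI processes with more threads,
    # so fewer copies of the large volumes must fit in the 64 GB per node
    mpirun -n 50 relion_refine_mpi --j 6 ...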
On 03/06/2014 09:48 AM, Ludovic Renault wrote:
> I have been given access to Durham's Cluster at a reasonable price per
> cpu per hour.
> While it is reasonable, I still would like us to spend as little money
> as possible while keeping a good processing speed when using relion. I
> was thus thinking of using more memory and fewer cores.
> I have tried playing with the J value and the number of MPI processes
> on my local machine. I see similar processing times with 10 MPI/J1 and
> 4 MPI/J4, and even 10 MPI/J8 is faster than 10 MPI/J1.
> I know Durham's cluster has hyperthreading turned on and lots of
> memory available that relion is not really using. Do you think it
> would make sense to increase the J value to reduce the cost of usage,
> or will our runs just take longer, so that the gain wouldn't be worth it?
> I have been asked if the code was scaling very well or not ... I
> suspect it is, and that is why using J is not recommended. Is that correct?
> Have you done any benchmarking on your own cluster?
> Any comments would be much appreciated.
MRC Laboratory of Molecular Biology
Francis Crick Avenue, Cambridge Biomedical Campus
Cambridge CB2 0QH, U.K.
tel: +44 (0)1223 267061