
CCPEM Archives


CCPEM@JISCMAIL.AC.UK


CCPEM  March 2014


Subject:

Re: mpi vs threads

From:

Sjors Scheres <[log in to unmask]>

Reply-To:

Sjors Scheres <[log in to unmask]>

Date:

Thu, 6 Mar 2014 16:11:26 +0000

Content-Type:

text/plain


Hi Ludo,

As this is a very common question, let me answer (again) in a bit more 
detail. I'm afraid there is not a single answer for all, but hopefully 
this will help you to understand your system better. You could also get 
more info from the RELION tutorial and WIKI as well as a review I wrote 
a while ago on ML classification (in XMIPP, but most statements apart 
from the ART algorithm still hold for RELION): 
http://www.sciencedirect.com/science/article/pii/S0076687910820129

There are two complementary ways of parallelising your calculations in 
RELION: distributed-memory parallelisation through MPI and shared-memory 
parallelisation through (POSIX) threads. If you had a 12-core machine
you could run 12 MPI processes each with 1 thread, or 1 sequential job
with 12 threads, or 4 MPI processes with 3 threads, 6 MPI processes with
2 threads, etc. Some people have seen that running slightly more threads
than available cores may actually yield better speeds, probably because
the threads do not run at 100% CPU most of the time.
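
A minimal sketch of that enumeration, listing every way to split a given
number of cores into MPI processes times threads. The function name
hybrid_layouts is hypothetical and only serves the illustration; it is
not part of RELION.

```python
def hybrid_layouts(cores):
    """Return every (mpi_processes, threads_per_process) pair whose
    product uses exactly `cores` cores."""
    return [(mpi, cores // mpi) for mpi in range(1, cores + 1) if cores % mpi == 0]


if __name__ == "__main__":
    # The 12-core machine from the text: includes 12x1, 1x12, 4x3 and 6x2.
    for mpi, threads in hybrid_layouts(12):
        print(f"{mpi:2d} MPI x {threads:2d} threads")
```

The sketch only covers exact splits; oversubscribing with slightly more
threads than cores, as mentioned above, is a separate experiment.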

The hybrid parallelisation scheme allows you to get the most out of a
multi-core cluster. You'll have to run at least 1 MPI process per cluster
node (because pthreads cannot see the memory of a distinct node), but
could run more than one on each node. Which setup is most efficient
depends on the type of job you're running: 2D classifications
typically take less RAM than 3D runs. If you use little RAM, then MPI
may be more efficient because the MPI processes each run at 100% CPU most
of the time, whereas as mentioned above the threads in relion do not.
However, there will always be a turning point, where the communication
between many MPI processes starts to cost more time than is gained by
using more of them. Again, the MPI processes in 3D runs take longer to
communicate with each other because there is more data to share. If you
use a lot of RAM, e.g. in 3D runs with large volumes, then running many
MPI processes on each cluster node will become a problem as they each
take so much RAM that you run out of it. That's where the threads come in
handy: you can still perform your calculations in parallel, yet without
replicating the memory in distinct MPI processes.
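
The RAM trade-off can be sketched as a small calculation: fit as many
MPI processes on a node as the memory allows, then fill the remaining
cores with threads. The function name and the per-process RAM figures
are hypothetical; the real memory footprint has to be measured for your
own job.

```python
def layout_for_node(cores_per_node, node_ram_gb, ram_per_mpi_gb):
    """Pick MPI processes per node so their combined RAM fits, then give
    each process an equal share of the cores as threads. Threads share
    the memory of their process, so they add no extra copies."""
    if ram_per_mpi_gb > node_ram_gb:
        raise ValueError("a single MPI process does not fit in node RAM")
    max_by_ram = int(node_ram_gb // ram_per_mpi_gb)
    mpi = max(1, min(cores_per_node, max_by_ram))
    threads = cores_per_node // mpi  # may leave cores idle if not divisible
    return mpi, threads


if __name__ == "__main__":
    # 12 cores and 64 GB per node; the per-process RAM values are made up.
    for ram in (4, 20, 40):
        print(ram, "GB per process ->", layout_for_node(12, 64, ram))
```

This ignores the communication cost between MPI processes, so it gives
an upper bound on the number of processes per node rather than the
fastest setting.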

With respect to scalability: there are many potential bottlenecks, and
just as many possible setups. Combining all the information from each
MPI process at the end of every iteration is certainly one of them.
Previous versions of relion did this by default over the network, which
on our cluster led to instabilities, possibly due to bugs on some of our
network cards. We then moved to combining all information by writing out
large (~GB) temporary files. This may also quickly become limiting if
you have relatively slow access to disc. You can revert to the previous
behaviour by using the --dont_combine_weights_via_disc argument. Another
bottleneck is reading the images from disc (all done by the master MPI
process) and sending the information to all of the other MPI processes.
This will depend on the speed of the network connections between your
computing nodes. Again, earlier versions of relion had all MPI processes
reading in parallel, but we had some serious scalability issues there:
our NFS would have big trouble when too many processes did this
simultaneously. We then moved to the master-reads-all setting (and
removed the functionality for parallel reading). Running very many
threads on jobs with relatively few calculations per particle (e.g. 2D
classification) may also bring efficiency down, as thread overhead
starts to outweigh the gain from parallel calculations. This can be
monitored by using top. An n-threaded job would ideally run at n*100%
CPU all the time.
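
The n*100% rule of thumb can be turned into a simple efficiency figure.
This is a minimal sketch: the function name is hypothetical and the CPU
percentage is assumed to be read by hand from the %CPU column of top.

```python
def thread_efficiency(cpu_percent, n_threads):
    """Fraction of the ideal n*100% CPU that a process actually uses."""
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    return cpu_percent / (100.0 * n_threads)


if __name__ == "__main__":
    # Made-up reading: a 6-thread process that top reports at 450% CPU.
    print(f"{thread_efficiency(450.0, 6):.2f}")
```

A value well below 1 suggests that thread overhead is eating into the
gain and that fewer threads per MPI process may be worth trying.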

Finally: on our setup (12-core nodes, each with 64 GB RAM) we typically 
use up to 200-300 cores per job. We mostly use only a few threads and 
many MPI processes for 2D classifications and increase the number of 
threads (and bring down the number of MPI processes simultaneously) for 
the larger-RAM-requirement 3D refinements and classifications. We have 
observed that using 1,000 cores does _not_ run faster than using ~300 
cores on our system (!!): so there certainly is a limit to scalability 
you should take into account. (This may vary wildly from cluster to 
cluster though, depending on all the things I mentioned above.) Still, 
when using data sets of initially 100-200k particles (at ~1.3 A/pixel) 
and classifying several tens of thousands out of those, using 200-300 
CPUs in parallel to do everything from auto-picking to final movie 
refinement can give you a better than 3.5A resolution map in less than 2 
weeks wall-clock time (for favourable samples like ribosomes).
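
Why more cores can stop helping is easy to see in a toy model. The
sketch below assumes that compute time shrinks as 1/n while
communication time grows linearly with the number of processes n. Both
the model and its constants are assumptions chosen for illustration;
they are not measurements from any cluster.

```python
import math


def wall_time(n_procs, compute_cost, comm_cost_per_proc):
    """Toy model: perfectly parallel compute plus communication that
    grows linearly with the number of processes."""
    return compute_cost / n_procs + comm_cost_per_proc * n_procs


def best_proc_count(compute_cost, comm_cost_per_proc):
    """Process count minimising wall_time (where its derivative is zero)."""
    return math.sqrt(compute_cost / comm_cost_per_proc)


if __name__ == "__main__":
    # Arbitrary units and made-up constants.
    for n in (100, 300, 1000):
        print(n, wall_time(n, 90000.0, 1.0))
    print("turning point:", best_proc_count(90000.0, 1.0))
```

In this model, going past the turning point makes each iteration slower
rather than faster. Where that point lies on a real system depends on
the network, disc and job type discussed above and has to be found by
benchmarking.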

HTH, S


On 03/06/2014 09:48 AM, Ludovic Renault wrote:
> Hi,
> I have been given access to Durham's Cluster at a reasonable price per 
> cpu per hour.
> While it is reasonable, I still would like us to spend as little money 
> as possible while keeping a good processing speed when using relion. I 
> was thus thinking of using more memory and fewer cores.
> I have tried to play on my local machine with the J value and the 
> number of MPI. I do see a similar processing time with 10 MPI/J1 and 4 
> MPI/J4, and even up to 10 MPI/J8 is faster than 10 MPI/J1.
> I know Durham's cluster has hyperthreading turned on and lots of 
> memory available that relion is not really using. Do you think it 
> would make sense to increase the J value to reduce the cost of usage 
> or will our runs just take longer, so that the gain wouldn't be worth it?
> I have been asked if the code was scaling very well or not ... I 
> suspect it is and that is why using J is not recommended. Is that correct?
> Have you done any benchmarking on your own cluster?
> Any comments would be much appreciated.
> Thanks,
> Ludo

-- 
Sjors Scheres
MRC Laboratory of Molecular Biology
Francis Crick Avenue, Cambridge Biomedical Campus
Cambridge CB2 0QH, U.K.
tel: +44 (0)1223 267061
http://www2.mrc-lmb.cam.ac.uk/groups/scheres
