Thanks Craig, this is likely very useful information for many people!

Best,
Sjors

> Glad to help,
> I definitely noticed; that's how I caught it (this default behavior in
> OpenMPI also changed at some point around v1.6), but most people launch
> MPI programs through some kind of scheduler that may mask this default.
> I've noticed that when running MPI programs under Slurm, this doesn't
> happen.
>
>> On Apr 28, 2016, at 1:28 PM, Bharat Reddy <[log in to unmask]> wrote:
>>
>> Hi Craig,
>>
>> Your suggestion worked perfectly. 'mpirun --bind-to none' let me use
>> the number of threads I requested relion_refine_mpi to use. From
>> reading the mpirun documentation, by default processes are bound to a
>> core. I can only guess that because my Intel processors have
>> hyperthreading, mpirun was only running two threads per core.
>>
>> I am curious how many people are aware of OpenMPI's NUMA locking
>> support, as I suspect many people are underutilizing their hardware.
>>
>> Cheers,
>> BR
>>
>> On Thu, 2016-04-28 at 19:19 +0000, Craig Yoshioka wrote:
>>> If using OpenMPI compiled with NUMA locking support (likely the
>>> default), MPI processes get bound to assigned cores automatically.
>>> This is to prevent context switches and cache misses. If you are
>>> invoking via `mpirun`, I believe the correct flag is
>>> `mpirun --bind-to none`... or you can let MPI know that each process
>>> is going to use more than a single thread using --map-by or some
>>> other flag.
>>>
>>>> On Apr 28, 2016, at 12:06 PM, Sjors Scheres <[log in to unmask]> wrote:
>>>>
>>>> One more thing: you could give the program A LOT to calculate for
>>>> each particle, just as a test of whether it's access to the images
>>>> that is somehow a bottleneck. You could do this by increasing the
>>>> angular sampling 4-fold.
>>>> HTH,
>>>> S
>>>>
>>>>> Hi Sjors,
>>>>>
>>>>> I agree I am being hampered by something.
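The binding behaviour Craig describes can be checked directly from the shell. A minimal sketch (Linux-only, relying on the `/proc` status interface) that prints the CPU-affinity list of a process — a rank bound to a single core lists only one or two CPU IDs, while an unbound rank lists the machine's full CPU range:

```shell
# Print the CPU-affinity list of the current process (here: this shell).
# To inspect a running MPI rank instead, use /proc/<rank-pid>/status.
# A bound rank shows e.g. "0-1"; an unbound one shows e.g. "0-39".
grep Cpus_allowed_list /proc/self/status
```

OpenMPI's `mpirun --report-bindings` flag prints the same information per rank at launch time, which is a quicker way to confirm what `--bind-to none` changes.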
>>>>> My workstation has dual 10-core Xeon processors, which with
>>>>> hyperthreading should give me a total of 40 threads. Additionally,
>>>>> it has 128 GB of RAM and an 8-drive RAID 5 array giving over
>>>>> 1 GB/s throughput, which according to 'top' isn't being taxed at
>>>>> all.
>>>>>
>>>>> An interesting observation is that when I use relion_refine, I
>>>>> have no problem running it with more than 2 threads (4, 8, 16,
>>>>> etc., verified using 'top'). However, when I use
>>>>> relion_refine_mpi, I am stuck at 2 threads per MPI process.
>>>>>
>>>>> Some additional info:
>>>>> 1) I have seen this issue on two different systems running
>>>>> CentOS 7 and Fedora 23, with 20 and 4 cores respectively.
>>>>> 2) I have seen this issue with my own compiled version of RELION
>>>>> 1.4 and with SBGrid.org's compiled version.
>>>>> 3) I suspect this might be a reason the program is not scaling as
>>>>> well as we would like on our cluster, a Cray XE6 system.
>>>>>
>>>>> Cheers,
>>>>> Bharat Reddy
>>>>> Post Doc
>>>>> University of Chicago
>>>>>
>>>>> On Thu, 2016-04-28 at 10:10 +0100, Sjors Scheres wrote:
>>>>>>
>>>>>> Hi again,
>>>>>> The program is actually using 4 threads (as seen from the
>>>>>> stdout). The fact that top reports 200% means that your threads
>>>>>> are hampered by something else. This could for example be the
>>>>>> reading of particles from the hard disk, which can become a
>>>>>> bottleneck. Also: how many cores does
>>>>>> nsit-dhcp-148-090.bsd.uchicago.edu have? You're running 4 MPI
>>>>>> slaves, each with 4 threads. The master also takes 1 core.
>>>>>> Therefore, your machine should have 17 cores to do everything you
>>>>>> ask for. If it has fewer cores, then they'll just be in each
>>>>>> other's way.
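Sjors's core count follows a simple rule: one master rank plus N slave ranks, each running J threads. A quick shell sketch of that arithmetic, using the 4-slave/4-thread values from this thread, with `nproc` to show what the host actually offers:

```shell
# Cores needed = 1 master + NSLAVES ranks x NTHREADS threads each.
NSLAVES=4
NTHREADS=4
NEEDED=$(( 1 + NSLAVES * NTHREADS ))
echo "cores needed: $NEEDED"       # 17 for the values above
echo "cores available: $(nproc)"   # logical CPUs on this host
```

If `nproc` reports fewer cores than needed, the threads will oversubscribe the machine and 'top' percentages per process will drop accordingly.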
>>>>>> HTH,
>>>>>> Sjors
>>>>>>
>>>>>> On 04/27/2016 08:18 PM, Baru Reddy wrote:
>>>>>>>
>>>>>>> Hi Sjors,
>>>>>>> 'top' says each MPI process is running at ~200%. That ~200%
>>>>>>> figure is the basis for my claim that it is only using 2
>>>>>>> threads. The command I use and the initial output I get are
>>>>>>> shown below.
>>>>>>>
>>>>>>> mpirun -n 5 ~/Downloads/relion-1.4/bin/relion_refine_mpi --o
>>>>>>> Class3D/run1_ct5 --continue Class3D/run1_it005_optimiser.star
>>>>>>> --iter 25 --tau2_fudge 4 --solvent_mask proteasome_mask_150.mrc
>>>>>>> --oversampling 1 --healpix_order 3 --offset_range 5
>>>>>>> --offset_step 2 --j 4 &
>>>>>>>
>>>>>>> [reddybg@nsit-dhcp-148-090 gauto]$ === RELION MPI setup ===
>>>>>>> + Number of MPI processes           = 5
>>>>>>> + Number of threads per MPI process = 4
>>>>>>> + Total number of threads therefore = 20
>>>>>>> + Master (0) runs on host  = nsit-dhcp-148-090.bsd.uchicago.edu
>>>>>>> + Slave   1 runs on host   = nsit-dhcp-148-090.bsd.uchicago.edu
>>>>>>> + Slave   2 runs on host   = nsit-dhcp-148-090.bsd.uchicago.edu
>>>>>>> + Slave   3 runs on host   = nsit-dhcp-148-090.bsd.uchicago.edu
>>>>>>> + Slave   4 runs on host   = nsit-dhcp-148-090.bsd.uchicago.edu
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Bharat Reddy
>>>>>>> Post Doc
>>>>>>> University of Chicago
>>>>>>>
>>>>>>> From: Sjors Scheres <[log in to unmask]>
>>>>>>> To: Baru Reddy <[log in to unmask]>
>>>>>>> Cc: [log in to unmask]
>>>>>>> Sent: Wednesday, April 27, 2016 2:08 PM
>>>>>>> Subject: Re: [ccpem] Stuck at 2 Threads per MPI Process
>>>>>>>
>>>>>>> Hi Bharat,
>>>>>>> --j N should always launch N threads. You'll only see them as 1
>>>>>>> process in 'top', but it may run up to ~N00%. Why do you say
>>>>>>> RELION launches only 2 threads? How do you see this? Does it say
>>>>>>> so in the stdout?
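As this exchange shows, 'top' reports %CPU per process, which conflates thread count with how busy the threads are: a rank with 4 threads squeezed onto 2 cores still tops out near 200%. To count the threads themselves, `ps` can report the number of kernel threads (NLWP) directly. A minimal sketch, using the current shell's PID as a stand-in for an MPI rank's:

```shell
# NLWP = number of lightweight processes (threads) in the process.
# Substitute an MPI rank's PID for $$ to see how many threads it
# really has, independent of the %CPU figure shown by 'top'.
ps -o nlwp= -p $$
```

Alternatively, `top -H` lists each thread on its own line, which makes it obvious whether threads exist but are idling.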
>>>>>>> S
>>>>>>>
>>>>>>>> Hi Everyone,
>>>>>>>> Currently we are trying to mobilize the power of threads, as
>>>>>>>> our refinements have become more memory intensive and we have
>>>>>>>> hit a limit on the number of MPI processes we can deploy. The
>>>>>>>> problem is that however many threads I tell relion_refine_mpi
>>>>>>>> to use (--j X, where X is 4, 8, 16, etc.), it only uses two
>>>>>>>> threads. Is there a setting I am missing, a variable I am
>>>>>>>> failing to define, or is this a limit of relion_refine_mpi?
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Bharat Reddy
>>>>>>>> Post Doc
>>>>>>>> University of Chicago
>>>>
>>>> --
>>>> Sjors Scheres
>>>> MRC Laboratory of Molecular Biology
>>>> Francis Crick Avenue, Cambridge Biomedical Campus
>>>> Cambridge CB2 0QH, U.K.
>>>> tel: +44 (0)1223 267061
>>>> http://www2.mrc-lmb.cam.ac.uk/groups/scheres

--
Sjors Scheres
MRC Laboratory of Molecular Biology
Francis Crick Avenue, Cambridge Biomedical Campus
Cambridge CB2 0QH, U.K.
tel: +44 (0)1223 267061
http://www2.mrc-lmb.cam.ac.uk/groups/scheres