Thanks Craig,
This is likely very useful information for many people!
best,
Sjors

> Glad to help,
> I definitely noticed; that's how I caught it (this default behavior in
> OpenMPI also changed at some point around v1.6). Most people launch MPI
> programs through some kind of scheduler, though, which may mask this
> default. I've noticed that when running MPI programs under Slurm, this
> doesn't happen.
>
>
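A quick way to see what binding a scheduler or mpirun actually applies is OpenMPI's --report-bindings flag; a minimal sketch (the program name here is just a placeholder):

  mpirun --report-bindings -n 4 ./my_program

Each rank then reports on stderr which cores it was bound to, so an unintended bind-to-core default is easy to spot.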
>> On Apr 28, 2016, at 1:28 PM, Bharat Reddy <[log in to unmask]> wrote:
>>
>> Hi Craig,
>>
>> Your suggestion worked perfectly: 'mpirun --bind-to none' lets
>> relion_refine_mpi use the number of threads I requested. From reading
>> the mpirun documentation, processes are bound to a core by default. My
>> guess is that because my Intel processors have hyperthreading, each
>> bound core offered only two hardware threads, so mpirun was effectively
>> limiting each process to two threads.
>>
>> I am curious how many people are aware of OpenMPI's NUMA locking
>> support, as I suspect many people are underutilizing their hardware.
>>
>> Cheers,
>> BR
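The hyperthreading guess above is easy to check on Linux; a minimal sketch (lscpu ships with util-linux and should be present on both CentOS 7 and Fedora 23):

  lscpu | grep -E 'Thread|Core|Socket'

With hyperthreading enabled this reports 2 threads per core, so a process pinned to a single core sees exactly two hardware threads, matching the ~200% figure from 'top'.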
>>
>>
>> On Thu, 2016-04-28 at 19:19 +0000, Craig Yoshioka wrote:
>>> If using OpenMPI compiled with NUMA locking support (likely the
>>> default), MPI processes get bound to their assigned cores
>>> automatically. This is to prevent context switches and cache misses.
>>> If you are invoking via `mpirun`, I believe the correct flag is
>>> `mpirun --bind-to none`... or you can let MPI know that each process
>>> is going to use more than a single thread using `--map-by` or some
>>> other flag.
>>>
>>>
>>>
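Applied to the command shown further down the thread, the two approaches would look roughly as follows (--bind-to none is confirmed above; the --map-by slot:PE=N form, which reserves N processing elements per rank, is from the OpenMPI 1.8+ man page and should be checked against the installed version):

  # Option 1: disable binding and let the kernel place the threads
  mpirun --bind-to none -n 5 relion_refine_mpi --j 4 [other args]

  # Option 2: keep binding, but give each rank 4 cores for its 4 threads
  mpirun --map-by slot:PE=4 -n 5 relion_refine_mpi --j 4 [other args]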
>>>>
>>>> On Apr 28, 2016, at 12:06 PM, Sjors Scheres <[log in to unmask].UK> wrote:
>>>>
>>>> One more thing: you could give the program A LOT to calculate for
>>>> each particle, just as a test of whether access to the images is
>>>> somehow a bottleneck. You could do this by increasing the angular
>>>> sampling 4-fold.
>>>> HTH,
>>>> S
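In RELION that test amounts to raising the --healpix_order flag relative to the run shown below; a sketch (the 4-fold figure follows from HEALPix geometry, where each order increment roughly quadruples the number of sampled orientations):

  relion_refine_mpi ... --healpix_order 4   # the run below uses --healpix_order 3

If iteration time then grows by roughly the same factor, the run is compute-bound rather than I/O-bound.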
>>>>>
>>>>> Hi Sjors,
>>>>>
>>>>> I agree I am being hampered by something. My workstation has dual
>>>>> 10-core Xeon processors, which with hyperthreading should give me a
>>>>> total of 40 threads. Additionally it has 128GB of RAM and an
>>>>> 8-drive RAID 5 array giving over 1GB/s throughput, which according
>>>>> to 'top' isn't being taxed at all.
>>>>>
>>>>> An interesting observation is that when I use relion_refine, it has
>>>>> no problem running more than 2 threads (4, 8, 16, etc., verified
>>>>> using 'top'). However, when I use relion_refine_mpi, I am stuck at
>>>>> 2 threads per MPI process.
>>>>>
>>>>> Some additional info:
>>>>> 1) I have seen this issue on two different systems, running CentOS 7
>>>>> and Fedora 23 with 20 and 4 cores respectively.
>>>>> 2) I have seen the issue with my own compiled version of RELION 1.4
>>>>> and with SBGrid.org's compiled version.
>>>>> 3) I suspect this might be a reason the program is not scaling as
>>>>> well as we would like on our cluster, a Cray XE6 system.
>>>>>
>>>>> Cheers,
>>>>> Bharat Reddy
>>>>> Post Doc
>>>>> University of Chicago
>>>>>
>>>>>
>>>>> On Thu, 2016-04-28 at 10:10 +0100, Sjors Scheres wrote:
>>>>>>
>>>>>> Hi again,
>>>>>> The program is actually using 4 threads (as seen from the stdout).
>>>>>> The fact that top runs at 200% means that your threads are hampered
>>>>>> by something else. This could, for example, be the reading of
>>>>>> particles from the hard disk, which can become a bottleneck. Also:
>>>>>> how many cores does nsit-dhcp-148-090.bsd.uchicago.edu have? You're
>>>>>> running 4 MPI slaves, each with 4 threads, and the master also
>>>>>> takes 1 core. Therefore, your machine should have 17 cores to do
>>>>>> everything you ask for. If it has fewer cores, the threads will
>>>>>> just be in each other's way.
>>>>>> HTH,
>>>>>> Sjors
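The same arithmetic gives a quick sizing rule; a sketch, assuming RELION's scheme of one master plus (N_mpi - 1) working slaves:

  cores needed = (N_mpi - 1) * N_threads + 1
               = (5 - 1) * 4 + 1 = 17

On the 20-core (40-hyperthread) workstation described above, 17 fits comfortably, which points back at core binding rather than oversubscription as the culprit.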
>>>>>>
>>>>>> On 04/27/2016 08:18 PM, Baru Reddy wrote:
>>>>>>>
>>>>>>>
>>>>>>> Hi Sjors,
>>>>>>> 'top' says each MPI process is running at ~200%; that ~200% is the
>>>>>>> basis for my claim that each process is only using 2 threads. The
>>>>>>> command I use and the initial output I get are shown below.
>>>>>>>
>>>>>>> mpirun -n 5 ~/Downloads/relion-1.4/bin/relion_refine_mpi --o Class3D/run1_ct5 --continue Class3D/run1_it005_optimiser.star --iter 25 --tau2_fudge 4 --solvent_mask proteasome_mask_150.mrc --oversampling 1 --healpix_order 3 --offset_range 5 --offset_step 2 --j 4 &
>>>>>>>
>>>>>>> [reddybg@nsit-dhcp-148-090 gauto]$  === RELION MPI setup ===
>>>>>>>   + Number of MPI processes             = 5
>>>>>>>   + Number of threads per MPI process   = 4
>>>>>>>   + Total number of threads therefore   = 20
>>>>>>>   + Master  (0) runs on host            = nsit-dhcp-148-090.bsd.uchicago.edu
>>>>>>>   + Slave     1 runs on host            = nsit-dhcp-148-090.bsd.uchicago.edu
>>>>>>>   + Slave     2 runs on host            = nsit-dhcp-148-090.bsd.uchicago.edu
>>>>>>>   + Slave     3 runs on host            = nsit-dhcp-148-090.bsd.uchicago.edu
>>>>>>>   + Slave     4 runs on host            = nsit-dhcp-148-090.bsd.uchicago.edu
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Bharat Reddy
>>>>>>> Post Doc
>>>>>>> University of Chicago
>>>>>>>
>>>>>>> From: Sjors Scheres <[log in to unmask]>
>>>>>>> To: Baru Reddy <[log in to unmask]>
>>>>>>> Cc: [log in to unmask]
>>>>>>> Sent: Wednesday, April 27, 2016 2:08 PM
>>>>>>> Subject: Re: [ccpem] Stuck at 2 Threads per MPI Process
>>>>>>>
>>>>>>> Hi Bharat,
>>>>>>> --j N should always launch N threads. You'll only see them as 1
>>>>>>> process in 'top', but that process may run at up to ~N00%. Why do
>>>>>>> you say relion launches only 2 threads? How do you see this? Does
>>>>>>> it say so in the stdout?
>>>>>>> S
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Everyone,
>>>>>>>> Currently we are trying to mobilize the power of threads, as our
>>>>>>>> refinements have become more memory intensive and we have hit a
>>>>>>>> limit on the number of MPI processes we can deploy. The problem
>>>>>>>> is that however many threads I tell relion_refine_mpi to use
>>>>>>>> (--j X where X is 4, 8, 16, etc.), it only uses two threads. Is
>>>>>>>> there a setting I am missing, a variable I am failing to define,
>>>>>>>> or is this a limit of relion_refine_mpi?
>>>>>>>> Cheers,
>>>>>>>> Bharat Reddy
>>>>>>>> Post Doc
>>>>>>>> University of Chicago
>>>>
>>>> --
>>>> Sjors Scheres
>>>> MRC Laboratory of Molecular Biology
>>>> Francis Crick Avenue, Cambridge Biomedical Campus
>>>> Cambridge CB2 0QH, U.K.
>>>> tel: +44 (0)1223 267061
>>>> http://www2.mrc-lmb.cam.ac.uk/groups/scheres


-- 
Sjors Scheres
MRC Laboratory of Molecular Biology
Francis Crick Avenue, Cambridge Biomedical Campus
Cambridge CB2 0QH, U.K.
tel: +44 (0)1223 267061
http://www2.mrc-lmb.cam.ac.uk/groups/scheres