Hi Clara,
Are you guys running one MPI process per GPU in your example below? If
so, my point is that if you have enough GPU memory to spare, you should
try running more than one MPI process per GPU to get the most out of
your hardware. Do you guys have access to 16/18/20/22/24-core CPUs to
test your 8-GPU nodes with? I am curious how much faster it would run
with two (or more, depending on the CPU) MPI processes per GPU.
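
Just to illustrate what I mean (this is a hypothetical command skeleton,
not one of our production runs; the "..." stands for your usual refine
options, and it assumes the extra ranks still fit in GPU memory):

    # 1 master + 16 workers on an 8-GPU node, i.e. two MPI processes per
    # card; the colon-separated list maps each worker rank to a device
    mpirun -n 17 relion_refine_mpi ... --j 2 \
        --gpu "0:1:2:3:4:5:6:7:0:1:2:3:4:5:6:7"
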
That being said, your setup is already running the standard relion
benchmark pretty quickly. However, we are collecting millions of
particles in some of our datasets, and while small potential performance
gains might not mean much in benchmarks, they can add up to many hours
of computing time with our non-ideal data sets.
Cheers,
BR
On Fri, 2017-05-12 at 15:44 -0700, Dr. Clara Cai wrote:
> Dear Bharat
>
> In our testing with an 8-GPU machine, dual 12-core CPUs are actually
> okay for balancing the load for 8x GTX1080Ti (below are our iteration
> results for 3D classification). Your concern about the physical core
> count is definitely valid, but we believe the workload still balances
> out because some threads are doing I/O or waiting for GPU results, and
> this is a case where Intel Hyper-Threading does work all right.
>
> The threading parameter j could be reduced to 3 for a better match
> with the physical core count of 24, but we did not see any improvement
> over using j=4. Of course, setting j too high is not recommended, as
> it will oversubscribe the CPUs and incur a penalty.
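>
> As a concrete sketch of the layout we are describing (a hypothetical
> command skeleton only; the actual refine options are omitted):
>
>   # 1 master + 8 worker ranks, one rank per GPU; --j 3 matches the 24
>   # physical cores (8 x 3), while --j 4 also draws on Hyper-Threaded
>   # cores (8 x 4 = 32 of the 48 hardware threads)
>   mpirun -n 9 relion_refine_mpi ... --gpu "" --j 3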
>
> Best regards,
>
> -Clara
>
>
> 8x GTX1080Ti GPU server
> Apr 20 22:36 timer_start
> Apr 20 22:39 class3d_it000_model.star
> Apr 20 22:42 class3d_it001_model.star
> Apr 20 22:44 class3d_it002_model.star
> Apr 20 22:46 class3d_it003_model.star
> Apr 20 22:48 class3d_it004_model.star
> Apr 20 22:50 class3d_it005_model.star
> Apr 20 22:52 class3d_it006_model.star
> Apr 20 22:55 class3d_it007_model.star
> Apr 20 22:57 class3d_it008_model.star
> Apr 20 22:59 class3d_it009_model.star
> Apr 20 23:02 class3d_it010_model.star
> Apr 20 23:04 class3d_it011_model.star
> Apr 20 23:07 class3d_it012_model.star
> Apr 20 23:09 class3d_it013_model.star
> Apr 20 23:12 class3d_it014_model.star
> Apr 20 23:15 class3d_it015_model.star
> Apr 20 23:17 class3d_it016_model.star
> Apr 20 23:20 class3d_it017_model.star
> Apr 20 23:23 class3d_it018_model.star
> Apr 20 23:26 class3d_it019_model.star
> Apr 20 23:29 class3d_it020_model.star
> Apr 20 23:31 class3d_it021_model.star
> Apr 20 23:34 class3d_it022_model.star
> Apr 20 23:37 class3d_it023_model.star
> Apr 20 23:40 class3d_it024_model.star
> Apr 20 23:43 class3d_it025_model.star
>
> On Fri, May 12, 2017 at 3:09 PM, Bharat Reddy
> <000009a7465b91d2-dmarc[log in to unmask]> wrote:
> > Hi Nicolas,
> >
> > Your results are rather slow if you are using the official relion
> > benchmark dataset; please confirm that that is the dataset you are
> > running. Also, as Clara mentioned, please confirm you are using the
> > --gpu option in your command.
> >
> > You said in your previous email that you are using two E5-2650 v4
> > CPUs, which means you only have 24 real cores. Regardless of the
> > number of threads you request, relion often only uses the power of
> > about two threads per MPI process (running the program `top` rarely
> > shows more than 200% CPU usage per relion MPI process). Given that
> > each MPI process effectively needs only about 2 cores, this is
> > likely why you see a slowdown on your 8-GPU jobs between 9 and 17
> > MPI processes: with 17 MPI processes you are requesting more threads
> > than you have physical cores. Even though hyperthreading gives you a
> > nominal capacity of 48 threads, there is a performance penalty for
> > switching a process from core to core. Hyperthreading is like
> > sprinkles on a cake: they do not make the cake taste much better,
> > but they do make it look nicer. So while running two (or more) MPI
> > processes per GPU will increase GPU utilization, if you don't have
> > the CPU cores to run those processes, you will often see a net
> > decrease in overall performance.
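> >
> > (A quick, hypothetical way to check this yourself: start a run and
> > watch the %CPU column of the relion ranks, e.g.
> >
> >   top -c -p $(pgrep -d, -f relion_refine)
> >
> > each rank rarely climbs much above 200%, i.e. about two busy cores.)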
> >
> > My recommendation would be to upgrade the CPUs in your 8-GPU system
> > to 16-core models so that you have at least 4 cores per GPU. You can
> > see the benefit of this in your 5- vs 9-MPI runs when using only 4
> > GPUs: with 9 MPI processes you are likely using 4 cores per GPU and
> > probably coming close to taxing your GPUs at 100% (especially in 3D
> > classification). That being said, 16-core CPUs are a significant
> > cost increase, which is why we settled on 4-GPU nodes, where 8-core
> > CPUs can be found at a quarter of the cost.
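> >
> > As a rough worked example of the core budget (assuming the master
> > rank is mostly idle and each worker rank keeps roughly 2 cores busy,
> > so 4 cores per GPU leaves headroom for two worker ranks per card):
> >
> >   8-GPU node, dual 16-core CPUs: 2 x 16 = 32 cores / 8 GPUs = 4 cores/GPU
> >   4-GPU node, dual 8-core CPUs:  2 x 8  = 16 cores / 4 GPUs = 4 cores/GPU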
> >
> > Cheers,
> > BR
> >
> > On Fri, 2017-05-12 at 15:24 +0000, Coudray, Nicolas wrote:
> > > Hi,
> > >
> > > To follow up with the tests of Relion on our 8-GPU Titan X Pascal
> > > node, below are the results on the benchmark data set (all done
> > > with "--dont_combine_weights_via_disc --no_parallel_disc_io
> > > --preread_images --pool 100"):
> > >
> > > 2D classification:
> > > * 4GPUs, 5 MPIs, 6 threads: 9h23
> > > * 4GPUs, 5 MPIs, 12 threads: 9h27
> > > * 4GPUs, 9 MPIs, 6 threads: 7h10
> > > * 4GPUs, 12 MPIs, 3 threads: 6h34
> > >
> > > * 8GPUs, 5 MPIs, 12 threads: 5h36
> > > * 8GPUs, 9 MPIs, 6 threads: 5h17
> > > * 8GPUs, 17 MPIs, 3 threads: 6h26
> > >
> > > 3D classification:
> > > * 4GPUs, 5 MPIs, 6 threads: 3h36
> > > * 4GPUs, 5 MPIs, 12 threads: 3h40
> > > * 4GPUs, 9 MPIs, 6 threads: 2h56
> > > * 4GPUs, 12 MPIs, 3 threads: 3h01
> > >
> > > * 8GPUs, 5 MPIs, 12 threads: 2h51
> > > * 8GPUs, 9 MPIs, 6 threads: 2h53
> > > * 8GPUs, 17 MPIs, 3 threads: 3h26
> > >
> > >
> > >
> > > The impact of the MPI/thread combination is quite different from
> > > what I expected (little gain on 8 GPUs when moving from 5 MPIs +
> > > 12 threads to 9 MPIs + 6 threads, for example). If you have
> > > suggestions or comments that would improve the performance, please
> > > let us know.
> > >
> > >
> > > @Dr Clara Cai: in one of your previous messages, you mentioned
> > > that your 3D classification on that benchmark run was completed in
> > > 67 min on your 8-GPU machine. That's impressive and quite a
> > > difference from our results. I would be interested in knowing more
> > > about your settings and the specifications of your GPUs to figure
> > > out the differences with our machine.
> > >
> > > Thanks in advance,
> > > Best,
> > > Nicolas
> > >
> > >
> > > From: Collaborative Computational Project in Electron
> > > cryo-Microscopy [[log in to unmask]] on behalf of Dr. Clara Cai
> > > [marketing@SINGLEPARTICLE.COM]
> > > Sent: Friday, May 05, 2017 3:21 PM
> > > To: [log in to unmask]
> > > Subject: Re: [ccpem] Relion - Tests on a 8 GPU node
> > >
> > > Dear Weiwei
> > >
> > > The attached paper is from 2002. Intel Hyper-Threading and Turbo
> > > Boost work quite efficiently as long as there is no
> > > oversubscription. We have benchmarked RELION2 with and without HT
> > > and saw very close results. We'd argue that there is no point in
> > > disabling HT, as long as you understand that the system has HT
> > > turned on and that some of the cores are virtual cores.
> > >
> > > As to your question about which CPUs to choose: as long as you
> > > have at least 2 physical cores per GPU, the decision is really
> > > about your budget. With the CPU pricing model, you will need to
> > > pay a lot for the extra 5-10% performance of higher-end models,
> > > and with most of the processing on GPUs, you will see only a
> > > minimal boost in overall RELION performance.
> > >
> > > Best regards,
> > >
> > > -Clara
> > >
> > > Dr. Clara Cai
> > > SingleParticle.com
> > > Turnkey GPU workstations/clusters for cryoEM
> > >
> > > On Fri, May 5, 2017 at 10:01 AM, Weiwei Wang
> > > <[log in to unmask]eller.edu> wrote:
> > > > Hi All,
> > > >
> > > > I noticed that hyper-threading is enabled in Nicolas'
> > > > configuration. Attached is a discussion on the use of
> > > > hyper-threading in HPC (it appears to be by Dell, found via
> > > > Google). I wonder whether, when running Relion2 with GPUs,
> > > > hyper-threading makes any difference, maybe in the optimization
> > > > of CPU/GPU calculation and data transfer? And a related
> > > > question: how much CPU power is needed to not be a bottleneck
> > > > when running with 8 fast GPUs like Titans or 1080s (some
> > > > double-precision calculations are still performed on the CPU,
> > > > right?)? Thanks a lot for any suggestions!
> > > >
> > > > Best,
> > > > Weiwei
> > > >
> > > > From: Collaborative Computational Project in Electron
> > > > cryo-Microscopy <[log in to unmask]> on behalf of Ali
> > > > Siavosh-Haghighi <[log in to unmask]>
> > > > Sent: Friday, May 5, 2017 12:01 PM
> > > > To: [log in to unmask]
> > > >
> > > > Subject: Re: [ccpem] Relion - Tests on a 8 GPU node
> > > >
> > > > Hi All,
> > > > At the same time, I should add that all the memory on the cards
> > > > fills up (11 GB per card, for either 8-GPU or 4-GPU assignments).
> > > > ===========================================
> > > > Ali Siavosh-Haghighi, Ph.D.
> > > > HPC System Administrator
> > > > High Performance Computing Facility
> > > > Medical Center Information Technologies
> > > > NYU Langone Medical Center
> > > > Phone: (646) 501-2907
> > > > http://wiki.hpc.med.nyu.edu/
> > > > ===========================================
> > > >
> > > >
> > > > > On May 5, 2017, at 11:36 AM, Coudray, Nicolas
> > > > > <Nicolas.Coudray@nyumc.org> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > Thank you all for your feedback!
> > > > >
> > > > > We will let you know the results on the benchmark dataset
> > > > > ASAP.
> > > > >
> > > > >
> > > > > Regarding the number of MPIs used, there was indeed a typo,
> > > > > and I did use "8 GPUs, 9 MPIs and 6 threads" in the last run
> > > > > of each job (except auto-picking, where I used 8 MPIs).
> > > > >
> > > > > As for the mapping, this is what I did (one case is expanded
> > > > > into a full command after the list):
> > > > > for 2 GPUs, 3 MPIs, 24 threads: --gpu "0:1"
> > > > > for 4 GPUs, 3 MPIs, 24 threads: --gpu "0,1:2,3"
> > > > > for 4 GPUs, 5 MPIs, 24 threads: --gpu "0:1:2:3"
> > > > > for 8 GPUs: --gpu ""
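> > > > >
> > > > > Expanded into a full command, the 4-GPU / 5-MPI case would look
> > > > > roughly like this (a hypothetical skeleton; colons separate the
> > > > > worker ranks, commas list the devices given to a single rank,
> > > > > and "..." stands for the remaining refine options):
> > > > >
> > > > >   mpirun -n 5 relion_refine_mpi ... --j <threads> --gpu "0:1:2:3" \
> > > > >       --dont_combine_weights_via_disc --no_parallel_disc_io \
> > > > >       --preread_images --pool 100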
> > > > >
> > > > > So for all the tests with 8 GPUs, I did let Relion figure it
> > > > > out, so the loss of performance when going from 5 to 9 MPIs on
> > > > > 8 GPUs is puzzling to me.
> > > > >
> > > > >
> > > > > We also noticed that for some jobs the GPUs are not reaching
> > > > > 100% utilization. Regarding this, Bharat, you mentioned that
> > > > > you often run 2 MPI processes with at least 2 threads per GPU.
> > > > > Does that only increase the GPU utilization percentage, or do
> > > > > you also see a corresponding speed improvement?
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > BTW, regarding the CPUs we've been using, these are the
> > > > > specifications:
> > > > > CPU(s): 48
> > > > > On-line CPU(s) list: 0-47
> > > > > Thread(s) per core: 2
> > > > > Core(s) per socket: 12
> > > > > Socket(s): 2
> > > > > NUMA node(s): 2
> > > > > Vendor ID: GenuineIntel
> > > > > CPU family: 6
> > > > > Model: 79
> > > > > Model name: Intel(R) Xeon(R) CPU E5-2650 v4 @
> > 2.20GHz
> > > > > Stepping: 1
> > > > > CPU MHz: 2499.921
> > > > > BogoMIPS: 4405.38
> > > > > NUMA node0 CPU(s): 0-11,24-35
> > > > > NUMA node1 CPU(s): 12-23,36-47
> > > > >
> > > > > Thanks,
> > > > > Best
> > > > > Nicolas
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > ________________________________________
> > > > > From: Bjoern Forsberg [[log in to unmask]]
> > > > > Sent: Friday, May 05, 2017 4:51 AM
> > > > > To: Coudray, Nicolas; [log in to unmask]
> > > > > Subject: Re: [ccpem] Relion - Tests on a 8 GPU node
> > > > >
> > > > > Hi Nicolas,
> > > > >
> > > > > Just to clarify, is 8 MPIs a typo? I see you ran 3 and 5 MPIs
> > > > > on some runs with 2 and 4 GPUs, presumably to accommodate the
> > > > > master rank. So I would expect you ran 9 MPIs on some of the
> > > > > 8-GPU runs, right? One reason I'm asking is that I would expect
> > > > > performance to increase between these runs, e.g.:
> > > > >
> > > > > 8 GPUs, 5 MPIs, 12 threads: 4h15
> > > > > 8 GPUs, 8 MPIs, 6 threads: 8h47
> > > > >
> > > > > But you see a detrimental loss of performance. If you did
> > > > > actually run 8 MPIs, that might be why. Running
> > > > >
> > > > > 8 GPUs, 9 MPIs, 6 threads: ?
> > > > >
> > > > > should be interesting, and more relevant for performance, in
> > > > > that case.
> > > > >
> > > > > Also, did you specify --gpu without any device numbers in all
> > > > > cases? If you did specify GPU indices, performance is fairly
> > > > > sensitive to how well you mapped the ranks and threads to GPUs.
> > > > > This is why we typically advise *not* specifying which GPUs to
> > > > > use and letting relion figure it out on its own, unless you
> > > > > want/need to specify something in particular.
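> > > > >
> > > > > For example (a hypothetical skeleton, not one of your actual
> > > > > commands; "..." stands for the remaining options):
> > > > >
> > > > >   # let relion distribute the worker ranks over all visible GPUs
> > > > >   mpirun -n 9 relion_refine_mpi ... --j 6 --gpu ""
> > > > >
> > > > >   # versus pinning each worker rank to a device explicitly
> > > > >   mpirun -n 9 relion_refine_mpi ... --j 6 --gpu "0:1:2:3:4:5:6:7"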
> > > > >
> > > > > Thanks for sharing!
> > > > >
> > > > > /Björn
> > > > >
> > > > > On 05/04/2017 10:13 PM, Coudray, Nicolas wrote:
> > > > > > Hi all,
> > > > > >
> > > > > >
> > > > > > We have been running and testing Relion 2.0 on our 8-GPU
> > > > > > nodes to try to figure out the optimal parameters. We thought
> > > > > > these results might be interesting to share, and we are
> > > > > > looking for any suggestions / comments / similar tests that
> > > > > > you could provide.
> > > > > >
> > > > > >
> > > > > >
> > > > > > Our configuration is:
> > > > > > 8-GPU node, 48 slots, TITAN X (Pascal, 750 GB RAM, 11 GB
> > > > > > on-card), Relion compiled with gcc 4.8.5, kernel 3.10,
> > > > > > CentOS 7.3, SSD hard drive of 4 TB
> > > > > >
> > > > > > At first, we only varied the number of GPUs, MPIs and
> > > > > > threads, leaving the other disk access options constant (no
> > > > > > parallel disc I/O, particles pre-read into RAM, no "combine
> > > > > > iterations"). The results for each type of job are as
> > > > > > follows:
> > > > > >
> > > > > > *** 2D Classification (265k particles of 280x280 pixels, 5
> > > > > > rounds, 50 classes):
> > > > > > 2 GPUs, 3 MPIs, 24 threads: 14h04
> > > > > > 4 GPUs, 3 MPIs, 24 threads: 5h23
> > > > > > 4 GPUs, 5 MPIs, 12 threads: 13h28
> > > > > > 8 GPUs, 3 MPIs, 24 threads: 3h14
> > > > > > 8 GPUs, 5 MPIs, 12 threads: 5h10
> > > > > > 8 GPUs, 8 MPIs, 6 threads: 13h28
> > > > > >
> > > > > >
> > > > > > *** 3D Classification (226k particles of 280x280 pixels, 5
> > > > > > rounds, 5 classes):
> > > > > > 2 GPUs, 3 MPIs, 24 threads: 15h17
> > > > > > 4 GPUs, 3 MPIs, 24 threads: 5h53
> > > > > > 4 GPUs, 5 MPIs, 12 threads: 8h11
> > > > > > 8 GPUs, 3 MPIs, 24 threads: 2h48
> > > > > > 8 GPUs, 5 MPIs, 12 threads: 3h16
> > > > > > 8 GPUs, 8 MPIs, 6 threads: 4h37
> > > > > >
> > > > > >
> > > > > > *** 3D Refinement (116k particles of 280x280 pixels):
> > > > > > 2 GPUs, 3 MPIs, 24 threads: 12h07
> > > > > > 4 GPUs, 3 MPIs, 24 threads: 4h54
> > > > > > 4 GPUs, 5 MPIs, 12 threads: 9h12
> > > > > > 8 GPUs, 3 MPIs, 24 threads: 4h57
> > > > > > 8 GPUs, 5 MPIs, 12 threads: 4h15
> > > > > > 8 GPUs, 8 MPIs, 6 threads: 8h47
> > > > > >
> > > > > >
> > > > > > *** Auto-picking (on 2600 micrographs, generating around
> > > > > > 750k particles):
> > > > > > 0 GPU , 48 threads: 79 min
> > > > > > 2 GPUs, 2 threads: 52 min
> > > > > > 2 GPUs, 48 threads: error (code 11)
> > > > > > 4 GPUs, 4 threads: 27 min
> > > > > > 4 GPUs, 48 threads: 8 min
> > > > > > 8 GPUs, 8 threads: 19 min
> > > > > > 8 GPUs, 48 threads: 12 min
> > > > > >
> > > > > >
> > > > > >
> > > > > > Does anyone have particular suggestions / feedback?
> > > > > >
> > > > > > We are using these tests to guide the expansion of our
> > > > > > centralized GPU capability and any comment is greatly
> > > > > > welcome.
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Best,
> > > > > >
> > > > > > Nicolas Coudray
> > > > > > New York University
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> >
>
>