CCPEM Archives - CCPEM@JISCMAIL.AC.UK - March 2017

Subject: Re: geforce vs tesla
From: Bharat Reddy <[log in to unmask]>
Reply-To: Bharat Reddy <[log in to unmask]>
Date: Mon, 6 Mar 2017 11:51:12 -0600
Content-Type: text/plain
Parts/Attachments: text/plain (180 lines)

Hi Dominik,

Did you look at the clock speeds of your GPUs during the run? The fact
that you said you had 80 °C peaks leads me to believe your GPUs might
be throttling themselves. The consumer Founders Edition cards I've
used have horrible default fan settings: they prioritize noise over
performance. This in turn causes your cards to start throttling and can
hurt your performance by 5-20%. To bypass the default GPU fan settings
you need to enable Coolbits. The problem is that Coolbits only works on
screens that have X sessions running on them. The following GitHub
project by Boris Dimitrov, based on the work of Axel Kohlmeyer, solves
this: https://github.com/boris-dimitrov/set_gpu_fans_public . It sets
up dummy screens and X sessions on each GPU and has a nifty script that
automatically ramps the GPU fans up and down as needed based on GPU
temperature. I use this on all our workstations and it helps keep the
GPUs running at nearly top speed at all times and the overall
temperature below 78 °C. The only drawback is noise.
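
The manual equivalent of what that script automates looks roughly like
this (a sketch only, not the repository's script; nvidia-settings
attribute names can differ between driver versions, and the commands
must talk to the X display attached to the GPU):

  # Enable Coolbits (manual fan control) in the X configuration,
  # creating dummy screens for headless GPUs, then restart X:
  sudo nvidia-xconfig --enable-all-gpus --cool-bits=4 --allow-empty-initial-configuration

  # Query the temperature of GPU 0 and set its fan speed by hand
  # (DISPLAY must point at the X session running on that GPU):
  export DISPLAY=:0
  temp=$(nvidia-smi --id=0 --query-gpu=temperature.gpu --format=csv,noheader,nounits)
  if [ "$temp" -ge 70 ]; then
      speed=85    # crude two-step curve; the real script ramps continuously
  else
      speed=60
  fi
  nvidia-settings -a "[gpu:0]/GPUFanControlState=1" \
                  -a "[fan:0]/GPUTargetFanSpeed=${speed}"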

Cheers,
BR

_________________________________
Bharat Reddy, Ph.D.
Perozo Lab, University of Chicago
Email: [log in to unmask]
Tel: (773) 834 - 4734
 


On Mon, 2017-03-06 at 17:12 +0100, Dominik A. Herbst wrote:
> Hi Benoit,
> 
> Approx. half a year ago we bought, together with our HPC team, six GPU
> nodes from DALCO (dual Xeon E5-2680, 512 GB RAM, 1.6 TB Intel NVMe
> PCIe SSD, 4x Titan X, Infiniband, 2U chassis, CentOS 7).
> Before we bought them we were running Relion2 jobs on 4x GTX 1080 GPU
> workstations (DALCO, Samsung M2 NVMe 950 Pro, i7-6900K, 64-128 GB
> RAM, 1 GBit/s Ethernet, CentOS 6) as described on Erik Lindahl's
> homepage.
> 
> I did plenty of benchmarking. All tests were done using the same data
> set and random seed. In all cases the Titan-X GPU nodes showed
> approx. 25 % higher performance, which is in agreement with the
> literature. (The workstations with 4x GTX 1080 had the performance of
> approx. 10 cluster nodes with dual Xeon E5-2650, 64 GB RAM, and
> Infiniband (~320 cores, no GPUs).)
> The 2U GPU-node chassis/boards provide 8 PCIe slots, of which 4 are
> used for Titan Xs and one for the scratch SSD. To check how
> performance scales with more Titan Xs, we equipped one node with 7.
> I ran benchmarks on the scratch SSD with the same random seed and the
> same data set (120,000 ptcls, 210 px box).
> 
> 
> Results for the usage of all possible resources for 7, 6 and 4 GPUs
> on one GPU node:
> (rpn = MPI ranks per node; ppr = processes per rank / threads; gpu =
> gpus)
> 
> rpn14_ppr2_gpu7: #28 slots
> real    23m46.751s
> user    309m5.491s
> sys    50m48.316s
> rpn8_ppr3_gpu7: #24 slots
> real    24m42.213s
> user    234m41.726s
> sys    41m33.935s
>  
> rpn13_ppr2_gpu6: #26 slots
> real    25m6.495s
> user    312m3.790s
> sys    52m24.046s
> rpn7_ppr4_gpu6: #28 slots
> real    27m9.320s
> user    255m1.701s
> sys    48m11.140s
> 
> rpn9_ppr3_gpu4: #27 slots
> real    32m54.921s
> user    370m40.358s
> sys    61m44.317s
> rpn5_ppr5_gpu4: #25 slots
> real    38m54.711s
> user    283m0.446s
> sys    55m42.250s
> 
> With more than 4 GPUs the GPUs were never running at full
> utilization (>90%), but rather in the range of 50-70%.
> Based on a direct comparison (real-time improvement):
> 
> GPUs used (relative)   Increase, 1 rank/GPU   Increase, 2 ranks/GPU
> 4 -> 6 (2 more)        30.2% (15%/GPU)        23.8% (12%/GPU)
> 4 -> 7 (3 more)        36.5% (12%/GPU)        27.8% (9%/GPU)
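> (For example, at 1 rank/GPU, going from 4 GPUs (rpn5_ppr5_gpu4,
> 38m54.7s) to 6 GPUs (rpn7_ppr4_gpu6, 27m9.3s):
> (38m54.7s - 27m9.3s) / 38m54.7s = 705.4 s / 2334.7 s ≈ 30.2 %.)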
> 
> This tells me that 6 GPUs scale better than 7 on this 28-core
> machine, which is why we plan to upgrade all nodes to 6 Titan Xs.
> 
> Surprisingly, the temperature was very moderate (~60-70 °C, 80 °C at
> the peak) despite the high packing density, but it might be that
> our HPC team did some chassis fan tuning.
> 
> Currently we are using the Univa Grid Engine, which comes with some
> problems for running hybrid SMP-MPI jobs, but it works.
> Unfortunately, UGE (/SGE) cannot run hybrid SMP-MPI-GPU jobs across
> several nodes, which limits a job request to one node.
> If you want to use GPUs on several nodes, Slurm is a better choice,
> and we will switch to it soon.
> 
> However, in our case we had severe issues with core binding of MPI
> processes. Often all of them were bound to the first cores, and even
> when a second job was started it ended up on the same cores (!!!),
> unless mpirun was started with the "--bind-to none" parameter.
> Furthermore, I recommend providing a $GPU_ASSIGN variable with your
> Relion2 (module) installation that generates the --gpu string from
> the SGE variables ($SGE_HGR_gpu_dev, $NSLOTS and $OMP_NUM_THREADS).
> If you like, I can provide you with the bash script.
> In my opinion this is particularly important, because if the --gpu
> X,x,x,x:Y,y,y,y:... parameter is not set, Relion2 will use ALL
> resources and distribute the job itself. This is particularly bad if
> a second job is started on the same node, because the two jobs will
> compete for the same resources, and once one job has taken all the
> GPU memory, the other job will die. Note that the --j and --gpu
> parameters work differently: --j takes only what you assign (perfect
> for queueing systems), while --gpu takes everything it can get unless
> you restrict it (not ideal for queueing systems).
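> 
> In outline, such a wrapper might look like the following (a sketch
> only, not the actual script; it assumes $SGE_HGR_gpu_dev holds the
> granted device IDs as a space-separated list and that rank 0 is the
> master, which uses no GPU):
> 
>   #!/bin/bash
>   # Sketch: build the Relion2 --gpu string from SGE variables.
>   read -ra DEVICES <<< "$SGE_HGR_gpu_dev"     # granted GPU IDs, e.g. "0 1 2 3"
>   NRANKS=$(( NSLOTS / OMP_NUM_THREADS ))      # MPI ranks for this job
>   NWORKERS=$(( NRANKS - 1 ))                  # rank 0 (master) uses no GPU
> 
>   GPU_ASSIGN=""
>   for (( i = 0; i < NWORKERS; i++ )); do
>       dev=${DEVICES[$(( i % ${#DEVICES[@]} ))]}   # round-robin over granted GPUs
>       GPU_ASSIGN+="$dev"
>       (( i < NWORKERS - 1 )) && GPU_ASSIGN+=":"
>   done
>   export GPU_ASSIGN
> 
>   # then, for example:
>   # mpirun --bind-to none -np $NRANKS relion_refine_mpi ... \
>   #     --j $OMP_NUM_THREADS --gpu "$GPU_ASSIGN"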
> 
> Concerning the OS, please note that the Nvidia drivers for the Pascal
> cards are not well supported by CentOS6/RHEL6 and you might want to
> switch to CentOS7/RHEL7.
> 
> I hope this helps!
> 
> Best,
> Dominik
> 
> 
> 
> 
> On 03/06/2017 12:00 PM, Benoît Zuber wrote:
> > Hi Daniel, Erki, and Masahide,
> >  
> > Thank you for your feedback!
> >  
> > Best
> > Benoit
> >  
> > From: "[log in to unmask]" <[log in to unmask]>
> > Date: Monday, 6 March 2017 at 08:29
> > To: Benoît Zuber <[log in to unmask]>
> > Subject: AW: geforce vs tesla
> >  
> > Hello Benoit,
> > 
> > ETH is currently setting up a cluster with NVidia GTX1080 GPUs for
> > big data (https://scicomp.ethz.ch/wiki/Leonhard). We could not test
> > it yet, but Relion2 should run on the GPU nodes. 
> > Best,
> > Daniel
> > From: Collaborative Computational Project in Electron cryo-
> > Microscopy [[log in to unmask]] on behalf of Benoît Zuber [be
> > [log in to unmask]]
> > Sent: Monday, 6 March 2017 06:58
> > To: [log in to unmask]
> > Subject: [ccpem] geforce vs tesla
> > 
> > Hello,
> >  
> > Our HPC cluster team is collecting wishes before building a new
> > GPU cluster. They are considering Tesla or GeForce cards.
> > With the new 1080 Ti card and its 11 GB of RAM, is there any reason
> > to go for Tesla cards when considering performance for Relion 2?
> >  
> > Thanks for your input
> > Benoit
> >  
>  
