CCPEM Archives
CCPEM@JISCMAIL.AC.UK
Subject: Re: geforce vs tesla
From: Ross Walker <[log in to unmask]>
Reply-To: Ross Walker <[log in to unmask]>
Date: Mon, 6 Mar 2017 16:00:05 -0700


Hi Bharat,

Which versions of the 780s do you have? Are they the reference design or some third-party OEM design? Experience has shown that the reference design cards are much more reliable long term than the various 'gamer-mod' cards that one can buy. The latter tend to have far cheaper components, plastic fans, etc. I think the standard warranty on GeForce cards from the likes of Amazon is either 90 days or 1 year depending on the card manufacturer - it may be 2 years in the EU. So after 3 years you would be out of luck.

If bought from a vendor as a warrantied system you'd normally be good for 3 years by default. The Exxact machines, for example, are 3 years return-to-base by default, so one would just send back the faulty cards and get replacements. One can pay more for next-business-day onsite service or for an extended, say 5-year, warranty. That's probably the primary benefit of buying prebuilt machines from a vendor (one with a long history of doing business in the field) rather than building them yourself from parts, at which point you'd be on your own warranty-wise.

Note the Titan cards - that is, Titan, Titan-Black, Titan-X and Titan-XP - tend to have much better longevity. I know of several large pharma companies running big clusters of the original Titan cards, and they've seen on the order of 1 or 2 failures out of 60+ cards over 3 years. The difference is that the Titan cards are made on the same production line as the Tesla cards, and to much more exacting specifications. This is very different from the 780, 980 and 1080 series, where different OEMs can change everything from memory spec to clock speeds, heat sinks, etc. They look cool with those huge fans and fancy UV lights, but ultimately that all just ups the failure rate.

My advice, having been designing and building GeForce-based workstations and clusters for a decade now, is to only ever use the reference design (now called Founders Edition) cards. Everything else is just gambling, reliability-wise.

All the best
Ross


> On Mar 6, 2017, at 15:21, Bharat Reddy <[log in to unmask]> wrote:
> 
> Hi All,
> 
> We have an old 20+ node GPU cluster built from 4x Nvidia 780 nodes, and now
> that they are about 3+ years old we are constantly having the GPU cards die
> on us on a monthly if not weekly basis. How is the reliability of the
> Nvidia Titans? Have any died so far? If so, how was your vendor's support
> with the Titans?
> 
> Cheers,
> BR  
> 
> _________________________________
> Bharat Reddy, Ph.D.
> Perozo Lab, University of Chicago
> Email: [log in to unmask]
> Tel: (773) 834 - 4734
> 
> On Mon, 2017-03-06 at 22:12 +0100, Dominik A. Herbst wrote:
>> Hi all,
>> True, a workstation is much cheaper, but is it used more efficiently?
>> We started with three workstations, one for each person who was processing
>> data at that time. If you wanted to start several jobs in parallel, you had
>> to start them individually on every machine, and sometimes you also had to
>> ask the person who usually uses it for permission, because he/she might
>> want to use it too. In the end this was very inefficient, because there
>> were always unused resources and you always had to figure out where you
>> could start a job, unless you were starting them on your own workstation
>> (one after the other).
>> Since we got the six GPU nodes, the workstations are used mostly for
>> testing or as normal desktop workstations, because it is much easier to
>> send a job to the queue than to figure out where to start jobs yourself.
>> Now, our six nodes serve the needs of 3 research groups. They run 24/7 and
>> the queue is rarely long. Moreover, you don't have to deal with noise and
>> heat in the office.
>> If you have a server room, I would always go for central resources that
>> can be used by everybody who needs them and that are managed by a queuing
>> system. That way you can serve the needs of 10 (?) people with 3-4 nodes,
>> which might even be cheaper than buying 10 workstations. Of course you
>> could build a mini-cluster from a bunch of workstations, but there are
>> obvious reasons why servers are more expensive than workstations.
>> On the other hand, if there are just a few people running jobs, an
>> individual workstation might be a better option.
>> In the end you have to select the right device for the right purpose.
>> Concerning our nodes, we are perfectly happy. Everybody can use them
>> without complications and they are heating the basement instead of
>> our offices.
>> Sorry, what I wrote might have been misleading. The TitanX cards ran at a
>> temperature between 60-70°C when I was running jobs. The highest
>> temperature I ever observed was an exceptional 80°C. I did not check the
>> clock speed, because they usually do not get hot. Some time ago we had a
>> case with a Tesla K80 node which was getting hot. Afterwards our HPC team
>> adjusted the chassis fan speed and they never got hot again.
>> 
>>> We are running Slurm and have found compiling OpenMPI against PMI2
>>> libraries essential for distributing MPI ranks across multiple GPU
>>> nodes.  Then the RELION commands are launched using srun --mpi=pmi2 
>>> rather than mpirun.
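>>> 
>>> A minimal sketch of what such a submission script could look like (the
>>> partition name, input files and job geometry below are placeholders, not
>>> taken from this thread):
>>> 
>>>   #!/bin/bash
>>>   #SBATCH --job-name=relion_refine
>>>   #SBATCH --partition=gpu          # placeholder partition name
>>>   #SBATCH --nodes=2                # spread ranks over two GPU nodes
>>>   #SBATCH --ntasks-per-node=5      # MPI ranks per node
>>>   #SBATCH --cpus-per-task=4        # threads per rank (RELION --j)
>>>   #SBATCH --gres=gpu:4             # GPUs per node
>>> 
>>>   # Launch through srun with the PMI2 plugin instead of mpirun, so that
>>>   # Slurm (with OpenMPI built against PMI2) places ranks on both nodes.
>>>   # An empty --gpu string lets RELION distribute ranks over the visible
>>>   # GPUs; an explicit device string can be given instead.
>>>   srun --mpi=pmi2 relion_refine_mpi \
>>>       --i particles.star --o Refine3D/job001/run \
>>>       --j $SLURM_CPUS_PER_TASK --gpu ""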
>> 
>> @David and Craig: Thanks! That's good to know! We are going to switch
>> to Slurm in a couple of weeks. I am looking forward to it.
>> 
>> Best,
>> Dominik
>> 
>> On 03/06/2017 08:09 PM, Bharat Reddy wrote:
>>> Hi Ross,
>>> 
>>> I agree with the Exxact recommendation, but that being said, with the
>>> latest Nvidia GPU price drops you can build a 4x GTX 1080 GPU workstation
>>> for about $4-5K. At those prices, you can almost get 2 workstations for
>>> the price of one of those Exxact GPU nodes.
>>> 
>>> Cheers,
>>> BR 
>>> 
>>> On Mon, 2017-03-06 at 11:06 -0700, Ross Walker wrote:
>>>> Hi All,
>>>> 
>>>> I'll add here that you can also purchase ducted server cases for
>>>> these cards, rather than trying to use desktop cases for what are
>>>> really 24/7 servers, and avoid all of the issues associated with
>>>> having to mess with fan speeds etc.
>>>> 
>>>> For example these systems:
>>>> https://exxactcorp.com/index.php/solution/solu_detail/314 use ducted fans
>>>> (the same as are used for passive cards) and thus do a much better job of
>>>> cooling the GPUs and avoid most of the issues with clocking down, so you
>>>> don't need to worry about setting coolbits etc. They are VERY loud though,
>>>> so they should only really be considered for use in machine rooms.
>>>> Ultimately these systems are better on the GPUs though - those GPU fans
>>>> are not designed to run at 100% 24/7.
>>>> 
>>>> All the best
>>>> Ross
>>>> 
>>>>> On Mar 6, 2017, at 10:51, Bharat Reddy
>>>>> <000009a7465b91d2-dmarc-requ[log in to unmask]> wrote:
>>>>> 
>>>>> Hi Dominik,
>>>>> 
>>>>> Did you look at the clock speed of your GPUs during the run? The fact
>>>>> that you said you had 80°C peaks leads me to believe your GPUs might be
>>>>> throttling themselves. The consumer-grade Founders Edition cards I've
>>>>> used have horrible default fan settings. They prioritize noise over
>>>>> performance. This in turn causes your cards to start throttling and can
>>>>> hurt your performance by 5-20%. In order to bypass the default GPU fan
>>>>> settings you need to enable coolbits. The problem is that coolbits only
>>>>> works on screens that have X sessions running on them. To bypass this
>>>>> problem, the following GitHub project by Boris Dimitrov, based on the
>>>>> work of Axel Kohlmeyer, solves it:
>>>>> https://github.com/boris-dimitrov/set_gpu_fans_public . It sets up dummy
>>>>> screens and X sessions on each GPU and has a nifty script to
>>>>> automatically ramp the GPU fan up and down as needed based on GPU
>>>>> temperature. I used this on all our workstations and it helps keep the
>>>>> GPUs running at nearly top speed at all times and the overall
>>>>> temperature below 78°C. The only drawback is noise.
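>>>>> 
>>>>> A quick way to check for this kind of throttling, and a sketch of the
>>>>> usual manual coolbits route (exact nvidia-settings attribute names vary
>>>>> between driver versions; the script above automates the X-session part):
>>>>> 
>>>>>   # Watch temperature, SM clock and fan speed while a job runs; a
>>>>>   # falling clocks.sm at high temperature usually means throttling.
>>>>>   nvidia-smi --query-gpu=index,temperature.gpu,clocks.sm,fan.speed,utilization.gpu \
>>>>>              --format=csv -l 5
>>>>> 
>>>>>   # Manual route (needs an X session on each GPU, which is exactly what
>>>>>   # set_gpu_fans_public works around): enable coolbits, then set the fan.
>>>>>   sudo nvidia-xconfig --enable-all-gpus --cool-bits=4
>>>>>   nvidia-settings -a "[gpu:0]/GPUFanControlState=1" \
>>>>>                   -a "[fan:0]/GPUTargetFanSpeed=80"   # percent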
>>>>> 
>>>>> Cheers,
>>>>> BR
>>>>> 
>>>>> _________________________________
>>>>> Bharat Reddy, Ph.D.
>>>>> Perozo Lab, University of Chicago
>>>>> Email: [log in to unmask]
>>>>> Tel: (773) 834 - 4734
>>>>> 
>>>>> 
>>>>> 
>>>>> On Mon, 2017-03-06 at 17:12 +0100, Dominik A. Herbst wrote:
>>>>>> Hi Benoit,
>>>>>> 
>>>>>> Approx. half a year ago we, together with our HPC team, bought six GPU
>>>>>> nodes from DALCO (dual Xeon E5-2680, 512 GB RAM, 1.6 TB Intel NVMe PCIe
>>>>>> SSD, 4x Titan X, Infiniband, 2U chassis, CentOS 7).
>>>>>> Before we bought them we were running Relion2 jobs on 4x GTX 1080 GPU
>>>>>> workstations (DALCO, Samsung M.2 NVMe 950 Pro, i7-6900K, 64-128 GB RAM,
>>>>>> 1 GBit/s Ethernet, CentOS 6), as described on Erik Lindahl's homepage.
>>>>>> 
>>>>>> I did plenty of benchmarking. All tests were done using the same data
>>>>>> set and random seed. In all cases the Titan-X GPU nodes showed approx.
>>>>>> 25% higher performance, which is in agreement with the literature.
>>>>>> (The workstations with 4x GTX 1080 had roughly the performance of 10
>>>>>> CPU cluster nodes with dual Xeon E5-2650, 64 GB RAM and Infiniband
>>>>>> (~320 cores in total, no GPUs).)
>>>>>> The 2U GPU node chassis/boards provide 8 PCIe slots, of which 4 are
>>>>>> used for TitanXs and one for the scratch SSD. In order to check how
>>>>>> performance scales with more TitanXs, we equipped one node with 7 of
>>>>>> them. I ran benchmarks on the scratch SSD with the same random seed and
>>>>>> the same data set (120,000 ptcls, 210 px box).
>>>>>> 
>>>>>> 
>>>>>> Results when using all possible resources for 7, 6 and 4 GPUs on one
>>>>>> GPU node (rpn = MPI ranks per node; ppr = processes per rank, i.e.
>>>>>> threads; gpu = GPUs):
>>>>>> 
>>>>>> config             slots   real          user           sys
>>>>>> rpn14_ppr2_gpu7    28      23m46.751s    309m5.491s     50m48.316s
>>>>>> rpn8_ppr3_gpu7     24      24m42.213s    234m41.726s    41m33.935s
>>>>>> rpn13_ppr2_gpu6    26      25m6.495s     312m3.790s     52m24.046s
>>>>>> rpn7_ppr4_gpu6     28      27m9.320s     255m1.701s     48m11.140s
>>>>>> rpn9_ppr3_gpu4     27      32m54.921s    370m40.358s    61m44.317s
>>>>>> rpn5_ppr5_gpu4     25      38m54.711s    283m0.446s     55m42.250s
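>>>>>> 
>>>>>> (As an illustration of the naming only: rpn5_ppr5_gpu4 corresponds to
>>>>>> something like the command below, with rank 0 acting as the master and
>>>>>> the other four ranks getting one TitanX each; all other options omitted.)
>>>>>> 
>>>>>>   # 5 MPI ranks x 5 threads = 25 slots, one GPU per worker rank
>>>>>>   mpirun -n 5 relion_refine_mpi --j 5 --gpu "0:1:2:3" [other options]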
>>>>>> 
>>>>>> With more than 4 GPUs the GPUs were never running at full utilization
>>>>>> (>90%), but in a range of 50-70%.
>>>>>> Based on a direct comparison of the real (wall-clock) times:
>>>>>> 
>>>>>> GPUs used          increase (1 rank/GPU)   increase (2 ranks/GPU)
>>>>>> 4 --> 6 (2 more)   30.2% (15%/GPU)         23.8% (12%/GPU)
>>>>>> 4 --> 7 (3 more)   36.5% (12%/GPU)         27.8% (9%/GPU)
>>>>>> 
>>>>>> This tells me that 6 GPUs scale better than 7 on this 28-core machine,
>>>>>> which is why we plan to upgrade all nodes to 6 TitanXs.
>>>>>> 
>>>>>> Surprisingly, the temperature was very moderate (~60-70°C, 80°C at the
>>>>>> peak) despite the high packing density, but it might be that our HPC
>>>>>> team did some chassis fan tuning.
>>>>>> 
>>>>>> Currently we are using the Univa Grid Engine, which comes with some
>>>>>> problems for running hybrid SMP-MPI jobs, but it works. Unfortunately,
>>>>>> UGE (/SGE) cannot run hybrid SMP-MPI-GPU jobs across several nodes,
>>>>>> which limits your job request to one node.
>>>>>> If you want to use GPUs on several nodes, Slurm is a better choice and
>>>>>> we will switch to it soon.
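>>>>>> 
>>>>>> (For illustration, a single-node hybrid request under UGE/SGE might
>>>>>> look like the line below; the parallel environment name, the GPU
>>>>>> complex name and the job script are site-specific placeholders.)
>>>>>> 
>>>>>>   # 28 slots in a single-node PE plus 4 GPUs on that node
>>>>>>   qsub -pe smp 28 -l gpu=4 run_relion.sh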
>>>>>> 
>>>>>> However, in our case we had severe issues with core binding of MPI
>>>>>> processes. Often all of them were bound to the first cores; even when a
>>>>>> second job was started, it ended up on the same cores (!!!), unless
>>>>>> mpirun was started with the "--bind-to none" parameter.
>>>>>> Furthermore, I recommend providing a $GPU_ASSIGN variable with your
>>>>>> Relion2 (module) installation that generates the --gpu string from the
>>>>>> SGE variables ($SGE_HGR_gpu_dev, $NSLOTS and $OMP_NUM_THREADS). If you
>>>>>> like, I can provide you with the bash script.
>>>>>> In my opinion this is particularly important, because if the
>>>>>> --gpu X,x,x,x:Y,y,y,y:... parameter is not set, Relion2 will use ALL
>>>>>> resources and distribute the job itself. This is particularly bad if a
>>>>>> second job is started on the same node, because the two jobs will
>>>>>> compete for the same resources, and once one job has taken all the GPU
>>>>>> memory, the other job will die. Note that the --j and --gpu parameters
>>>>>> work differently: --j takes only what you assign (perfect for queueing
>>>>>> systems), while --gpu takes everything it can get unless you restrict
>>>>>> it (not ideal for queueing systems).
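>>>>>> 
>>>>>> (A rough sketch of how such a helper could look; this is not the actual
>>>>>> script offered above, and the exact format of $SGE_HGR_gpu_dev depends
>>>>>> on how the GPU resource is configured, so it is assumed here to be a
>>>>>> space-separated list of device ids such as "0 1 2 3".)
>>>>>> 
>>>>>>   #!/bin/bash
>>>>>>   # Build a RELION --gpu string of the form "X:Y:Z:..." from the SGE
>>>>>>   # variables, one granted device per worker rank (rank 0 is the
>>>>>>   # master and does no GPU work), cycling through the devices.
>>>>>>   devices=($SGE_HGR_gpu_dev)
>>>>>>   nranks=$(( NSLOTS / OMP_NUM_THREADS ))   # MPI ranks of the job
>>>>>>   ngpus=${#devices[@]}
>>>>>> 
>>>>>>   GPU_ASSIGN=""
>>>>>>   for (( rank=1; rank<nranks; rank++ )); do
>>>>>>       dev=${devices[$(( (rank-1) % ngpus ))]}
>>>>>>       GPU_ASSIGN="${GPU_ASSIGN:+$GPU_ASSIGN:}$dev"
>>>>>>   done
>>>>>>   export GPU_ASSIGN
>>>>>> 
>>>>>>   # Usage (illustrative):
>>>>>>   # mpirun --bind-to none -n $nranks relion_refine_mpi \
>>>>>>   #     --j $OMP_NUM_THREADS --gpu "$GPU_ASSIGN" ...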
>>>>>> 
>>>>>> Concerning the OS, please note that the Nvidia drivers for the Pascal
>>>>>> cards are not well supported by CentOS6/RHEL6, and you might want to
>>>>>> switch to CentOS7/RHEL7.
>>>>>> 
>>>>>> I hope this helps!
>>>>>> 
>>>>>> Best,
>>>>>> Dominik
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On 03/06/2017 12:00 PM, Benoît Zuber wrote:
>>>>>>> Hi Daniel, Erki, and Masahide,
>>>>>>>  
>>>>>>> Thank you for your feedback!
>>>>>>>  
>>>>>>> Best
>>>>>>> Benoit
>>>>>>>  
>>>>>>> From: "[log in to unmask]" <[log in to unmask]z.ch>
>>>>>>> Date: Monday, 6 March 2017 at 08:29
>>>>>>> To: Benoît Zuber <[log in to unmask]>
>>>>>>> Subject: AW: geforce vs tesla
>>>>>>>  
>>>>>>> Hello Benoit,
>>>>>>> 
>>>>>>> ETH is currently setting up a cluster with NVidia GTX1080 GPUs for big
>>>>>>> data (https://scicomp.ethz.ch/wiki/Leonhard). We could not test it yet,
>>>>>>> but Relion2 should run on the GPU nodes.
>>>>>>> Best,
>>>>>>> Daniel
>>>>>>> 
>>>>>>> From: Collaborative Computational Project in Electron cryo-Microscopy
>>>>>>> [[log in to unmask]] on behalf of Benoît Zuber [be[log in to unmask]]
>>>>>>> Sent: Monday, 6 March 2017 06:58
>>>>>>> To: [log in to unmask]
>>>>>>> Subject: [ccpem] geforce vs tesla
>>>>>>> 
>>>>>>> Hello,
>>>>>>>  
>>>>>>> Our HPC cluster team is collecting wishes before building a new GPU
>>>>>>> cluster. They are considering either Tesla or GeForce cards. With the
>>>>>>> new 1080 Ti card and its 11 GB RAM, is there any reason to go for
>>>>>>> Tesla cards when considering performance for Relion 2?
>>>>>>>  
>>>>>>> Thanks for your input
>>>>>>> Benoit
>>>>>>>  
>>>>>> 
>>>>>>  
>>  
