Hi Bharat,
Which versions of the 780s do you have? Are they the reference design or a third-party OEM design? Experience has shown that reference-design cards are much more reliable long term than the various 'gamer-mod' cards one can buy. The latter tend to have far cheaper components, plastic fans etc.

I think the standard warranty on GeForce cards from the likes of Amazon is either 90 days or 1 year depending on the card manufacturer (it may be 2 years in the EU), so after 3 years you would be out of luck. If bought from a vendor as a warrantied system you'd normally be covered for 3 years by default. The Exxact machines, for example, are 3-year return-to-base by default, so one would just send back the faulty cards and get replacements. One can pay more for next-business-day onsite service or for an extended, say 5-year, warranty. That's probably the primary benefit of buying prebuilt machines from a vendor (one with a long history of doing business in the field) rather than building them yourself from parts, at which point you'd be on your own warranty-wise.
Note that the Titan cards - that is Titan, Titan Black, Titan X and Titan Xp - tend to have much better longevity. I know of several large pharma companies running big clusters of the original Titan cards, and they've seen on the order of 1 or 2 failures out of 60+ cards over 3 years. The difference is that the Titan cards are made on the same production line as the Tesla cards, and to much more exacting specifications. This is very different from the 780, 980 and 1080 series, where different OEMs can change everything from memory spec to clock speeds to heat sinks. They look cool with those huge fans and fancy UV lights, but ultimately all of that just raises the failure rate.
My advice, having designed and built GeForce-based workstations and clusters for a decade now, is to only ever use the reference-design (now called Founders Edition) cards. Everything else is just gambling, reliability-wise.
All the best
Ross
> On Mar 6, 2017, at 15:21, Bharat Reddy <[log in to unmask]> wrote:
>
> Hi All,
>
> We have an old 20+ node GPU cluster with 4x Nvidia 780s per node, and
> now that they are 3+ years old we are constantly having GPU cards die
> on us on a monthly if not weekly basis. How is the reliability of the
> Nvidia Titans? Have any died so far? If so, how was your vendor's
> support with the Titans?
>
> Cheers,
> BR
>
> _________________________________
> Bharat Reddy, Ph.D.
> Perozo Lab, University of Chicago
> Email: [log in to unmask]
> Tel: (773) 834 - 4734
>
> On Mon, 2017-03-06 at 22:12 +0100, Dominik A. Herbst wrote:
>> Hi all,
>> True, a workstation is much cheaper, but is it used as efficiently?
>> We started with three workstations, one for everybody who was
>> processing data at that time. If you wanted to start several jobs in
>> parallel, you had to start them individually on every machine, and
>> sometimes you also had to ask the person who usually used it for
>> permission, because he/she might want to use it too. In the end this
>> was very inefficient: there were always unused resources, and you
>> always had to figure out where you could start a job, unless you
>> were starting them on your own workstation (one after the other).
>> Since we got the six GPU nodes, the workstations are used mostly for
>> testing or as normal desktop workstations, because it is much easier
>> to send a job to the queue than to figure out where to start jobs
>> yourself.
>> Now our six nodes serve the needs of 3 research groups. They run
>> 24/7 and the queue is rarely long. Moreover, you don't have to deal
>> with noise and heat in the office.
>> If you have a server room, I would always go for central resources
>> that can be used by everybody who needs them and that are managed by
>> a queuing system. That way you can serve the needs of 10 (?) people
>> with 3-4 nodes, which might be even cheaper than buying 10
>> workstations. Of course you could build a mini-cluster from a bunch
>> of workstations, but there are obvious reasons why servers are more
>> expensive than workstations.
>> On the other hand, if there are just a few people running jobs, an
>> individual workstation might be a better option.
>> In the end you have to select the right device for the right purpose.
>> Concerning our nodes, we are perfectly happy. Everybody can use them
>> without complications, and they heat the basement instead of our
>> offices.
>> Sorry, what I wrote might have been misleading. The Titan X cards ran
>> at a temperature between 60-70°C when I was running jobs. The highest
>> temperature I ever observed was an exceptional 80°C. I did not check
>> the clock speed, because they usually do not get hot. Some time ago
>> we had a case with a Tesla K80 node which was getting hot. After our
>> HPC team adjusted the chassis fan speed, it never got hot again.
>>
>>> We are running Slurm and have found compiling OpenMPI against PMI2
>>> libraries essential for distributing MPI ranks across multiple GPU
>>> nodes. Then the RELION commands are launched using srun --mpi=pmi2
>>> rather than mpirun.
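A minimal sketch of what such a Slurm submission could look like (the job geometry, partition-free defaults, and file names below are illustrative assumptions, not taken from the thread):

```shell
#!/bin/bash
# Hypothetical Slurm batch script for a multi-node RELION refinement.
# Assumes an OpenMPI build compiled against the PMI2 libraries, as
# described above; all names and sizes are placeholders.
#SBATCH --job-name=relion_refine
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=5   # MPI ranks per node
#SBATCH --cpus-per-task=4     # threads per rank (matches relion --j)
#SBATCH --gres=gpu:4          # GPUs per node

# srun with PMI2 replaces mpirun for launching the MPI ranks
# across the GPU nodes.
srun --mpi=pmi2 relion_refine_mpi \
    --i particles.star \
    --o Refine3D/job001/run \
    --j "$SLURM_CPUS_PER_TASK" \
    --gpu
```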
>>
>> @David and Craig: Thanks! That's good to know! We are switching to
>> Slurm in a couple of weeks, and I am looking forward to it.
>>
>> Best,
>> Dominik
>>
>> On 03/06/2017 08:09 PM, Bharat Reddy wrote:
>>> Hi Ross,
>>>
>>> I agree with the Exxact recommendation, but that being said, with
>>> the latest Nvidia GPU price drops you can build a 4x GTX 1080 GPU
>>> workstation for about $4-5K. At those prices, you can almost get two
>>> workstations for the price of one of those Exxact GPU nodes.
>>>
>>> Cheers,
>>> BR
>>>
>>> On Mon, 2017-03-06 at 11:06 -0700, Ross Walker wrote:
>>>> Hi All,
>>>>
>>>> I'll add here that you can also purchase ducted server cases for
>>>> these cards, rather than trying to use desktop cases for what are
>>>> really 24/7 servers, and avoid all of the issues associated with
>>>> having to mess with fan speeds etc.
>>>>
>>>> For example, these systems:
>>>> https://exxactcorp.com/index.php/solution/solu_detail/314
>>>> use ducted fans (the same as are used for passive cards) and thus
>>>> do a much better job of cooling the GPUs, avoiding most of the
>>>> issues with clocking down, so you don't need to worry about setting
>>>> coolbits etc. They are VERY loud though, so should really only be
>>>> considered for use in machine rooms. Ultimately these systems are
>>>> easier on the GPUs - those GPU fans are not designed to run at
>>>> 100% 24/7.
>>>>
>>>> All the best
>>>> Ross
>>>>
>>>>> On Mar 6, 2017, at 10:51, Bharat Reddy
>>>>> <000009a7465b91d2-dmarc-requ[log in to unmask]> wrote:
>>>>>
>>>>> Hi Dominik,
>>>>>
>>>>> Did you look at the clock speed of your GPUs during the run? The
>>>>> fact that you said you had 80°C peaks leads me to believe your
>>>>> GPUs might be throttling themselves. The consumer Founders Edition
>>>>> cards I've used have horrible default fan settings: they
>>>>> prioritize noise over performance. This in turn causes your cards
>>>>> to start throttling and can hurt your performance by 5-20%. To
>>>>> bypass the default GPU fan settings you need to enable coolbits.
>>>>> The problem is that coolbits only works on screens that have X
>>>>> sessions running on them. To get around this, the following GitHub
>>>>> project by Boris Dimitrov, based on the work of Axel Kohlmeyer,
>>>>> solves it:
>>>>> https://github.com/boris-dimitrov/set_gpu_fans_public
>>>>> It sets up dummy screens and X sessions on each GPU and has a
>>>>> nifty script to automatically ramp the GPU fan up and down as
>>>>> needed based on GPU temperature. I used this on all our
>>>>> workstations, and it helps keep the GPUs running at nearly top
>>>>> speed at all times with the overall temperature below 78°C. The
>>>>> only drawback is noise.
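For illustration, the ramp logic such a script might implement can be sketched as a simple temperature-to-fan-speed mapping (the thresholds below are my own illustrative guesses, not the ones used in set_gpu_fans_public):

```shell
# Map a GPU temperature in deg C (e.g. read via
#   nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader)
# to a target fan speed in percent. Thresholds are illustrative.
fan_speed_for_temp() {
    local t="$1"
    if [ "$t" -ge 78 ]; then
        echo 100          # near the throttle point: maximum cooling
    elif [ "$t" -ge 70 ]; then
        echo 85
    elif [ "$t" -ge 60 ]; then
        echo 65
    else
        echo 45           # quiet baseline
    fi
}

# The chosen speed would then be applied per GPU through the X session
# that coolbits requires, e.g. (nvidia-settings attribute names):
#   nvidia-settings -a "[gpu:0]/GPUFanControlState=1" \
#                   -a "[fan:0]/GPUTargetFanSpeed=$(fan_speed_for_temp 75)"
```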
>>>>>
>>>>> Cheers,
>>>>> BR
>>>>>
>>>>> _________________________________
>>>>> Bharat Reddy, Ph.D.
>>>>> Perozo Lab, University of Chicago
>>>>> Email: [log in to unmask]
>>>>> Tel: (773) 834 - 4734
>>>>>
>>>>>
>>>>>
>>>>> On Mon, 2017-03-06 at 17:12 +0100, Dominik A. Herbst wrote:
>>>>>> Hi Benoit,
>>>>>>
>>>>>> Approx. half a year ago we bought, together with our HPC team,
>>>>>> six GPU nodes from DALCO (dual Xeon E5-2680, 512 GB RAM, 1.6 TB
>>>>>> Intel NVMe PCIe SSD, 4x Titan X, Infiniband, 2U chassis,
>>>>>> CentOS 7).
>>>>>> Before we bought them we were running Relion2 jobs on 4x GTX 1080
>>>>>> GPU workstations (DALCO, Samsung M.2 NVMe 950 Pro, i7-6900K,
>>>>>> 64-128 GB RAM, 1 GBit/s Ethernet, CentOS 6) as described on Erik
>>>>>> Lindahl's homepage.
>>>>>>
>>>>>> I did plenty of benchmarking. All tests were done using the same
>>>>>> data set and random seed. In all cases the Titan X GPU nodes
>>>>>> showed approx. 25% higher performance, which is in agreement with
>>>>>> the literature. (The workstations with 4x GTX 1080 had a
>>>>>> performance of approx. 10 cluster nodes with dual Xeon E5-2650,
>>>>>> 64 GB RAM, Infiniband (~320 cores / no GPUs).)
>>>>>> The 2U GPU node chassis/boards provide 8 PCIe slots, of which 4
>>>>>> are used for Titan Xs and one for the scratch SSD. In order to
>>>>>> check how performance scales with more Titan Xs, we equipped one
>>>>>> node with 7 of them. I ran benchmarks using the scratch SSD with
>>>>>> the same random seed and the same data set (120,000 ptcls,
>>>>>> 210 px box).
>>>>>>
>>>>>>
>>>>>> Results for using all available resources with 7, 6 and 4 GPUs
>>>>>> on one GPU node
>>>>>> (rpn = MPI ranks per node; ppr = processes per rank / threads;
>>>>>> gpu = GPUs):
>>>>>>
>>>>>> rpn14_ppr2_gpu7: #28 slots
>>>>>> real 23m46.751s
>>>>>> user 309m5.491s
>>>>>> sys 50m48.316s
>>>>>> rpn8_ppr3_gpu7: #24 slots
>>>>>> real 24m42.213s
>>>>>> user 234m41.726s
>>>>>> sys 41m33.935s
>>>>>>
>>>>>> rpn13_ppr2_gpu6: #26 slots
>>>>>> real 25m6.495s
>>>>>> user 312m3.790s
>>>>>> sys 52m24.046s
>>>>>> rpn7_ppr4_gpu6: #28 slots
>>>>>> real 27m9.320s
>>>>>> user 255m1.701s
>>>>>> sys 48m11.140s
>>>>>>
>>>>>> rpn9_ppr3_gpu4: #27 slots
>>>>>> real 32m54.921s
>>>>>> user 370m40.358s
>>>>>> sys 61m44.317s
>>>>>> rpn5_ppr5_gpu4: #25 slots
>>>>>> real 38m54.711s
>>>>>> user 283m0.446s
>>>>>> sys 55m42.250s
>>>>>>
>>>>>> With more than 4 GPUs, the GPUs were never running at full
>>>>>> utilization (>90%), but in a range of 50-70%.
>>>>>> Based on a direct comparison (real-time improvement):
>>>>>>
>>>>>> GPUs used        increase (1 rank/GPU)   increase (2 ranks/GPU)
>>>>>> 4-->6 (2 more)   30.2% (15%/GPU)         23.8% (12%/GPU)
>>>>>> 4-->7 (3 more)   36.5% (12%/GPU)         27.8% (9%/GPU)
>>>>>>
>>>>>> This tells me that 6 GPUs scale better than 7 on this 28-core
>>>>>> machine, which is why we plan to upgrade all nodes to 6 Titan Xs.
>>>>>> Surprisingly, the temperature was very moderate (~60-70°C, 80°C
>>>>>> at the peak) despite the high packing density, but it might be
>>>>>> that our HPC team did some chassis fan tuning.
>>>>>>
>>>>>> Currently we are using the Univa Grid Engine, which comes with
>>>>>> some problems for running hybrid SMP-MPI jobs, but it works.
>>>>>> Unfortunately, UGE (/SGE) cannot run hybrid SMP-MPI-GPU jobs on
>>>>>> several nodes, which limits your job request to one node.
>>>>>> If you want to use GPUs on several nodes, Slurm is a better
>>>>>> choice, and we will switch to it soon.
>>>>>>
>>>>>> However, in our case we had severe issues with core binding of
>>>>>> MPI processes. Often all of them were bound to the first cores;
>>>>>> even when a second job was started, they ended up on the same
>>>>>> cores (!!!), unless mpirun was started with the "--bind-to none"
>>>>>> parameter.
>>>>>> Furthermore, I recommend providing a $GPU_ASSIGN variable with
>>>>>> your Relion2 (module) installation that generates the --gpu
>>>>>> string from the SGE variables ($SGE_HGR_gpu_dev, $NSLOTS and
>>>>>> $OMP_NUM_THREADS). If you like, I can provide you with the bash
>>>>>> script.
>>>>>> In my opinion this is particularly important because, if the
>>>>>> --gpu X,x,x,x:Y,y,y,y:... parameter is not set, Relion2 will use
>>>>>> ALL resources and distribute the job itself. This is particularly
>>>>>> bad if a second job is started on the same node, because the two
>>>>>> jobs will compete for the same resources, and once one job has
>>>>>> taken all the GPU memory, the other job will die. Note that the
>>>>>> --j and --gpu parameters work differently: --j takes only what
>>>>>> you assign (perfect for queueing systems), while --gpu takes
>>>>>> everything it can get unless you restrict it (not ideal for
>>>>>> queueing systems).
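A sketch of what such a $GPU_ASSIGN helper might look like, simplified to one GPU per MPI rank and distributed round-robin (the variable names follow the SGE ones mentioned above; the distribution scheme is my assumption, not Dominik's actual script):

```shell
# Build a RELION --gpu string ("dev:dev:...", one device per MPI rank)
# from a space-separated device list as SGE might grant it.
# Hypothetical helper; RELION separates ranks with ':' in --gpu.
build_gpu_assign() {
    local devices="$1" nslots="$2" nthreads="$3"
    local ranks=$(( nslots / nthreads ))   # MPI ranks = slots / threads
    local -a devs=($devices)
    local ndev=${#devs[@]} out="" i
    for (( i = 0; i < ranks; i++ )); do
        out+="${devs[$(( i % ndev ))]}"    # round-robin over devices
        if (( i < ranks - 1 )); then out+=":"; fi
    done
    echo "$out"
}

# Usage in a job script, mirroring the SGE variables named above:
#   GPU_ASSIGN=$(build_gpu_assign "$SGE_HGR_gpu_dev" "$NSLOTS" "$OMP_NUM_THREADS")
#   ... relion_refine_mpi ... --gpu "$GPU_ASSIGN"
```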
>>>>>>
>>>>>> Concerning the OS, please note that the Nvidia drivers for the
>>>>>> Pascal cards are not well supported on CentOS 6/RHEL 6, and you
>>>>>> might want to switch to CentOS 7/RHEL 7.
>>>>>>
>>>>>> I hope this helps!
>>>>>>
>>>>>> Best,
>>>>>> Dominik
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 03/06/2017 12:00 PM, Benoît Zuber wrote:
>>>>>>> Hi Daniel, Erki, and Masahide,
>>>>>>>
>>>>>>> Thank you for your feedback!
>>>>>>>
>>>>>>> Best
>>>>>>> Benoit
>>>>>>>
>>>>>>> From: "[log in to unmask]" <[log in to unmask]z.ch>
>>>>>>> Date: Monday, 6 March 2017 at 08:29
>>>>>>> To: Benoît Zuber <[log in to unmask]>
>>>>>>> Subject: AW: geforce vs tesla
>>>>>>>
>>>>>>> Hello Benoit,
>>>>>>>
>>>>>>> ETH is currently setting up a cluster with Nvidia GTX 1080 GPUs
>>>>>>> for big data (https://scicomp.ethz.ch/wiki/Leonhard). We could
>>>>>>> not test it yet, but Relion2 should run on the GPU nodes.
>>>>>>> Best,
>>>>>>> Daniel
>>>>>>> From: Collaborative Computational Project in Electron
>>>>>>> cryo-Microscopy [[log in to unmask]] on behalf of Benoît Zuber
>>>>>>> [be[log in to unmask]]
>>>>>>> Sent: Monday, 6 March 2017 06:58
>>>>>>> To: [log in to unmask]
>>>>>>> Subject: [ccpem] geforce vs tesla
>>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> Our HPC cluster team is collecting wishes before building a new
>>>>>>> GPU cluster. They are considering either Tesla or GeForce cards.
>>>>>>> With the new 1080 Ti card and its 11 GB RAM, is there any reason
>>>>>>> to go for Tesla cards when considering performance for Relion 2?
>>>>>>>
>>>>>>> Thanks for your input
>>>>>>> Benoit
>>>>>>>
>>>>>>
>>>>>>
>>