Hi Dominik,
Did you look at the clock speeds of your GPUs during the run? The fact
that you saw 80C peaks makes me suspect your GPUs were throttling
themselves. The consumer Founders Edition cards I've used have horrible
default fan settings: they prioritize noise over performance, which
causes the cards to throttle and can hurt your performance by 5-20%. To
override the default GPU fan settings you need to enable Coolbits, but
Coolbits only works on GPUs that have an X session running on them. The
following GitHub project by Boris Dimitrov, based on the work of Axel
Kohlmeyer, works around this:
https://github.com/boris-dimitrov/set_gpu_fans_public . It sets up a
dummy screen and X session on each GPU and has a nifty script that
automatically ramps the fans up and down based on GPU temperature. I
use this on all our workstations, and it keeps the GPUs running at
nearly top speed at all times and the overall temperature below 78C.
The only drawback is noise.
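
An easy way to check whether throttling is the problem is to log the
clocks and temperatures while a job is running. Something along these
lines should work with any reasonably recent driver (the exact query
fields can vary between driver versions):

    # log clock, temperature and utilization of every GPU every 5 seconds
    nvidia-smi --query-gpu=index,name,temperature.gpu,clocks.sm,utilization.gpu --format=csv -l 5

    # show the currently active throttle reasons
    nvidia-smi -q -d PERFORMANCE

If the SM clock drops as the temperature climbs, the fans are the first
thing to look at. If you want to enable manual fan control yourself
rather than use the project above, Coolbits can usually be switched on
with something like "nvidia-xconfig --enable-all-gpus --cool-bits=4
--allow-empty-initial-configuration" (value 4 enables manual fan
control), but again this only takes effect for GPUs with an X session.
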
Cheers,
BR
_________________________________
Bharat Reddy, Ph.D.
Perozo Lab, University of Chicago
Email: [log in to unmask]
Tel: (773) 834 - 4734
On Mon, 2017-03-06 at 17:12 +0100, Dominik A. Herbst wrote:
> Hi Benoit,
>
> Approx. half a year ago, together with our HPC team, we bought six GPU
> nodes from DALCO (dual Xeon E5-2680, 512 GB RAM, 1.6 TB Intel NVMe
> PCIe SSD, 4x Titan X, InfiniBand, 2U chassis, CentOS 7).
> Before we bought them we were running Relion2 jobs on 4x GTX 1080
> workstations (DALCO, Samsung 950 Pro M.2 NVMe, i7-6900K, 64-128 GB
> RAM, 1 Gbit/s Ethernet, CentOS 6), as described on Erik Lindahl's
> homepage.
>
> I did plenty of benchmarking. All tests were done with the same data
> set and random seed. In all cases the Titan X GPU nodes showed
> approx. 25 % higher performance, which is in agreement with the
> literature. (The workstations with 4x GTX 1080 had the performance of
> approx. 10 cluster nodes with dual Xeon E5-2650, 64 GB RAM, InfiniBand
> (~ 320 cores / no GPUs).)
> The 2U GPU node chassis/boards provide 8 PCIe slots, of which 4 are
> used for Titan Xs and one for the scratch SSD. To check how
> performance scales with more Titan Xs, we equipped one node with 7 of
> them. I ran benchmarks on the scratch SSD with the same random seed
> and the same data set (120,000 particles, 210 px box).
>
>
> Results when using all available resources for 7, 6 and 4 GPUs on one
> GPU node:
> (rpn = MPI ranks per node; ppr = processes per rank, i.e. threads;
> gpu = number of GPUs)
>
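> (For instance, rpn8_ppr3_gpu7 corresponds to a launch along the lines
> of "mpirun -n 8 relion_refine_mpi ... --j 3 --gpu 0:1:2:3:4:5:6"; the
> exact command is generated by our submission scripts, and the --gpu
> device string here is only illustrative.)
>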
> rpn14_ppr2_gpu7: #28 slots
> real 23m46.751s
> user 309m5.491s
> sys 50m48.316s
> rpn8_ppr3_gpu7: #24 slots
> real 24m42.213s
> user 234m41.726s
> sys 41m33.935s
>
> rpn13_ppr2_gpu6: #26 slots
> real 25m6.495s
> user 312m3.790s
> sys 52m24.046s
> rpn7_ppr4_gpu6: #28 slots
> real 27m9.320s
> user 255m1.701s
> sys 48m11.140s
>
> rpn9_ppr3_gpu4: #27 slots
> real 32m54.921s
> user 370m40.358s
> sys 61m44.317s
> rpn5_ppr5_gpu4: #25 slots
> real 38m54.711s
> user 283m0.446s
> sys 55m42.250s
>
> With more than 4 GPUs, the cards never reached full utilization
> (>90%), but stayed in a range of 50-70%.
> Based on a direct comparison (real-time improvement):
>
> GPUs used           increase (1 rank/GPU)   increase (2 ranks/GPU)
> 4 --> 6 (2 more)    30.2% (15%/GPU)         23.8% (12%/GPU)
> 4 --> 7 (3 more)    36.5% (12%/GPU)         27.8% (9%/GPU)
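> (The percentages are the relative savings in real time; e.g. for
> 1 rank/GPU going from 4 to 6 GPUs: (38m55s - 27m09s) / 38m55s ~ 30%.)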
>
> This tells me that 6 GPUs scale better than 7 on this 28-core
> machine, which is why we plan to upgrade all nodes to 6 Titan Xs.
>
> Surprisingly, the temperature was very moderate (~ 60-70°C, 80°C at
> the peak) despite the high packing density, but it might be that our
> HPC team did some chassis fan tuning.
>
> Currently we are using Univa Grid Engine, which comes with some
> problems for running hybrid SMP-MPI jobs, but it works.
> Unfortunately, UGE (/SGE) cannot run hybrid SMP-MPI-GPU jobs across
> several nodes, which limits a job request to one node.
> If you want to use GPUs on several nodes, Slurm is a better choice,
> and we will switch to it soon.
>
> However, in our case we had severe issues with core binding of MPI
> processes: often all of them were bound to the first cores, and even
> when a second job was started it ended up on the same cores (!!!),
> unless mpirun was started with the "--bind-to none" parameter.
> Furthermore, I recommend providing a $GPU_ASSIGN variable with your
> Relion2 (module) installation that generates the --gpu string from
> the SGE variables ($SGE_HGR_gpu_dev, $NSLOTS and $OMP_NUM_THREADS).
> If you like, I can provide you with the bash script (a rough sketch
> of the idea follows below).
> In my opinion this is particularly important, because if the --gpu
> X,x,x,x:Y,y,y,y:... parameter is not set, Relion2 will use ALL
> resources and distribute the job itself. This is particularly bad if
> a second job is started on the same node, because the two jobs will
> compete for the same resources, and once one job has taken all the
> GPU memory, the other will die. Note that the --j and --gpu
> parameters work differently: --j takes only what you assign (perfect
> for queueing systems), whereas --gpu takes everything it can get
> unless you restrict it (not ideal for queueing systems).
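>
> A rough sketch of what such a script can look like (this assumes that
> $SGE_HGR_gpu_dev holds the granted GPU device ids as a space-separated
> list; the exact format depends on how your GPU resource/RSMAP is
> configured, so treat the parsing below as an assumption):
>
>     #!/bin/bash
>     # Sketch: build the Relion2 --gpu string from the queueing-system
>     # environment. Assumes UGE/SGE exports $SGE_HGR_gpu_dev (granted
>     # GPU ids), $NSLOTS and $OMP_NUM_THREADS.
>     GPUS=( $SGE_HGR_gpu_dev )               # e.g. "0 1 2 3"
>     NRANKS=$(( NSLOTS / OMP_NUM_THREADS ))  # MPI ranks on this node
>     GPU_ASSIGN=""
>     for (( r=0; r<NRANKS; r++ )); do
>         # round-robin the ranks over the granted devices, one device
>         # per rank (you may want to skip the master rank, which does
>         # not use a GPU)
>         GPU_ASSIGN+="${GPUS[$(( r % ${#GPUS[@]} ))]}:"
>     done
>     export GPU_ASSIGN=${GPU_ASSIGN%:}       # strip the trailing ':'
>
>     # hypothetical invocation from the job script:
>     # mpirun --bind-to none -n $NRANKS relion_refine_mpi ... \
>     #     --j $OMP_NUM_THREADS --gpu "$GPU_ASSIGN"
>
> This way each job only ever sees the GPUs it was granted, and two
> jobs on the same node no longer fight over the same devices.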
>
> Concerning the OS, please note that the Nvidia drivers for the Pascal
> cards are not well supported on CentOS 6/RHEL 6, and you might want
> to switch to CentOS 7/RHEL 7.
>
> I hope this helps!
>
> Best,
> Dominik
>
>
>
>
> > On 03/06/2017 12:00 PM, Benoît Zuber wrote:
> > Hi Daniel, Erki, and Masahide,
> >
> > Thank you for your feedback!
> >
> > Best
> > Benoit
> >
> > From: "[log in to unmask]" <[log in to unmask]>
> > Date: Monday, 6 March 2017 at 08:29
> > To: Benoît Zuber <[log in to unmask]>
> > Subject: RE: geforce vs tesla
> >
> > Hello Benoit,
> >
> > ETH is currently setting up a cluster with Nvidia GTX 1080 GPUs for
> > big data (https://scicomp.ethz.ch/wiki/Leonhard). We could not test
> > it yet, but Relion2 should run on the GPU nodes.
> > Best,
> > Daniel
> > From: Collaborative Computational Project in Electron cryo-
> > Microscopy [[log in to unmask]] on behalf of Benoît Zuber [be
> > [log in to unmask]]
> > Sent: Monday, 6 March 2017 06:58
> > To: [log in to unmask]
> > Subject: [ccpem] geforce vs tesla
> >
> > Hello,
> >
> > Our HPC cluster team is collecting wishes before building a new
> > GPU cluster. They are considering Tesla or GeForce cards.
> > With the new 1080 Ti card and its 11 GB RAM, is there any reason to
> > go for Tesla cards when considering performance for Relion 2?
> >
> > Thanks for your input
> > Benoit
> >
>