Hi All,

I'll add here that you can also purchase ducted server cases for these cards, rather than trying to use desktop cases for what are really 24/7 servers, and thereby avoid all of the issues associated with having to mess with fan speeds etc.

For example, these systems (https://exxactcorp.com/index.php/solution/solu_detail/314) use ducted fans (the same approach used for passive cards) and thus do a much better job of cooling the GPUs. They avoid most of the issues with clocking down, so you don't need to worry about setting coolbits etc. They are VERY loud, though, so they should really only be considered for use in machine rooms. Ultimately these systems are easier on the GPUs: the fans on consumer cards are not designed to run at 100% 24/7.

All the best
Ross

On Mar 6, 2017, at 10:51, Bharat Reddy <[log in to unmask]> wrote:

Hi Dominik,

Did you look at the clock speed of your GPUs during the run? The fact
that you said you had 80C peaks leads me to believe your GPUs might be
throttling themselves. The consumer Founders Edition cards I've used
have horrible default fan settings: they prioritize noise over
performance. This in turn causes your cards to start throttling and can
hurt your performance by 5-20%. In order to bypass the default GPU fan
settings you need to enable coolbits. The problem is that coolbits only
works on GPUs that have X sessions running on them. To bypass this
problem, the following GitHub project by Boris Dimitrov, based on the
work of Axel Kohlmeyer, solves it:
https://github.com/boris-dimitrov/set_gpu_fans_public
It sets up dummy screens and X sessions on each GPU and has a nifty
script to automatically ramp the GPU fan up and down as needed based on
GPU temperature. I used this on all our workstations and it helps keep
the GPUs running at nearly top speed at all times, with the overall
temperature below 78C. The only drawback is noise.
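
For reference, a minimal sketch of the manual route (assuming the
standard nvidia-smi/nvidia-xconfig/nvidia-settings tools; the exact
nvidia-settings attribute names vary with driver version, and the
set_gpu_fans_public scripts automate the dummy X session part):

# Check current clocks, temperature, utilization and throttle reasons:
nvidia-smi --query-gpu=index,clocks.sm,temperature.gpu,utilization.gpu --format=csv -l 5
nvidia-smi -q -d PERFORMANCE

# Allow manual fan control (rewrites xorg.conf; X must be restarted afterwards):
sudo nvidia-xconfig --enable-all-gpus --cool-bits=4

# With an X session running on the GPU, override the default fan curve, e.g. 80% on GPU 0:
DISPLAY=:0 nvidia-settings -a "[gpu:0]/GPUFanControlState=1" -a "[fan:0]/GPUTargetFanSpeed=80"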

Cheers,
BR

_________________________________
Bharat Reddy, Ph.D.
Perozo Lab, University of Chicago
Email: [log in to unmask]
Tel: (773) 834 - 4734



On Mon, 2017-03-06 at 17:12 +0100, Dominik A. Herbst wrote:
Hi Benoit,

Approx. half a year ago, together with our HPC team, we bought six GPU
nodes from DALCO (dual Xeon E5-2680, 512 GB RAM, 1.6 TB Intel NVMe
PCIe SSD, 4x Titan X, InfiniBand, 2U chassis, CentOS 7).
Before we bought them we were running Relion2 jobs on 4x GTX 1080 GPU
workstations (DALCO, Samsung M.2 NVMe 950 Pro, i7-6900K, 64-128 GB
RAM, 1 Gbit/s Ethernet, CentOS 6), as described on Erik Lindahl's
homepage.

I did plenty of benchmarking. All tests were done using the same data
set and random seed. In all cases the Titan X GPU nodes showed approx.
25% higher performance, which is in agreement with the literature.
(The workstations with 4x GTX 1080 had the performance of approx. 10
cluster nodes with dual Xeon E5-2650, 64 GB RAM and InfiniBand (~320
cores, no GPUs).)
The 2U GPU node chassis/boards provide 8 PCIe slots, of which 4 are
used for Titan Xs and one for the scratch SSD. In order to check how
performance scales with more Titan Xs, we equipped one node with 7 of
them. I ran benchmarks on the scratch SSD with the same random seed
and the same data set (120,000 particles, 210 px box).


Results for the usage of all possible resources with 7, 6 and 4 GPUs
on one GPU node (rpn = MPI ranks per node; ppr = processes (threads)
per rank; gpu = number of GPUs used):

rpn14_ppr2_gpu7: #28 slots
real    23m46.751s
user    309m5.491s
sys    50m48.316s
rpn8_ppr3_gpu7: #24 slots
real    24m42.213s
user    234m41.726s
sys    41m33.935s
 
rpn13_ppr2_gpu6: #26 slots
real    25m6.495s
user    312m3.790s
sys    52m24.046s
rpn7_ppr4_gpu6: #28 slots
real    27m9.320s
user    255m1.701s
sys    48m11.140s

rpn9_ppr3_gpu4: #27 slots
real    32m54.921s
user    370m40.358s
sys    61m44.317s
rpn5_ppr5_gpu4: #25 slots
real    38m54.711s
user    283m0.446s
sys    55m42.250s
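
For context, a configuration such as rpn8_ppr3_gpu7 corresponds
roughly to a launch line like the following (a hypothetical
illustration; the actual refinement options are omitted, and one of
the 8 ranks is the Relion master, which does not use a GPU):

mpirun -np 8 relion_refine_mpi --j 3 --gpu 0:1:2:3:4:5:6 [refinement options]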

With more than 4 GPUs the GPUs were never running at full utilization
(>90%), but rather in the range of 50-70%.
Based on a direct comparison (improvement in real time):

GPUs used           increase at 1 rank/GPU   increase at 2 ranks/GPU
4 --> 6 (2 more)    30.2% (15%/GPU)          23.8% (12%/GPU)
4 --> 7 (3 more)    36.5% (12%/GPU)          27.8% (9%/GPU)

This tells me that 6 GPUs scale better than 7 on this 28-core
machine, which is why we plan to upgrade all nodes to 6 Titan Xs.

Surprisingly, the temperatures were very moderate (~60-70°C, 80°C at
peak) despite the high packing density, but it might be that our HPC
team did some chassis fan tuning.

Currently we are using the Univa Grid Engine, which comes with some
problems for running hybrid SMP-MPI jobs, but it works.
Unfortunately, UGE (/SGE) cannot run hybrid SMP-MPI-GPU jobs across
several nodes, which limits a job request to one node.
If you want to use GPUs on several nodes, Slurm is a better choice,
and we will switch to it soon.
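
To illustrate what such a multi-node GPU job would look like under
Slurm (a minimal sketch with placeholder resource counts and omitted
Relion options, not our actual configuration):

#!/bin/bash
#SBATCH --nodes=2                  # hybrid SMP-MPI-GPU job across two GPU nodes
#SBATCH --ntasks-per-node=7        # MPI ranks per node
#SBATCH --cpus-per-task=4          # threads per rank (Relion --j)
#SBATCH --gres=gpu:6               # GPUs requested per node

mpirun relion_refine_mpi --j $SLURM_CPUS_PER_TASK [refinement and --gpu options]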

However, in our case we had severe issues with core binding of MPI
processes. Often all of them were bound to the first cores, and even
when a second job was started it ended up on the same cores (!!!),
unless mpirun was started with the "--bind-to none" parameter.
Furthermore, I recommend providing a $GPU_ASSIGN variable with your
Relion2 (module) installation that generates the --gpu string from
the SGE variables ($SGE_HGR_gpu_dev, $NSLOTS and $OMP_NUM_THREADS).
If you like, I can provide you with the bash script.
In my opinion this is particularly important, because if the --gpu
X,x,x,x:Y,y,y,y:... parameter is not set, Relion2 will use ALL
resources and distribute the job itself. This is particularly bad if
a second job is started on the same node, because the two jobs will
compete for the same resources, and once one job has taken all the
GPU memory, the other job will die. Note that the --j and --gpu
parameters work differently: --j takes only what you assign (perfect
for queueing systems), whereas --gpu takes everything it can get
unless you restrict it (not ideal for queueing systems).
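
To illustrate the idea (this is only a sketch, not the actual script;
it assumes $SGE_HGR_gpu_dev holds a space-separated list of the
granted device ids, which depends on how the gpu resource is
configured in your UGE/SGE setup):

#!/bin/bash
# Sketch: build a Relion2 --gpu string from the UGE/SGE job environment.
devices=($SGE_HGR_gpu_dev)               # e.g. "0 1 2 3" -> array of granted GPU ids
ngpu=${#devices[@]}

nranks=$(( NSLOTS / OMP_NUM_THREADS ))   # total MPI ranks in the job
nworkers=$(( nranks - 1 ))               # one rank is the Relion master and gets no GPU

# Assign the granted devices to the worker ranks round-robin; ranks are separated by ":".
GPU_ASSIGN=""
for (( i = 0; i < nworkers; i++ )); do
    dev=${devices[$(( i % ngpu ))]}
    GPU_ASSIGN="${GPU_ASSIGN:+${GPU_ASSIGN}:}${dev}"
done
export GPU_ASSIGN

# e.g.: mpirun -np $nranks --bind-to none relion_refine_mpi --j $OMP_NUM_THREADS --gpu $GPU_ASSIGN ...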

Concerning the OS, please note that the NVIDIA drivers for the Pascal
cards are not well supported on CentOS 6/RHEL 6, so you might want to
switch to CentOS 7/RHEL 7.

I hope this helps!

Best,
Dominik




On 03/06/2017 12:00 PM, Benoît Zuber wrote:
Hi Daniel, Erki, and Masahide,
 
Thank you for your feedback!
 
Best
Benoit
 
From: "[log in to unmask]" <[log in to unmask]>
Date: Monday, 6 March 2017 at 08:29
To: Benoît Zuber <[log in to unmask]>
Subject: AW: geforce vs tesla
 
Hello Benoit,

ETH is currently setting up a cluster with NVIDIA GTX 1080 GPUs for
big data (https://scicomp.ethz.ch/wiki/Leonhard). We have not been
able to test it yet, but Relion2 should run on the GPU nodes.
Best,
Daniel
From: Collaborative Computational Project in Electron cryo-
Microscopy [[log in to unmask]] on behalf of Benoît Zuber
[[log in to unmask]]
Sent: Monday, 6 March 2017 06:58
To: [log in to unmask]
Subject: [ccpem] geforce vs tesla

Hello,
 
Our HPC cluster team is collecting wishes before building a new GPU
cluster. They are considering either Tesla or GeForce cards. With the
new 1080 Ti card and its 11 GB of RAM, is there any reason to go for
Tesla cards when considering performance for Relion 2?
 
Thanks for your input
Benoit