Hi All,


I noticed that hyper-threading is enabled in Nicolas' configuration. Attached is a discussion on the use of hyper-threading in HPC (it appears to be from Dell, found via Google). I wonder, when running Relion2 with GPUs, whether hyper-threading makes any difference, perhaps in overlapping CPU/GPU computation and data transfer? And a related question: how much CPU power is needed to avoid a bottleneck when running with 8 fast GPUs like Titans or 1080s (some double-precision calculations are still performed on the CPU, right?). Thanks a lot for any suggestions!
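(In case it helps with testing: on Linux the sibling hyper-threads can be taken offline at runtime, so the same node can be benchmarked with and without HT. A sketch, assuming the siblings are the upper half of the logical CPUs, as they appear to be in Nicolas' lscpu output below:

    # confirm which logical CPUs are HT siblings of each other
    cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
    # take the sibling threads offline, e.g. CPUs 24-47 on Nicolas' topology
    for n in $(seq 24 47); do echo 0 | sudo tee /sys/devices/system/cpu/cpu$n/online; done

Echoing 1 to the same files brings them back.)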


Best,

Weiwei


________________________________
From: Collaborative Computational Project in Electron cryo-Microscopy <[log in to unmask]> on behalf of Ali Siavosh-Haghighi <[log in to unmask]>
Sent: Friday, May 5, 2017 12:01 PM
To: [log in to unmask]
Subject: Re: [ccpem] Relion - Tests on a 8 GPU node

Hi All,
At the same time I should add that all the memory on cards fills up (11GB per card; either for 8-GPU or 4-GPU assignments).
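For anyone who wants to watch this live, we simply poll nvidia-smi, e.g.:

    # report per-card memory and utilization every 5 seconds
    nvidia-smi --query-gpu=index,memory.used,memory.total,utilization.gpu --format=csv -l 5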
===========================================
Ali  Siavosh-Haghighi, Ph.D.
HPC System Administrator
High Performance Computing Facility
Medical Center Information Technologies
NYU Langone Medical Center
Phone: (646) 501-2907
 http://wiki.hpc.med.nyu.edu/
===========================================


On May 5, 2017, at 11:36 AM, Coudray, Nicolas <[log in to unmask]> wrote:

Hi,

Thank you all for your feedback!

We will let you know the results on the benchmark dataset asap.


Regarding the number of MPIs used, there was indeed a typo: I used "8 GPUs, 9 MPIs and 6 threads" in the last run of each job (except auto-picking, where I used 8 MPIs).

As for the mapping, this is what I did (full command sketches below):
for 2 GPUs, 3 MPIs, 24 threads: --gpu "0:1"
for 4 GPUs, 3 MPIs, 24 threads: --gpu "0,1:2,3"
for 4 GPUs, 5 MPIs, 12 threads: --gpu "0:1:2:3"
for 8 GPUs:                     --gpu ""
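Spelled out as full commands, these correspond to something like the following (a sketch: the real input/output options are omitted, and the mpirun syntax assumes Open MPI):

    mpirun -n 3 relion_refine_mpi --j 24 --gpu "0:1" ...         # 2 GPUs: 1 master + 2 slaves
    mpirun -n 3 relion_refine_mpi --j 24 --gpu "0,1:2,3" ...     # 4 GPUs: 2 cards per slave
    mpirun -n 5 relion_refine_mpi --j 12 --gpu "0:1:2:3" ...     # 4 GPUs: 1 card per slave
    mpirun -n 9 relion_refine_mpi --j 6  --gpu ...               # 8 GPUs: relion assigns devices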

So for all the tests with 8 GPUs I let Relion figure it out, which is why the loss of performance when going from 5 to 9 MPIs on 8 GPUs is puzzling to me.


We also noticed that for some jobs the GPUs do not reach 100% utilization. Regarding this, Bharat, you mentioned that you often run 2 MPI processes with at least 2 threads per GPU. Does that only increase the GPU utilization percentage, or do you also see a corresponding speed improvement?
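(For our own side-by-side test, my understanding is that two ranks per card can also be forced explicitly; a sketch on 4 cards with 9 MPIs, treating the mapping string as our assumption rather than a recipe:

    # 8 slave ranks cycling over 4 cards, i.e. 2 ranks per card
    mpirun -n 9 relion_refine_mpi --j 6 --gpu "0:1:2:3:0:1:2:3" ...
)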




BTW, regarding the CPUs we've been using, these are the specifications (lscpu output):
CPU(s):                48
On-line CPU(s) list:   0-47
Thread(s) per core:    2
Core(s) per socket:    12
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
Stepping:              1
CPU MHz:               2499.921
BogoMIPS:              4405.38
NUMA node0 CPU(s): 0-11,24-35
NUMA node1 CPU(s): 12-23,36-47
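Given the two NUMA nodes, we are also wondering whether rank placement matters; a sketch of what we might try (assuming our MPI is Open MPI, whose mpirun supports these binding flags):

    # bind each rank's threads and memory to one NUMA node
    mpirun -n 9 --map-by numa --bind-to numa relion_refine_mpi --j 6 --gpu ...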

Thanks,
Best
Nicolas




________________________________________
From: Bjoern Forsberg [[log in to unmask]]
Sent: Friday, May 05, 2017 4:51 AM
To: Coudray, Nicolas; [log in to unmask]
Subject: Re: [ccpem] Relion - Tests on a 8 GPU node

Hi Nicolas,

Just to clarify, is 8 MPIs a typo? I see you ran 3 and 5 MPIs on some runs with 2 and 4 GPUs, presumably to accommodate the master rank (which does no GPU work, so N GPUs pair naturally with N+1 MPIs). So I would expect you ran 9 MPIs on some of the 8-GPU runs, right? One reason I'm asking is that I would expect performance to increase between these runs, e.g.:

8 GPUs, 5 MPIs, 12 threads:   4h15
8 GPUs, 8 MPIs,  6 threads:   8h47

But you see a detrimental loss of performance. If you did actually run 8 MPIs, that might be why. Running

8 GPUs, 9 MPIs,  6 threads:   ?

should be interesting, and more relevant for performance, in that case.

Also, did you specify --gpu without any device numbers in all cases? If you did specify GPU indices, performance is fairly sensitive to how well you mapped the ranks and threads to GPUs. This is why we typically advise *not* specifying which GPUs to use and letting relion figure it out on its own, unless you want/need to specify something in particular.
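In other words (a sketch of the two forms):

    relion_refine_mpi ... --gpu              # no argument: relion spreads ranks/threads over all visible cards
    relion_refine_mpi ... --gpu "0:1:2:3"    # explicit: you are responsible for a balanced mapping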

Thanks for sharing!

/Björn

On 05/04/2017 10:13 PM, Coudray, Nicolas wrote:
Hi all,


We have been running and testing Relion 2.0 on our 8 GPU nodes to try to figure out the optimal parameters. We thought these results might be interesting to share, and we are looking for any suggestions / comments / similar tests that you could provide.



Our configuration is:
8-GPU nodes (TITAN X Pascal, 11 GB per card), 48 slots, 750 GB RAM, Relion compiled with gcc 4.8.5, kernel 3.10, CentOS 7.3, a 4 TB hard drive

At first, we only varied the number of GPUs, MPIs and threads, leaving the other disk-access options constant (no parallel disc I/O, particles pre-read into RAM, no "combine iterations"). The results for each type of job are as follows:

*** 2D Classification (265k particles of 280x280 pixels, 5 rounds, 50 classes):
2 GPUs, 3 MPIs, 24 threads: 14h04
4 GPUs, 3 MPIs, 24 threads:   5h23
4 GPUs, 5 MPIs, 12 threads: 13h28
8 GPUs, 3 MPIs, 24 threads:   3h14
8 GPUs, 5 MPIs, 12 threads:   5h10
8 GPUs, 8 MPIs,   6 threads: 13h28


*** 3D Classification (226k particles of 280x280 pixels, 5 rounds, 5 classes):
2 GPUs, 3 MPIs, 24 threads: 15h17
4 GPUs, 3 MPIs, 24 threads:   5h53
4 GPUs, 5 MPIs, 12 threads:   8h11
8 GPUs, 3 MPIs, 24 threads:   2h48
8 GPUs, 5 MPIs, 12 threads:   3h16
8 GPUs, 8 MPIs,   6 threads:   4h37


*** 3D Refinement (116k particles of 280x280 pixels):
2 GPUs, 3 MPIs, 24 threads: 12h07
4 GPUs, 3 MPIs, 24 threads:   4h54
4 GPUs, 5 MPIs, 12 threads:   9h12
8 GPUs, 3 MPIs, 24 threads:   4h57
8 GPUs, 5 MPIs, 12 threads:   4h15
8 GPUs, 8 MPIs,   6 threads:   8h47


*** Auto-picking (on 2600 micrographs, generating around 750k particles):
0 GPU , 48 threads: 79 min
2 GPUs,   2 threads: 52 min
2 GPUs, 48 threads: error (code 11)
4 GPUs,   4 threads: 27 min
4 GPUs, 48 threads: 8 min
8 GPUs,   8 threads: 19 min
8 GPUs, 48 threads: 12 min



Does anyone have particular suggestions / feedback?

We are using these tests to guide the expansion of our centralized GPU capability, and any comments are most welcome.


Thanks,
Best,

Nicolas Coudray
New York University


