Dimitry,
FYI, I wasn't able to get this to compile with the Intel C++ compiler; it works fine with GCC. Which version of Open MPI did you use?
I think it would have been helpful to upload the vanilla Relion-1.4 to GitHub first and then push all your changes on top, as that would make it more straightforward to examine the changes with git show and the web interface. I can do this by hand with shell tools, but I'm lazy.
I compared Relion-1.4-ICC with Vlion-1.4-GCC for a 3D auto-refinement. Relion-1.4 with the Intel compiler won handily on a dedicated cluster with InfiniBand. The Vlion run was a repeat of an earlier case, so I don't have any timings other than what Relion reports for the expectation steps. The fact that Vlion is very close to 2.0x slower suggests that it's not actually running in single precision. There are a bunch of 99e99 constants in Relion that you may want to replace with proper macros, like FLT_MAX and DBL_MAX from <cfloat> (or a project-defined equivalent). Hard-coded literals like that can cause issues with single-precision casting and/or function definitions, depending on the compiler and language standard; ICC generates warnings about them.
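To make the point concrete, here is a minimal sketch (the names are mine, not from the RELION source) of why a 99e99-style sentinel misbehaves once the build switches to single precision:

    // sketch.cpp: a 99e99-style sentinel versus single precision.
    #include <cfloat>
    #include <cstdio>

    int main()
    {
        const double sentinel = 99.e99;                  // fine as a double (DBL_MAX ~ 1.8e308)
        std::printf("sentinel           = %g\n", sentinel);
        std::printf("FLT_MAX            = %g\n", (double) FLT_MAX);            // ~3.4e38
        std::printf("sentinel > FLT_MAX : %d\n", sentinel > (double) FLT_MAX);

        // In a single-precision build the same literal overflows a float, so a
        // type-aware constant is safer than a hard-coded 99e99:
        const float flt_sentinel = FLT_MAX;              // or std::numeric_limits<float>::max()
        std::printf("flt_sentinel       = %g\n", (double) flt_sentinel);
        return 0;
    }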
It's probably also important to point out that the Intel compiler already tries to use vector instructions automatically at -O2 and higher optimization levels. The compilation flag -vec-report should provide feedback on which loops were vectorized.
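For illustration, here is the kind of loop such reports cover: unit-stride with no aliasing, so the auto-vectorizer handles it cleanly. The compile line in the comment is from memory, so check it against your ICC version:

    // A trivially vectorizable loop. From memory, older ICC versions report the
    // vectorization decision with something like:
    //   icpc -O2 -vec-report2 -c scale_add.cpp
    // (newer releases moved to -qopt-report; check the compiler documentation).
    void scale_add(float* __restrict__ y, const float* __restrict__ x, float a, int n)
    {
        for (int i = 0; i < n; ++i)
            y[i] += a * x[i];
    }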
VLION-1.4-GCC
=== RELION MPI setup ===
+ Number of MPI processes = 16
+ Number of threads per MPI process = 16
+ Total number of threads therefore = 256
+ Master (0) runs on host = node01.cluster
+ Slave 1 runs on host = node02.cluster
+ Slave 2 runs on host = node03.cluster
+ Slave 6 runs on host = node07.cluster
+ Slave 5 runs on host = node06.cluster
=================
+ Slave 3 runs on host = node04.cluster
+ Slave 12 runs on host = node13.cluster
+ Slave 10 runs on host = node11.cluster
+ Slave 14 runs on host = node15.cluster
+ Slave 11 runs on host = node12.cluster
+ Slave 13 runs on host = node14.cluster
+ Slave 7 runs on host = node08.cluster
+ Slave 15 runs on host = node16.cluster
+ Slave 8 runs on host = node09.cluster
+ Slave 9 runs on host = node10.cluster
+ Slave 4 runs on host = node05.cluster
Running in single precision. Runs might not be exactly reproducible.
Expectation iteration 1
4.80/4.80 min ............................................................~~(,_,">
Expectation iteration 2
20.18/20.18 min ............................................................~~(,_,">
Expectation iteration 3
10.88/10.88 min ............................................................~~(,_,">
Expectation iteration 4
12.42/12.42 min ............................................................~~(,_,">
Expectation iteration 5
10.73/10.73 min ............................................................~~(,_,">
Expectation iteration 6
9.20/9.20 min ............................................................~~(,_,">
Expectation iteration 7
8.65/8.65 min ............................................................~~(,_,">
Expectation iteration 8
1.01/1.01 hrs ............................................................~~(,_,">>
Expectation iteration 9
1.01/1.01 hrs ............................................................~~(,_,">>
Expectation iteration 10
56.25/56.23 min ............................................................~~(,_,">
Expectation iteration 11
56.98/56.97 min ............................................................~~(,_,">
Expectation iteration 12
57.18/57.17 min ............................................................~~(,_,">
Expectation iteration 13
54.57/54.57 min ............................................................~~(,_,">
Expectation iteration 14
53.78/53.77 min ............................................................~~(,_,">
Expectation iteration 15
1.10/1.10 hrs ............................................................~~(,_,">>
Expectation iteration 16
4.82/8.96 hrs ................................~~(,_,">
RELION-1.4-ICC
=== RELION MPI setup ===
+ Number of MPI processes = 16
+ Number of threads per MPI process = 16
+ Total number of threads therefore = 256
+ Master (0) runs on host = node01.cluster
+ Slave 1 runs on host = node02.cluster
+ Slave 6 runs on host = node07.cluster
+ Slave 8 runs on host = node09.cluster
+ Slave 10 runs on host = node11.cluster
+ Slave 4 runs on host = node05.cluster
+ Slave 5 runs on host = node06.cluster
+ Slave 2 runs on host = node03.cluster
+ Slave 3 runs on host = node04.cluster
+ Slave 13 runs on host = node14.cluster
=================
+ Slave 7 runs on host = node08.cluster
+ Slave 12 runs on host = node13.cluster
+ Slave 14 runs on host = node15.cluster
+ Slave 15 runs on host = node16.cluster
+ Slave 11 runs on host = node12.cluster
+ Slave 9 runs on host = node10.cluster
Running in single precision. Runs might not be exactly reproducible.
Expectation iteration 1
2.45/2.45 min ............................................................~~(,_,">
Expectation iteration 2
10.02/10.02 min ............................................................~~(,_,">
Expectation iteration 3
5.87/5.87 min ............................................................~~(,_,">
Expectation iteration 4
6.48/6.48 min ............................................................~~(,_,">
Expectation iteration 5
5.68/5.68 min ............................................................~~(,_,">
Expectation iteration 6
5.92/5.92 min ............................................................~~(,_,">
Expectation iteration 7
5.18/5.18 min ............................................................~~(,_,">
Expectation iteration 8
4.55/4.55 min ............................................................~~(,_,">
Expectation iteration 9
4.38/4.38 min ............................................................~~(,_,">
Expectation iteration 10
26.60/26.60 min ............................................................~~(,_,">
Expectation iteration 11
26.98/26.98 min ............................................................~~(,_,">
Expectation iteration 12
27.55/27.55 min ............................................................~~(,_,">
Expectation iteration 13
26.63/26.63 min ............................................................~~(,_,">
Expectation iteration 14
25.52/25.52 min ............................................................~~(,_,">
Expectation iteration 15
27.67/27.67 min ............................................................~~(,_,">
Expectation iteration 16
3.96/3.96 hrs ............................................................~~(,_,">
Robert
--
Robert McLeod, Ph.D.
Center for Cellular Imaging and Nano Analytics (C-CINA)
Biozentrum der Universität Basel
Mattenstrasse 26, 4058 Basel
Office: +41.061.387.3225
________________________________________
From: Collaborative Computational Project in Electron cryo-Microscopy [[log in to unmask]] on behalf of Dimitry Tegunov [[log in to unmask]]
Sent: Sunday, January 24, 2016 10:38 PM
To: [log in to unmask]
Subject: [ccpem] Vectorized RELION
Hi Community,
The arguments about cluster time in our lab have become mildly annoying, so I sat down on Friday to vectorize the most expensive lines in RELION's code. Then I sat down again today to figure out the single-precision code path. Here are the results:
Using double-precision, I'm getting a rather consistent speed-up of 2.4x for 3D refinement, and a disappointing 1.3x for 2D classification. Contrary to my expectations, it is not affected by the number of threads (I thought memory IO would become a bottleneck much sooner). On a machine with 16 physical cores (2x E5-2640 v3), it remains the same for up to 16 threads. Once HT kicks in, it decreases to 2x for up to 32 threads.
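As a simplified illustration of the kind of kernel involved (not the actual vlion code), a squared-difference reduction written with AVX intrinsics might look like this in double precision:

    #include <immintrin.h>
    #include <cstddef>

    // Sum of squared differences over n doubles, four lanes at a time with AVX.
    // For brevity n is assumed to be a multiple of 4; a real kernel handles the tail.
    double ssd_avx(const double* a, const double* b, std::size_t n)
    {
        __m256d acc = _mm256_setzero_pd();
        for (std::size_t i = 0; i < n; i += 4)
        {
            __m256d va   = _mm256_loadu_pd(a + i);
            __m256d vb   = _mm256_loadu_pd(b + i);
            __m256d diff = _mm256_sub_pd(va, vb);
            acc = _mm256_add_pd(acc, _mm256_mul_pd(diff, diff));
        }
        double lanes[4];
        _mm256_storeu_pd(lanes, acc);
        return lanes[0] + lanes[1] + lanes[2] + lanes[3];
    }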
Single precision has been a bit of a unicorn in RELION. Normally, one would expect double the performance when going to fp32; for some unfortunate reason, computation time almost doubles in RELION's case instead, and I have no idea why. However, the vectorized single-precision code delivers 1.2x the performance of vectorized double precision. More specifically, the time spent on projection stays the same (already down from the original double-precision code), while the computation of squared differences decreases by 2x, as one would expect for a switch to fp32.
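To make the fp32 argument concrete: with 256-bit registers the same kind of kernel handles eight floats per iteration instead of four doubles, which is where a factor of 2 on the squared differences can come from (again a simplified illustration, not the actual vlion code):

    #include <immintrin.h>
    #include <cstddef>

    // Single-precision variant: eight floats per 256-bit register instead of four
    // doubles, so each iteration processes twice as many elements.
    float ssd_avx_f32(const float* a, const float* b, std::size_t n)
    {
        __m256 acc = _mm256_setzero_ps();
        for (std::size_t i = 0; i < n; i += 8)           // n assumed a multiple of 8
        {
            __m256 va   = _mm256_loadu_ps(a + i);
            __m256 vb   = _mm256_loadu_ps(b + i);
            __m256 diff = _mm256_sub_ps(va, vb);
            acc = _mm256_add_ps(acc, _mm256_mul_ps(diff, diff));
        }
        float lanes[8];
        _mm256_storeu_ps(lanes, acc);
        float sum = 0.0f;
        for (int k = 0; k < 8; ++k)
            sum += lanes[k];
        return sum;
    }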
Thus, going from original double-precision to vectorized single-precision provides a speed-up of ca. 2.8x.
Results stay identical for double-precision (within the precision displayed in the text output), and deviate around the 6th digit for single-precision. This is due to the numbers being added up in a slightly different order.
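The summation-order effect is easy to reproduce in isolation (a toy example, not from the RELION code):

    #include <cstdio>

    int main()
    {
        // Floating-point addition is not associative, so a vectorized reduction,
        // which accumulates partial sums in a different order, can change the
        // last digits of the result.
        float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
        std::printf("(a + b) + c = %.8g\n", (a + b) + c);   // prints 1
        std::printf("a + (b + c) = %.8g\n", a + (b + c));   // prints 0
        return 0;
    }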
You can get vlion from https://github.com/dtegunov/vlion. If you have git installed, just clone it; otherwise, use the 'Download ZIP' button. All the subsequent steps are the same as in RELION.
If you try it, please post some feedback in this thread!
Cheers,
Dimitry