Hi Community,
The arguments about cluster time in our lab have become mildly annoying, so I sat down on Friday to vectorize the most expensive lines in RELION's code. Then I sat down again today to figure out the single-precision code path. Here are the results:
Using double-precision, I'm getting a rather consistent speed-up of 2.4x for 3D refinement, and a disappointing 1.3x for 2D classification. Contrary to my expectations, the speed-up is not affected by the number of threads (I thought memory I/O would become a bottleneck much sooner). On a machine with 16 physical cores (2x E5-2640 v3), it remains the same for up to 16 threads. Once hyper-threading kicks in, it decreases to 2x for up to 32 threads.
Single-precision has been a bit of a unicorn in RELION. Normally, one would expect roughly double the performance when going to fp32, since each SIMD instruction processes twice as many single-precision values. For some unfortunate reason, computation time almost doubles in RELION's case instead. I have no idea why. However, the vectorized single-precision code delivers 1.2x the performance of vectorized double-precision. More specifically, the time spent on projection stays the same (down from twice the double-precision time in the original code), while the computation of squared differences speeds up by 2x, as one would expect for a switch to fp32.
Thus, going from original double-precision to vectorized single-precision provides a speed-up of ca. 2.8x.
Results stay identical for double-precision (within the precision displayed in the text output), and deviate around the 6th significant digit for single-precision. This is because the vectorized code adds the numbers up in a slightly different order, and floating-point addition is not associative.
You can get vlion from https://github.com/dtegunov/vlion. If you have git installed, just clone it; otherwise, use the 'Download ZIP' button. All the subsequent steps are the same as in RELION.
If you try it, please post some feedback in this thread!
Cheers,
Dimitry