Dimitry,
FYI, I wasn't able to get this to compile with the Intel C++ compiler; it works fine with GCC. Which version of Open MPI did you use?
I think it would have been helpful to upload the vanilla Relion-1.4 to GitHub first and then push all your changes on top, as that would make it more straightforward to examine the changes with git show and the web interface. I can do this by hand with shell tools, but I'm lazy.
I compared Relion-1.4-ICC with Vlion-1.4-GCC for a 3D auto-refinement. Relion-1.4 with the Intel compiler won handily on a dedicated cluster with InfiniBand. The Vlion run was a repeat of an earlier case, so I don't have any timings other than what Relion reports for the expectation steps. The fact that Vlion is very close to 2.0x slower suggests that it's not actually running in single precision. There are a bunch of 99e99 constants in Relion that you may want to replace with proper macros, like FLT_MAX and DBL_MAX from <cfloat> (or a project-defined equivalent). Hard-coded literals like that can cause issues with single-precision casting and/or function definitions, depending on the compiler and language standard; ICC generates warnings about them.
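To make the point concrete, here is a minimal sketch (the names are mine, not from the RELION source) of why a 99e99-style sentinel misbehaves once the build switches to single precision:

    // sketch.cpp: a 99e99-style sentinel versus single precision.
    #include <cfloat>
    #include <cstdio>

    int main()
    {
        const double sentinel = 99.e99;                  // fine as a double (DBL_MAX ~ 1.8e308)
        std::printf("sentinel           = %g\n", sentinel);
        std::printf("FLT_MAX            = %g\n", (double) FLT_MAX);            // ~3.4e38
        std::printf("sentinel > FLT_MAX : %d\n", sentinel > (double) FLT_MAX);

        // In a single-precision build the same literal overflows a float, so a
        // type-aware constant is safer than a hard-coded 99e99:
        const float flt_sentinel = FLT_MAX;              // or std::numeric_limits<float>::max()
        std::printf("flt_sentinel       = %g\n", (double) flt_sentinel);
        return 0;
    }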
It's probably also important to point out that the Intel compiler already tries to use vector instructions automatically at -O2 and higher optimization levels. The compilation flag -vec-report should provide feedback on which loops were vectorized.
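For illustration, here is the kind of loop such reports cover: unit-stride with no aliasing, so the auto-vectorizer handles it cleanly. The compile line in the comment is from memory, so check it against your ICC version:

    // A trivially vectorizable loop. From memory, older ICC versions report the
    // vectorization decision with something like:
    //   icpc -O2 -vec-report2 -c scale_add.cpp
    // (newer releases moved to -qopt-report; check the compiler documentation).
    void scale_add(float* __restrict__ y, const float* __restrict__ x, float a, int n)
    {
        for (int i = 0; i < n; ++i)
            y[i] += a * x[i];
    }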
VLION-1.4-GCC
=== RELION MPI setup ===
+ Number of MPI processes = 16
+ Number of threads per MPI process = 16
+ Total number of threads therefore = 256
+ Master (0) runs on host = node01.cluster
+ Slave 1 runs on host = node02.cluster
+ Slave 2 runs on host = node03.cluster
+ Slave 6 runs on host = node07.cluster
+ Slave 5 runs on host = node06.cluster
=================
+ Slave 3 runs on host = node04.cluster
+ Slave 12 runs on host = node13.cluster
+ Slave 10 runs on host = node11.cluster
+ Slave 14 runs on host = node15.cluster
+ Slave 11 runs on host = node12.cluster
+ Slave 13 runs on host = node14.cluster
+ Slave 7 runs on host = node08.cluster
+ Slave 15 runs on host = node16.cluster
+ Slave 8 runs on host = node09.cluster
+ Slave 9 runs on host = node10.cluster
+ Slave 4 runs on host = node05.cluster
Running in single precision. Runs might not be exactly reproducible.
Expectation iteration 1
4.80/4.80 min ............................................................~~(,_,">
Expectation iteration 2
20.18/20.18 min ............................................................~~(,_,">
Expectation iteration 3
10.88/10.88 min ............................................................~~(,_,">
Expectation iteration 4
12.42/12.42 min ............................................................~~(,_,">
Expectation iteration 5
10.73/10.73 min ............................................................~~(,_,">
Expectation iteration 6
9.20/9.20 min ............................................................~~(,_,">
Expectation iteration 7
8.65/8.65 min ............................................................~~(,_,">
Expectation iteration 8
1.01/1.01 hrs ............................................................~~(,_,">>
Expectation iteration 9
1.01/1.01 hrs ............................................................~~(,_,">>
Expectation iteration 10
56.25/56.23 min ............................................................~~(,_,">
Expectation iteration 11
56.98/56.97 min ............................................................~~(,_,">
Expectation iteration 12
57.18/57.17 min ............................................................~~(,_,">
Expectation iteration 13
54.57/54.57 min ............................................................~~(,_,">
Expectation iteration 14
53.78/53.77 min ............................................................~~(,_,">
Expectation iteration 15
1.10/1.10 hrs ............................................................~~(,_,">>
Expectation iteration 16
4.82/8.96 hrs ................................~~(,_,">
RELION-1.4-ICC
=== RELION MPI setup ===
+ Number of MPI processes = 16
+ Number of threads per MPI process = 16
+ Total number of threads therefore = 256
+ Master (0) runs on host = node01.cluster
+ Slave 1 runs on host = node02.cluster
+ Slave 6 runs on host = node07.cluster
+ Slave 8 runs on host = node09.cluster
+ Slave 10 runs on host = node11.cluster
+ Slave 4 runs on host = node05.cluster
+ Slave 5 runs on host = node06.cluster
+ Slave 2 runs on host = node03.cluster
+ Slave 3 runs on host = node04.cluster
+ Slave 13 runs on host = node14.cluster
=================
+ Slave 7 runs on host = node08.cluster
+ Slave 12 runs on host = node13.cluster
+ Slave 14 runs on host = node15.cluster
+ Slave 15 runs on host = node16.cluster
+ Slave 11 runs on host = node12.cluster
+ Slave 9 runs on host = node10.cluster
Running in single precision. Runs might not be exactly reproducible.
Expectation iteration 1
2.45/2.45 min ............................................................~~(,_,">
Expectation iteration 2
10.02/10.02 min ............................................................~~(,_,">
Expectation iteration 3
5.87/5.87 min ............................................................~~(,_,">
Expectation iteration 4
6.48/6.48 min ............................................................~~(,_,">
Expectation iteration 5
5.68/5.68 min ............................................................~~(,_,">
Expectation iteration 6
5.92/5.92 min ............................................................~~(,_,">
Expectation iteration 7
5.18/5.18 min ............................................................~~(,_,">
Expectation iteration 8
4.55/4.55 min ............................................................~~(,_,">
Expectation iteration 9
4.38/4.38 min ............................................................~~(,_,">
Expectation iteration 10
26.60/26.60 min ............................................................~~(,_,">
Expectation iteration 11
26.98/26.98 min ............................................................~~(,_,">
Expectation iteration 12
27.55/27.55 min ............................................................~~(,_,">
Expectation iteration 13
26.63/26.63 min ............................................................~~(,_,">
Expectation iteration 14
25.52/25.52 min ............................................................~~(,_,">
Expectation iteration 15
27.67/27.67 min ............................................................~~(,_,">
Expectation iteration 16
3.96/3.96 hrs ............................................................~~(,_,">
Robert
--
Robert McLeod, Ph.D.
Center for Cellular Imaging and Nano Analytics (C-CINA)
Biozentrum der Universität Basel
Mattenstrasse 26, 4058 Basel
Office: +41.061.387.3225
________________________________________
From: Collaborative Computational Project in Electron cryo-Microscopy [[log in to unmask]] on behalf of Dimitry Tegunov [[log in to unmask]]
Sent: Sunday, January 24, 2016 10:38 PM
To: [log in to unmask]
Subject: [ccpem] Vectorized RELION
Hi Community,
The arguments about cluster time in our lab have become mildly annoying, so I sat down on Friday to vectorize the most expensive lines in RELION's code. Then I sat down again today to figure out the single-precision code path. Here are the results:
Using double-precision, I'm getting a rather consistent speed-up of 2.4x for 3D refinement, and a disappointing 1.3x for 2D classification. Contrary to my expectations, it is not affected by the number of threads (I thought memory IO would become a bottleneck much sooner). On a machine with 16 physical cores (2x E5-2640 v3), it remains the same for up to 16 threads. Once HT kicks in, it decreases to 2x for up to 32 threads.
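As a simplified illustration of the kind of kernel involved (not the actual vlion code), a squared-difference reduction written with AVX intrinsics might look like this in double precision:

    #include <immintrin.h>
    #include <cstddef>

    // Sum of squared differences over n doubles, four lanes at a time with AVX.
    // For brevity n is assumed to be a multiple of 4; a real kernel handles the tail.
    double ssd_avx(const double* a, const double* b, std::size_t n)
    {
        __m256d acc = _mm256_setzero_pd();
        for (std::size_t i = 0; i < n; i += 4)
        {
            __m256d va   = _mm256_loadu_pd(a + i);
            __m256d vb   = _mm256_loadu_pd(b + i);
            __m256d diff = _mm256_sub_pd(va, vb);
            acc = _mm256_add_pd(acc, _mm256_mul_pd(diff, diff));
        }
        double lanes[4];
        _mm256_storeu_pd(lanes, acc);
        return lanes[0] + lanes[1] + lanes[2] + lanes[3];
    }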
Single precision has been a bit of a unicorn in RELION. Normally, one would expect double the performance when going to fp32; for some unfortunate reason, computation time almost doubles in RELION's case instead, and I have no idea why. However, the vectorized single-precision code delivers 1.2x the performance of vectorized double precision. More specifically, the time spent on projection stays the same (already down from the original double-precision code), while the computation of squared differences decreases by 2x, as one would expect for a switch to fp32.
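To make the fp32 argument concrete: with 256-bit registers the same kind of kernel handles eight floats per iteration instead of four doubles, which is where a factor of 2 on the squared differences can come from (again a simplified illustration, not the actual vlion code):

    #include <immintrin.h>
    #include <cstddef>

    // Single-precision variant: eight floats per 256-bit register instead of four
    // doubles, so each iteration processes twice as many elements.
    float ssd_avx_f32(const float* a, const float* b, std::size_t n)
    {
        __m256 acc = _mm256_setzero_ps();
        for (std::size_t i = 0; i < n; i += 8)           // n assumed a multiple of 8
        {
            __m256 va   = _mm256_loadu_ps(a + i);
            __m256 vb   = _mm256_loadu_ps(b + i);
            __m256 diff = _mm256_sub_ps(va, vb);
            acc = _mm256_add_ps(acc, _mm256_mul_ps(diff, diff));
        }
        float lanes[8];
        _mm256_storeu_ps(lanes, acc);
        float sum = 0.0f;
        for (int k = 0; k < 8; ++k)
            sum += lanes[k];
        return sum;
    }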
Thus, going from original double-precision to vectorized single-precision provides a speed-up of ca. 2.8x.
Results stay identical for double-precision (within the precision displayed in the text output), and deviate around the 6th digit for single-precision. This is due to the numbers being added up in a slightly different order.
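The summation-order effect is easy to reproduce in isolation (a toy example, not from the RELION code):

    #include <cstdio>

    int main()
    {
        // Floating-point addition is not associative, so a vectorized reduction,
        // which accumulates partial sums in a different order, can change the
        // last digits of the result.
        float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
        std::printf("(a + b) + c = %.8g\n", (a + b) + c);   // prints 1
        std::printf("a + (b + c) = %.8g\n", a + (b + c));   // prints 0
        return 0;
    }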
You can get vlion from https://github.com/dtegunov/vlion. If you have git installed, just clone it; otherwise, use the 'Download ZIP' button. All the subsequent steps are the same as in RELION.
If you try it, please post some feedback in this thread!
Cheers,
Dimitry