Hi Dimitry,
if you want to get the best out of ICC, try:
CXX="icpc" CXXFLAGS="-O3 -xHOST" \
CC="icc" CFLAGS="-O3 -xHOST" \
FC="ifort" FCFLAGS="-O3 -xHOST" \
LIBS="-lrt -lutil" \
./INSTALL.sh -j 8 2>&1 | tee install.log
The LIBS= line depends on your Linux distribution.
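Note that -xHOST makes icc target whatever instruction sets the build machine supports, so the resulting binary may not run on older nodes. A minimal sketch (Linux-only; the /proc/cpuinfo flag names are the usual x86 ones, an assumption on my part) to see what -xHOST will pick up:

```shell
# Sketch: report which SIMD instruction sets this CPU exposes, so you
# know roughly what icc's -xHOST will auto-select (Linux /proc only).
check_isa() {
    for isa in sse4_2 avx avx2 avx512f; do
        if grep -qw "$isa" /proc/cpuinfo; then
            echo "$isa: yes"
        else
            echo "$isa: no"
        fi
    done
}
check_isa
```

If the compute nodes are older than the build node, replace -xHOST with an explicit target matching the weakest node.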
We ran a lot of experiments with GCC, ICC, OpenMPI, MVAPICH2 and hwloc:
MPI threads, system threads, socket bindings etc., but the results vary
with data size and between RELION's 2D and 3D programs.
So in the end we use the Intel build with OpenMPI at one site
(10G Ethernet) and the Intel build with MVAPICH2 at another site (InfiniBand).
We also use two different submission/run templates: one for 3D (with hwloc) and one for everything else.
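For illustration only, the kind of binding such a template sets up might look like this (OpenMPI syntax; the rank count and arguments are placeholders, not our actual template):

```shell
# Illustrative fragment, not our real template: spread ranks across
# sockets and pin each rank to a core so the OS can't migrate them.
mpirun -np 16 --map-by socket --bind-to core \
    relion_refine_mpi --o Refine3D/run1 ...
```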
Especially in 3D we sometimes run out of main memory (6 GB/core), so we
have to cut down the number of cores used per node.
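A quick sketch of how we reason about that (Linux-only; the 6 GB/rank figure is the one from our setup above, everything else is generic):

```shell
# Sketch: cap the ranks per node so each rank keeps ~6 GB of RAM.
mem_per_rank_gb=6
total_gb=$(awk '/MemTotal/ {printf "%d", $2/1048576}' /proc/meminfo)
cores=$(nproc)
max_by_mem=$((total_gb / mem_per_rank_gb))
# use whichever limit bites first: memory or physical cores
ranks=$(( max_by_mem < cores ? max_by_mem : cores ))
echo "usable ranks per node: $ranks"
```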
cheers,
wolfgang
On 01/29/2016 10:30 PM, Dimitry Tegunov wrote:
> Hi Everyone,
>
> after some extensive testing, I would like to comment on the questions raised this week. Disclaimer: this was my first time using AVX and Intel's compiler -- I would fully expect more experienced people to do better. Everything tested locally on 2x E5-2640v3, Ubuntu 14.04, GCC 5.2.1, OpenMPI 1.10.2, ICC 16.0.1.
>
> Different resolution/results: I knocked myself out trying to trace this discrepancy back to my changes. In every deterministic scenario, vlion matches RELION almost down to machine precision. However, non-deterministic scenarios in the original RELION (i.e. those using MPI parallelization) are more random than expected. This has nothing to do with my changes, but seems to be by design. As mentioned a few days ago, multiple runs with identical parameters start diverging after the first iteration *even when using double precision*. A colleague was kind enough to test this on her current project using our cluster (RELION 1.4 fp64, Intel MPI). I'm attaching a plot of the intermediate resolutions of 3 identical runs. Please note the different number of iterations to convergence, and the final unmasked resolution in one of them differing by 1 shell.
>
> I think this behavior is perfectly fine since everything converges to the same result within the FSC's precision. However, it is unrelated to vlion.
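[Editor's aside: the divergence described above is consistent with floating-point addition being non-associative, so a reduction whose grouping depends on MPI scheduling can legitimately differ between identical runs. A minimal illustration, not RELION code:]

```shell
# Floating-point addition is not associative: the same three terms
# summed in two groupings give two different doubles.
awk 'BEGIN {
    s1 = (0.1 + 0.2) + 0.3;   # one reduction order
    s2 = 0.1 + (0.2 + 0.3);   # another order
    printf "%.17g\n%.17g\nequal=%d\n", s1, s2, (s1 == s2)
}'
# prints equal=0: the two sums differ in the last bits
```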
>
> Intel's compiler: Vectorization happens in all the relevant loops in ml_optimiser.cpp, but is absent from the bi-/trilinear interpolation in projector.cpp (maybe because it's not a loop). For some reason, ICC's vectorized loop is still slower than mine, but certainly faster than the original. Overall, I see Intel's version run 1.3x faster than the original in double-precision. In single-precision, the unexplained slow-down still happens, although it comes out faster than the original.
>
> I didn't have any problems compiling vlion with ICC. However, in double precision the time spent on projection went through the roof, making it ca. 2x slower than ICC fp64 RELION. This doesn't happen in single precision, i.e. it's 2.7x faster than ICC fp32 RELION, but a bit slower than GCC fp32 vlion.
>
> Robert, I'm afraid I don't have a good explanation for the difference between our results. I used ICC with the original install script, no extra flags. This seemed to employ everything up to AVX2, as explicitly setting -mavx made it a bit slower. Is it possible that you actually compiled vlion with ICC and in fp64 by accident?
>
> Cheers,
> Dimitry
--
Universitätsklinikum Hamburg-Eppendorf (UKE)
@ Centre for Structural Systems Biology (CSSB)
@ Institute of Molecular Biotechnology (IMBA)
Dr. Bohr-Gasse 3-7 (Room 6.14)
1030 Vienna, Austria
Tel.: +43 (1) 790 44-4649
Email: [log in to unmask]
http://www.cssb-hamburg.de/
--
_____________________________________________________________________
Universitätsklinikum Hamburg-Eppendorf; Körperschaft des öffentlichen Rechts; Gerichtsstand: Hamburg | www.uke.de
Vorstandsmitglieder: Prof. Dr. Burkhard Göke (Vorsitzender), Prof. Dr. Dr. Uwe Koch-Gromus, Joachim Prölß, Rainer Schoppik
_____________________________________________________________________