Hi Ian,
I have no experience with EC2, but you can always compare your results
with the precalculated ones that come with the tutorial. Sometimes it
helps to execute exactly the same job as the precalculated one to find
problems.
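
One way to make such a comparison concrete is to count the diagnostic messages in each run's log and set the numbers side by side with those from the precalculated job. The following is only a minimal sketch in Python: the two search patterns are copied from the log excerpts quoted below, while the function name, the pattern labels and the log paths are hypothetical and not part of RELION.

```python
import re
from collections import Counter

# Patterns taken from the messages quoted in this thread; the labels
# on the left are made up for this sketch.
PATTERNS = {
    "norm_correction_warning": re.compile(r"WARNING: norm_correction="),
    "sigma2_error": re.compile(r"unexpectedly small, yet non-zero sigma2"),
}


def count_messages(log_text):
    """Count how many lines of a log match each diagnostic pattern."""
    counts = Counter()
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Hypothetical usage (the paths are placeholders, not real files):
# for path in ["Refine3D/my_run.out", "Refine3D/precalculated_run.out"]:
#     with open(path, errors="replace") as handle:
#         print(path, dict(count_messages(handle.read())))
```

Large differences in these counts between two runs of the same job would point to where the runs start to diverge, although the counts alone do not explain the cause.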
HTH,
Sjors
On 06/10/2016 04:39 PM, Ian Tickle wrote:
> Hello, I have been running some benchmarks on the Amazon Cloud using RELION
> and the beta-gal tutorial data. I am using the 'Cryo-EM in the Cloud' AMI
> from Michael Cianfrocco & Andres Leschziner. The only problem is that it
> is based on Ubuntu 13.04, which is not a long-term support (LTS) release,
> so I should upgrade to 14.04 (at least!). Jose Miguel de la Rosa Trevin has
> very kindly made a 14.04 AMI for us to use, but I haven't tried it yet
> since it doesn't have StarCluster installed.
>
> So basically I'm running exactly the same script on AWS-EC2 clusters of
> 'm4.10xlarge' instances (each with 20 cores = 40 vCPUs & 160 GB RAM), but varying
> only the cluster size, the number of MPI processes and the number of
> threads. The script I'm using is based on one from Martyn Winn, and
> typically looks like:
>
> mpirun -n 15 -x LD_LIBRARY_PATH -x PATH \
> --prefix /home/EM_Packages/openmpi -hostfile ~/hosts --map-by node \
> --bind-to none time `which relion_refine_mpi` --o Refine3D/Run22-m4b \
> --auto_refine --split_random_halves --i \
> particles_autopick_sort_class2d_class3d.star --particle_diameter 200 \
> --angpix 3.54 --ref 3i3e_lp50A.mrc --firstiter_cc --ini_high 50 --ctf \
> --ctf_corrected_ref --flatten_solvent --zero_mask --oversampling 1 \
> --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 \
> --offset_step 2 --sym D2 --low_resol_join_halves 40 --norm --scale --j 12 \
> --memory_per_thread 2 >Refine3D/Run18-m4a.out 2>&1
>
> The above script was run on a cluster of 5 instances (100 cores = 200
> vCPUs); others were run on clusters of up to 16 instances with #MPI up to
> 59 and #threads = 6 or 12. The problem is that, even when running the
> identical script on an identically configured cluster, the job sometimes
> switches after a while to a mode where it uses 1 thread per process and
> then runs for ~3 hours. Other times it seems to work properly and
> typically finishes in 5 to 20 mins. I always see loads of these warnings:
>
> WARNING: norm_correction= 13.2069 for particle 5801 in group 14; Are your
> groups large enough?
>
> I see the above in all the log files, but the number of these warnings
> varies enormously even from identical input data. For example the log file
> from the first run (which otherwise seems to have worked) contained ~ 6500
> of the above lines, the second run using the identical script contained ~
> 12800, and a third run again with identical input contained ~ 13000. Other
> runs have up to 65000 of these warnings! The second run seems to have
> failed with this error:
>
> DIRECT_A1D_ELEM(sigma2, i)= 5.53033e-37
> BackProjector::reconstruct: ERROR: unexpectedly small, yet non-zero sigma2
> value, this should not happen...a
> File: src/backprojector.cpp line: 867
> DIRECT_A1D_ELEM(sigma2, i)= 5.53033e-37
> BackProjector::reconstruct: ERROR: unexpectedly small, yet non-zero sigma2
> value, this should not happen...a
> File: src/backprojector.cpp line: 867
>
> It's the "this should not happen" that worries me! The second run doesn't
> actually crash but after a while goes into a mode where it's using only 1
> thread per process (100% CPU) and then will run for ~ 3 hours (unless I
> kill it first!).
>
> Maybe just upgrading the OS will fix this?
>
> Cheers
>
> -- Ian
>
--
Sjors Scheres
MRC Laboratory of Molecular Biology
Francis Crick Avenue, Cambridge Biomedical Campus
Cambridge CB2 0QH, U.K.
tel: +44 (0)1223 267061
http://www2.mrc-lmb.cam.ac.uk/groups/scheres