Hi,
I have now found a logfile where the motioncorr job got stuck during the re-test. No other output files exist for this micrograph:
---
$ cat FoilHole_7001963_Data_6972814_6972816_20200908_040924_fractions.log
Working on Movies/FoilHole_7001963_Data_6972814_6972816_20200908_040924_fractions.tiff with 23 thread(s).
Movie size: X = 5760 Y = 4092 N = 45
Frames to be used: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
Frame grouping: n_frames = 45, requested group size = 1
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 |
interpolate_shifts = 0
---
The movie itself displays fine with relion_display.
Each node has 48 hyperthreaded Intel cores and 768 GB of main memory, and the user submitted with 8 (nodes) * 48 = 384 MPI ranks and 23 threads per rank, i.e. up to 8,832 threads on 384 logical CPUs (heavy oversubscription?).
The command line was:
mpirun --map-by node --mca opal_warn_on_missing_libcuda 0 \
`which relion_run_motioncorr_mpi` --i Import/job001/movies.star --o MotionCorr/job139/ --first_frame_sum 1 --last_frame_sum -1 --use_own --j 23 --bin_factor 1 --bfactor 150 --dose_per_frame 1.71 --preexposure 0 --patch_x 5 --patch_y 5 --dose_weighting --pipeline_control MotionCorr/job139/
---
run.out ends with:
* Movies/FoilHole_7007127_Data_7008058_7008060_20200908_134950_fractions.tiff
Correcting beam-induced motions using our own implementation ...
26.40/26.40 min ............................................................~~(,_,">
---
After a while, when the job had still not finished, the user stopped it with scancel, which is probably where the entry in run.err comes from:
mpirun: Forwarding signal 18 to job
slurmstepd: error: *** JOB 5538280 ON max-cssb012 CANCELLED AT 2020-09-18T14:51:49 ***
---
Below is the slurm report for this job:
User      JobID         JobName   Partition  State       Timelimit   Start                End                  Elapsed   MaxRSS  MaxVMSize  NNodes  NCPUS  NodeList
--------  ------------  --------  ---------  ----------  ----------  -------------------  -------------------  --------  ------  ---------  ------  -----  ---------------------
XXXXXXXX  5538280       relion3i  uke        CANCELLED+  7-00:00:00  2020-09-18T01:29:27  2020-09-18T14:51:49  13:22:22                     8       384    max-cssb[012,020-026]
          5538280.bat+  batch                CANCELLED               2020-09-18T01:29:27  2020-09-18T14:51:50  13:22:23  16495K  803292K    1       48     max-cssb012
          5538280.0     orted                FAILED                  2020-09-18T01:29:29  2020-09-18T14:51:53  13:22:24  15079K  596244K    7       7      max-cssb[020-026]
---
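(For the record, the report above was generated with something along these lines; the exact sacct call is an assumption on my part, but the field names match the columns shown:)
---
sacct -j 5538280 --format=User,JobID,JobName,Partition,State,Timelimit,Start,End,Elapsed,MaxRSS,MaxVMSize,NNodes,NCPUS,NodeList
---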
I have now tried the same job settings with RELION built with gcc8, gcc4 and intel2020, and with openmpi4 and mvapich2, and I cannot reproduce the error.
So I am still clueless as to what made the process get stuck on the node.
One thing I have not tested yet: max-cssb012 has an Intel Gold 6126 processor, while the other nodes used have a slightly slower Gold 6226.
But mixing these nodes has normally caused no problems before.
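For reference, the socket-bound launch (configuration 2 in my first mail below) that did not show the problem looked roughly like this; take it as a sketch with illustrative rank/thread counts, not the exact values we used:
---
# one MPI rank per CPU socket, bound to its socket; threads fill the socket's cores
mpirun --map-by ppr:1:socket --bind-to socket \
`which relion_run_motioncorr_mpi` --i Import/job001/movies.star --o MotionCorr/job139/ --use_own --j 24 --patch_x 5 --patch_y 5 --dose_weighting --pipeline_control MotionCorr/job139/
---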
Regards,
Wolfgang
----- Original Message -----
From: "Takanori Nakane" <[log in to unmask]>
To: "Lugmayr, Wolfgang" <[log in to unmask]>, "CCPEM Mailinglist" <[log in to unmask]>
Cc: "Toby Darling" <[log in to unmask]>
Sent: Friday, 18 September, 2020 12:34:55
Subject: Re: [ccpem] Relion 3.1.0 motioncorr with large dataset problems
Hi,
I think this is not a RELION problem but a bug in the BeeGFS kernel driver.
The LMB cluster has been struggling with this for more than two years.
It is triggered by the heavy load generated by RELION's motion correction,
but other tasks can also trigger it (e.g. Bayesian Polishing or even
non-RELION programs).
When this happens, the process becomes unkillable (stuck in the D state,
i.e. uninterruptible sleep) and somehow corrupts the kernel process table.
"ps aux" will stall when it accesses the relevant /proc entries, and
attaching to the process (strace -p or gdb) also deadlocks.
The only way to recover is to reboot the node.
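If you want to confirm the symptom, generic Linux diagnostics like the ones below can help (a sketch, nothing RELION-specific; ps itself may hang once the process table is corrupted, and the last two need root):

  # list processes stuck in uninterruptible sleep (D state)
  ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'
  # dump the kernel stack of a suspect PID; if even this read hangs,
  # the entry is wedged and only a reboot will clear it
  cat /proc/<PID>/stack
  # ask the kernel to log all blocked tasks to dmesg (needs sysrq enabled)
  echo w > /proc/sysrq-trigger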
I added Toby to CC; he maintains our cluster and has investigated this issue.
> You describe in the tutorial that the RELION motioncorr threads work on specific frames.
> So which MPI process loads a given movie in this case and distributes the workload?
Each MPI process works on one movie at a time. The frames are read and processed (e.g. FFT) in parallel by threads,
but the assignment of frames to threads is stochastic.
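You can see this division of labour on a node while a job is running with standard tools; for example (nothing RELION-specific, and <PID> is a placeholder):

  # one line per motioncorr rank, with its current thread count (NLWP)
  ps -o pid,nlwp,pcpu,comm -p $(pgrep -d, -f relion_run_motioncorr)
  # or watch the worker threads of a single rank live
  top -H -p <PID>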
Best regards,
Takanori Nakane
On 2020/09/18 11:19, Lugmayr, Wolfgang wrote:
> Dear developers,
>
> We are having problems with a large dataset (30,000+ movies) and RELION's own implementation of motion correction.
> To track down the problem I need a better understanding of how RELION loads the data in this step.
>
> The RELION version is 3.1 with all git updates as of yesterday, compiled with the Intel 2020 compiler, ALTCPU=on, no CUDA.
> The movies are in the original EPU folder structure, linked into one Movies directory containing 60,000+ files (movies plus coordinate .star files from the external crYOLO picker).
>
> The problem is that some relion processes get stuck on the nodes and cannot be killed by SLURM, so the nodes need a reboot. Even ps cannot display the process line.
> The same relion version is used by multiple users without problems.
>
> So far we have tested on 8 nodes with 48 hyperthreaded cores each:
> 1.) MPI = 8*48 and threads = 1 (--map-by node)
> 2.) MPI = number of CPU sockets + 1 and threads = a useful value within each node (MPI socket binding)
> Currently it seems that only configuration 1 causes the problem with the big dataset.
>
> You describe in the tutorial that the RELION motioncorr threads work on specific frames.
> So which MPI process loads a given movie in this case and distributes the workload?
>
> And could the data loading step be a bottleneck if the filesystem is slow or a storage head is overloaded?
> Our data are distributed via BeeGFS over 5 storage heads serving 3 PB+.
>
> Typical projects with fewer than 10,000 movies do not have this problem.
>
> Thanks in advance for your help,
> Regards,
> Wolfgang
>
>