Relion users and Sjors:
I am using a version of Relion (1.4 beta) that Sjors provided to help deal with a memory issue when making the movie.star file
for a larger data set (42K particles).
The next step in processing is to align each ptcl within its movie stack file and create a data.star file containing this information.
That file is then used in the subsequent step to polish the ptcls, which actually creates new ptcl files (as mrcs).
Since I have a mixed cluster with older and newer nodes, I run memory-intensive jobs on a queue with the new nodes,
which have 12 cores and 2.7 GB/core, so a total of 32 GB available. These nodes were used to run the final
3D refinement of the unshiny ptcls successfully, so there is enough memory for that step.
However, the next step with the movie ptcls stopped after about 12 hrs with the usual cryptic error:
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 28201 on node compute-0-7.local exited on signal 9 (Killed).
--------------------------------------------------------------------------
which in the past I have associated with a memory issue or some unspecified cluster issue with Relion.
We have run these jobs with other data sets before, but the data sets were smaller (usually about 20K ptcls).
We often just run the job again when it crashes and if we are lucky it goes through to completion.
Is there any way to troubleshoot the problem and solve it? We really can't buy more memory for our cluster.
Is it possible to split the optimizer and movie star files each into 2 files with the same headers, and run "smaller jobs" and then combine the
final data.star files before the next step in processing?
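
To make the question concrete, the kind of split I have in mind could be sketched roughly as below. This is only an illustration, not something Relion provides: split_star is a hypothetical helper name, and it assumes a STAR file with a single loop_ table (header lines, then _rln label lines, then one row per line). It would not apply as-is to a file made of several blocks or of key/value pairs. Whether Relion would accept the resulting pieces is exactly what I am asking.

```python
def split_star(text, n_parts=2):
    """Split a single-table STAR file into n_parts texts that share one header.

    Assumption: everything up to and including the last '_label' line (plus any
    blank lines after it) is the header; every later non-blank line is a row.
    """
    header, rows = [], []
    seen_label = False
    in_rows = False
    for line in text.splitlines():
        stripped = line.strip()
        if not in_rows:
            if stripped.startswith("_"):
                seen_label = True
            elif seen_label and stripped:
                # First non-blank, non-label line after the labels: rows begin.
                in_rows = True
        if in_rows:
            if stripped:
                rows.append(line)
        else:
            header.append(line)

    # Contiguous chunks, so neighbouring rows stay together in the same file.
    size = -(-len(rows) // n_parts)  # ceiling division
    parts = []
    for i in range(n_parts):
        chunk = rows[i * size:(i + 1) * size]
        parts.append("\n".join(header + chunk) + "\n")
    return parts


if __name__ == "__main__":
    demo = (
        "data_\n\nloop_\n"
        "_rlnImageName #1\n_rlnMicrographName #2\n"
        "a 1\nb 2\nc 3\n"
    )
    for part in split_star(demo, 2):
        print(part)
```

Each piece would be written to its own file and run as a separate job. Combining the output data.star files afterwards would be the reverse: keep one header and concatenate the rows.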
C Akey