Dear Chris, Sjors and all others on this list,
We have had similar problems where our refinement jobs quit prematurely.
The admin for our in-house cluster had a monitor for the sizes of packets
passing through the switch boxes and blamed the limited RAM in our system
as the culprit. But the size of the passing packets topped out at < 300
MB, which should not lead to a premature quit. Last week I moved one
refinement job to the TACC STAMPEDE supercomputer, where I could use four
1 TB RAM nodes to test whether it was the limited memory that caused the
problem. The result was that with very large RAM, the refinement was able
to finish without a glitch. On TACC, I had four nodes; each node has 16
cores, each core launches 64 threads, and there is 1.0 GB RAM per thread.
At home, we recently spent some money to upgrade all nodes to either 128
GB or 256 GB RAM, and when we ran the job, we limited it to 2 cores per
node, 12 threads per core, and 4 GB per thread. We had much more RAM left
over than on TACC, but we still encountered the premature quit, usually
after 7-8 iterations of refinement, and always, as Chris pointed out, at
the end of an iteration. It is quite puzzling. We would like to know what
is causing it and how to manage it, so that we don't have to use the TACC
supercomputer if possible.
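For reference, the per-node limits described above can be expressed with an Open MPI hostfile; here is a minimal sketch (node names and counts are placeholders for your own cluster, and `--j` is Relion's threads-per-process flag):

```shell
# Hypothetical hostfile capping MPI slots (processes) per node;
# node names are placeholders for your own cluster.
cat > hostfile.txt <<'EOF'
node01 slots=2
node02 slots=2
EOF

# Launch sketch: 4 ranks total, 12 threads each via Relion's --j flag.
# The command is echoed rather than executed; adjust paths and the
# remaining relion_refine_mpi arguments to your own job.
echo mpirun -np 4 --hostfile hostfile.txt relion_refine_mpi --j 12
```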
BTW, we have been using Relion 1.3 and Relion 1.2 with a patch for helical
reconstruction.
Thanks for sharing your thoughts and experience.
Qiu-Xing
On 5/20/15, 7:41 PM, "Christopher Akey" <[log in to unmask]> wrote:
>Sjors and users-
>
>My previous refinements were running fine on 2 different clusters, but
>with occasional instances where the job would just stop, usually at the
>end of an iteration; one could just continue the job and would never
>understand why it stopped. The errors often involve an early or improper
>exit by a process.
>
>After tweaking my openMPI submit command using -hostfile to use as many
>cores as possible without having the job quit early due to insufficient
>memory, the job ran fine for the entire 3D refine iteration but crashed
>during the less memory-intensive maximization (as below). It is very
>hard to know what to try next without understanding the cause of an
>"improper" exit.
>
>This job only differs from previous ones in that it is using a mask, but
>this hardly seems a likely problem.
>
>To conserve memory I was using 36 of the 120 cores on the cluster, with
>the jobs spread amongst them using a hostfile to specify the number of
>slots/cores per node based on how much memory each node has. Node1,
>which exited improperly, belongs to the group with less memory, and thus
>was running proportionately fewer processes.
>
>Any suggestions??? I'd like to understand this problem.
>
>C Akey
>
>
>Expectation iteration 7
>3.78/3.78 hrs
>............................................................~~(,_,">
> Averaging half-reconstructions up to 40 Angstrom resolution to prevent
>diverging orientations ...
> Note that only for higher resolutions the FSC-values are according to
>the gold-standard!
> Calculating gold-standard FSC ...
> Maximization ...
>000/??? sec ~~(,_,">
>
>--------------------------------------------------------------------------
>
>mpirun has exited due to process rank 2 with PID 6982 on
>node compute-0-1.local exiting improperly. There are two reasons this
>could occur:
>
>1. this process did not call "init" before exiting, but others in
>the job did. This can cause a job to hang indefinitely while it waits
>for all processes to call "init". By rule, if one process calls "init",
>then ALL processes must call "init" prior to termination.
>
>2. this process called "init", but exited without calling "finalize".
>By rule, all processes that call "init" MUST call "finalize" prior to
>exiting or it will be considered an "abnormal termination"
>
>This may have caused other processes in the application to be
>terminated by signals sent by mpirun (as reported here).
>
>--------------------------------------------------------------------------
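The two failure modes in the mpirun message above correspond to the lifecycle every MPI rank must follow; a minimal C sketch of that contract (illustrative only, not Relion code) looks like this:

```c
/* Minimal sketch of the MPI lifecycle mpirun enforces: every rank must
   call MPI_Init before doing work and MPI_Finalize before exiting.
   Skipping MPI_Init (reason 1) can hang the job while other ranks wait;
   exiting without MPI_Finalize (reason 2) is reported as an "abnormal
   termination", as in the message above. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d finished cleanly\n", rank);
    MPI_Finalize();
    return 0;
}
```

A rank that crashes or is killed (for example by the out-of-memory killer) between MPI_Init and MPI_Finalize will produce exactly the "exiting improperly" report shown above, which is why a memory shortfall is a common, if not the only, suspect.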
________________________________
UT Southwestern
Medical Center
The future of medicine, today.