Hi Steven,

My experience with exit code 11 is that it indicates your job is hitting a segmentation fault. There are a few different conventions out there for setting up SLURM jobs, but for RELION jobs I typically specify the number of tasks, CPUs per task, and memory per CPU, so I know exactly how much RAM each MPI rank has to work with. I also tend to specify error and output file locations, as these can be useful for troubleshooting (if you get a seg fault, the error file will say that you ran out of contiguous memory).
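
For reference, the "EXIT CODE: 139" in your log points the same way: when a process is killed by signal N, the exit status typically reported is 128 + N, so 139 corresponds to signal 11 (SIGSEGV). If you want to double-check that mapping yourself on the login node (just a sanity check, nothing RELION- or SLURM-specific):

# 139 - 128 = 11; kill -l translates a signal number into its name
kill -l 11    # should print SEGV on most systems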

Here’s a sample submission script I use for 3D refinement, for comparison. Obviously some aspects are specific to the system you’re using, but hopefully this framework is helpful to you.

Best,
Gavin

#!/bin/bash
#SBATCH --job-name=Refine3D
#SBATCH --nodes=6
#SBATCH --ntasks=24
#SBATCH --mem-per-cpu=15g
#SBATCH --cpus-per-task=5
#SBATCH --ntasks-per-node=4
#SBATCH --partition=gpu
#SBATCH --gres=gpu:tesla:2
#SBATCH --error=/path/to/job/relion_run.err
#SBATCH --output=/path/to/job/relion_run.out
#SBATCH --mail-type=ALL
#SBATCH --mail-user=[log in to unmask]

module load gnu/5.4.0
module load mvapich2/2.2
module load cuda/9.2.148.1
module load relion/3.0_mvapich_enabled_mpi
module load pmix

srun --mpi=pmi2 `which relion_refine_mpi` (refinement command from relion)
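
To spell out the resource arithmetic in that script (a rough sketch; the exact accounting will depend on how SLURM is configured on your system):

# Per MPI rank:  --cpus-per-task=5  x  --mem-per-cpu=15g  ->  75 GB of RAM per rank
# Per node:      --ntasks-per-node=4 ranks sharing --gres=gpu:tesla:2 (2 GPUs)
# Whole job:     --ntasks=24 ranks across --nodes=6, i.e. 12 GPUs in total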



From: Collaborative Computational Project in Electron cryo-Microscopy <[log in to unmask]> on behalf of "Markus,Steven" <[log in to unmask]>
Reply-To: "Markus,Steven" <[log in to unmask]>
Date: Thursday, 11 April 2019 at 10:34 am
To: "[log in to unmask]" <[log in to unmask]>
Subject: [ccpem] MPI job on relion with 3D auto-refine

Hello all,

I’ve made some progress on my MPI run problems with RELION on an HPC. I can run MPI jobs for at least 2D class averaging and 3D classification. However, when I try to run a 3D auto-refine job using MPI, the job fails after a few seconds, and the slurm output looks like this:

RELION version: 3.0
Precision: BASE=double, CUDA-ACC=single

 === RELION MPI setup ===
 + Number of MPI processes             = 4
 + Master  (0) runs on host            = sgpu0401
 + Slave     1 runs on host            = sgpu0401
 + Slave     2 runs on host            = sgpu0401
 + Slave     3 runs on host            = sgpu0401
 =================

 uniqueHost sgpu0401 has 3 ranks.
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on slave 1 mapped to device 0
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on slave 2 mapped to device 1
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on slave 3 mapped to device 2

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 16810 RUNNING AT sgpu0401
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 16810 RUNNING AT sgpu0401
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
   Intel(R) MPI Library troubleshooting guide:
      https://software.intel.com/node/561764
===================================================================================


Here’s the command that was submitted:


#!/bin/bash
#SBATCH --job-name=RELION_GUI_TASK
#SBATCH --time=24:00:00
#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=2
#SBATCH --partition=sgpu
#SBATCH --distribution=arbitrary

cat $0 > [log in to unmask]/WTdynein/myscript.$SLURM_JOB_ID

module load intel impi cuda/9.1.85

mpirun -n 4 `which relion_refine_mpi` --o Refine3D/job086/run --auto_refine --split_random_halves --i Extract/job080/particles.star --ref Class3D/job041/run_it025_class001_box256.mrc --firstiter_cc --i$

As usual, any and all help would be greatly appreciated!
Steven




########################################################################

To unsubscribe from the CCPEM list, click the following link:
https://www.jiscmail.ac.uk/cgi-bin/webadmin?SUBED1=CCPEM&A=1