Hi Steven,

 

My experience with exit code 11 is that it indicates your job is hitting a segmentation fault. There are a few different conventions out there for setting up jobs to run on SLURM, but for RELION jobs I typically specify the number of tasks, CPUs per task, and memory per CPU so that I know exactly how much RAM each MPI rank has to work with. I also tend to specify error and output file locations, as these can be useful for troubleshooting (if you hit a seg fault, the error file will say that you ran out of contiguous memory).
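
To put numbers on that using the sample script further down: with --mem-per-cpu, SLURM allocates memory per CPU, so each MPI rank gets --mem-per-cpu multiplied by --cpus-per-task. A quick sanity check before submitting (plain shell arithmetic, nothing RELION-specific):

# RAM available to each MPI rank = mem-per-cpu x cpus-per-task
echo $(( 15 * 5 ))   # 75 GB per rank with --mem-per-cpu=15g and --cpus-per-task=5
# RAM requested per node = ntasks-per-node x RAM per rank
echo $(( 4 * 75 ))   # 300 GB per node with --ntasks-per-node=4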

 

Here’s a sample submission script I use for 3D refinement, for comparison. Obviously some aspects are specific to the system you’re using, but hopefully this framework is helpful to you.

 

Best,

Gavin

 

#!/bin/bash
#SBATCH --job-name=Refine3D
#SBATCH --nodes=6
#SBATCH --ntasks=24
#SBATCH --mem-per-cpu=15g
#SBATCH --cpus-per-task=5
#SBATCH --ntasks-per-node=4
#SBATCH --partition=gpu
#SBATCH --gres=gpu:tesla:2
#SBATCH --error=/path/to/job/relion_run.err
#SBATCH --output=/path/to/job/relion_run.out
#SBATCH --mail-type=ALL
#SBATCH --mail-user=[log in to unmask]

module load gnu/5.4.0
module load mvapich2/2.2
module load cuda/9.2.148.1
module load relion/3.0_mvapich_enabled_mpi
module load pmix

srun --mpi=pmi2 `which relion_refine_mpi` (refinement command from relion)
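
If it helps, here is roughly what that last line ends up looking like once the refinement command generated by the RELION GUI is pasted in. The paths and parameter values below are placeholders for illustration only (not taken from your job), so substitute your own:

srun --mpi=pmi2 `which relion_refine_mpi` \
    --o Refine3D/job001/run --auto_refine --split_random_halves \
    --i Extract/job001/particles.star --ref reference_map.mrc \
    --ini_high 30 --ctf --particle_diameter 200 --flatten_solvent --zero_mask \
    --pool 3 --pad 2 --dont_combine_weights_via_disc \
    --sym C1 --low_resol_join_halves 40 --norm --scale \
    --gpu "" --j 5

I set --j to match --cpus-per-task so the threads per rank line up with the CPUs (and therefore the memory) each rank has been allocated.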

 

 

 

From: Collaborative Computational Project in Electron cryo-Microscopy <[log in to unmask]> on behalf of "Markus,Steven" <[log in to unmask]>
Reply-To: "Markus,Steven" <[log in to unmask]>
Date: Thursday, 11 April 2019 at 10:34 am
To: "[log in to unmask]" <[log in to unmask]>
Subject: [ccpem] MPI job on relion with 3D auto-refine

 

Hello all,

 

I’ve made some progress on my MPI run problems with RELION on an HPC. I can run MPI jobs for at least 2D averaging and 3D classification. However, when I try to run a 3D auto-refine job using MPI, the job fails after a few seconds, and the SLURM output looks like this:

 

RELION version: 3.0

Precision: BASE=double, CUDA-ACC=single

 

 === RELION MPI setup ===

 + Number of MPI processes             = 4

 + Master  (0) runs on host            = sgpu0401

 + Slave     1 runs on host            = sgpu0401

 + Slave     2 runs on host            = sgpu0401

 + Slave     3 runs on host            = sgpu0401
 =================

 

 uniqueHost sgpu0401 has 3 ranks.

GPU-ids not specified for this rank, threads will automatically be mapped to available devices.

 Thread 0 on slave 1 mapped to device 0

GPU-ids not specified for this rank, threads will automatically be mapped to available devices.

 Thread 0 on slave 2 mapped to device 1

GPU-ids not specified for this rank, threads will automatically be mapped to available devices.

 Thread 0 on slave 3 mapped to device 2

 

===================================================================================

=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES

=   PID 16810 RUNNING AT sgpu0401

=   EXIT CODE: 139

=   CLEANING UP REMAINING PROCESSES

=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

===================================================================================

 

===================================================================================

=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES

=   PID 16810 RUNNING AT sgpu0401

=   EXIT CODE: 11

=   CLEANING UP REMAINING PROCESSES

=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

===================================================================================

   Intel(R) MPI Library troubleshooting guide:

      https://software.intel.com/node/561764

===================================================================================



Here’s the command that was submitted:



#!/bin/bash
#SBATCH --job-name=RELION_GUI_TASK
#SBATCH --time=24:00:00
#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=2
#SBATCH --partition=sgpu
#SBATCH --distribution=arbitrary

cat $0 > /[log in to unmask]/WTdynein/myscript.$SLURM_JOB_ID

module load intel impi cuda/9.1.85

mpirun -n 4 `which relion_refine_mpi` --o Refine3D/job086/run --auto_refine --split_random_halves --i Extract/job080/particles.star --ref Class3D/job041/run_it025_class001_box256.mrc --firstiter_cc --i$

 

As usual, any and all help would be greatly appreciated!

Steven

To unsubscribe from the CCPEM list, click the following link:
https://www.jiscmail.ac.uk/cgi-bin/webadmin?SUBED1=CCPEM&A=1