Hi Steven,

 

My experience with exit code 11 is that it indicates your job is hitting a segmentation fault. There are a few different conventions out there for setting up jobs to run on SLURM, but for RELION jobs I typically specify the number of tasks, CPUs per task, and memory per CPU so that I know exactly how much RAM each MPI rank has to work with. I also tend to specify error and output file locations, as these can be useful for troubleshooting (if you hit a seg fault, the error file will say that you ran out of contiguous memory).
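
To put numbers on that using the sample script further down: with --mem-per-cpu, SLURM allocates memory per CPU, so each MPI rank gets --mem-per-cpu multiplied by --cpus-per-task. A quick sanity check before submitting (plain shell arithmetic, nothing RELION-specific):

# RAM available to each MPI rank = mem-per-cpu x cpus-per-task
echo $(( 15 * 5 ))   # 75 GB per rank with --mem-per-cpu=15g and --cpus-per-task=5
# RAM requested per node = ntasks-per-node x RAM per rank
echo $(( 4 * 75 ))   # 300 GB per node with --ntasks-per-node=4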

 

Here’s a sample submission script I use for 3D refinement, for comparison. Obviously some aspects are specific to the system you’re using, but hopefully this framework is helpful to you.

 

Best,

Gavin

 

#!/bin/bash
#SBATCH --job-name=Refine3D
#SBATCH --nodes=6
#SBATCH --ntasks=24
#SBATCH --mem-per-cpu=15g
#SBATCH --cpus-per-task=5
#SBATCH --ntasks-per-node=4
#SBATCH --partition=gpu
#SBATCH --gres=gpu:tesla:2
#SBATCH --error=/path/to/job/relion_run.err
#SBATCH --output=/path/to/job/relion_run.out
#SBATCH --mail-type=ALL
#SBATCH --mail-user=[log in to unmask]

module load gnu/5.4.0
module load mvapich2/2.2
module load cuda/9.2.148.1
module load relion/3.0_mvapich_enabled_mpi
module load pmix

srun --mpi=pmi2 `which relion_refine_mpi` (refinement command from relion)
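
If it helps, here is roughly what that last line ends up looking like once the refinement command generated by the RELION GUI is pasted in. The paths and parameter values below are placeholders for illustration only (not taken from your job), so substitute your own:

srun --mpi=pmi2 `which relion_refine_mpi` \
    --o Refine3D/job001/run --auto_refine --split_random_halves \
    --i Extract/job001/particles.star --ref reference_map.mrc \
    --ini_high 30 --ctf --particle_diameter 200 --flatten_solvent --zero_mask \
    --pool 3 --pad 2 --dont_combine_weights_via_disc \
    --sym C1 --low_resol_join_halves 40 --norm --scale \
    --gpu "" --j 5

I set --j to match --cpus-per-task so the threads per rank line up with the CPUs (and therefore the memory) each rank has been allocated.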

 

 

 

From: Collaborative Computational Project in Electron cryo-Microscopy <[log in to unmask]> on behalf of "Markus,Steven" <[log in to unmask]>
Reply-To: "Markus,Steven" <[log in to unmask]>
Date: Thursday, 11 April 2019 at 10:34 am
To: "[log in to unmask]" <[log in to unmask]>
Subject: [ccpem] MPI job on relion with 3D auto-refine

 

Hello all,

 

I’ve made some progress on my MPI run problems with RELION on an HPC. I can run MPI jobs for at least 2D averaging and 3D classification. However, when I try to run a 3D auto-refine job using MPI, the job fails after a few seconds, and the SLURM output looks like this:

 

RELION version: 3.0

Precision: BASE=double, CUDA-ACC=single

 

 === RELION MPI setup ===

 + Number of MPI processes             = 4

 + Master  (0) runs on host            = sgpu0401

 + Slave     1 runs on host            = sgpu0401

 + Slave     2 runs on host            = sgpu0401

 + Slave     3 runs on host            = sgpu0401
 =================

 

 uniqueHost sgpu0401 has 3 ranks.

GPU-ids not specified for this rank, threads will automatically be mapped to available devices.

 Thread 0 on slave 1 mapped to device 0

GPU-ids not specified for this rank, threads will automatically be mapped to available devices.

 Thread 0 on slave 2 mapped to device 1

GPU-ids not specified for this rank, threads will automatically be mapped to available devices.

 Thread 0 on slave 3 mapped to device 2

 

===================================================================================

=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES

=   PID 16810 RUNNING AT sgpu0401

=   EXIT CODE: 139

=   CLEANING UP REMAINING PROCESSES

=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

===================================================================================

 

===================================================================================

=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES

=   PID 16810 RUNNING AT sgpu0401

=   EXIT CODE: 11

=   CLEANING UP REMAINING PROCESSES

=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

===================================================================================

   Intel(R) MPI Library troubleshooting guide:

      https://software.intel.com/node/561764

===================================================================================



Here’s the command that was submitted:



#!/bin/bash
#SBATCH --job-name=RELION_GUI_TASK
#SBATCH --time=24:00:00
#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=2
#SBATCH --partition=sgpu
#SBATCH --distribution=arbitrary

cat $0 > /[log in to unmask]/WTdynein/myscript.$SLURM_JOB_ID

module load intel impi cuda/9.1.85

mpirun -n 4 `which relion_refine_mpi` --o Refine3D/job086/run --auto_refine --split_random_halves --i Extract/job080/particles.star --ref Class3D/job041/run_it025_class001_box256.mrc --firstiter_cc --i$

 

As usual, any and all help would be greatly appreciated!

Steven

To unsubscribe from the CCPEM list, click the following link:
https://www.jiscmail.ac.uk/cgi-bin/webadmin?SUBED1=CCPEM&A=1