Hi Steven,
My experience with exit code 11 is that it indicates your job is hitting a segmentation fault. There are a few different conventions for setting up jobs to run under SLURM, but for
RELION jobs I typically specify the number of tasks, CPUs per task, and memory per CPU, so I know exactly how much RAM each MPI rank has to work with. I also tend to specify error and output file locations, as these can be useful for troubleshooting (if you hit a
seg fault, the error file will say that you ran out of contiguous memory).
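As a quick sanity check on the "exit code 11 means seg fault" reading: MPI launchers typically report either the raw signal number (11) or the shell-style exit status 128 + signal (139), and both point at SIGSEGV. A minimal sketch of the arithmetic (the numbers here are taken from your log, not from any SLURM-specific tool):

```shell
# Exit statuses above 128 encode a fatal signal: status = 128 + signal number.
exit_code=139
signal=$((exit_code - 128))
echo "signal number: $signal"   # 11
kill -l "$signal"               # prints the signal name, SEGV, on most systems
```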
Here’s a sample submission script I use for 3D refinement, for comparison. Some aspects are obviously specific to the system I’m using, but hopefully the framework is helpful to you.
Best,
Gavin
#!/bin/bash
#SBATCH --job-name=Refine3D
#SBATCH --nodes=6
#SBATCH --ntasks=24
#SBATCH --mem-per-cpu=15g
#SBATCH --cpus-per-task=5
#SBATCH --ntasks-per-node=4
#SBATCH --partition=gpu
#SBATCH --gres=gpu:tesla:2
#SBATCH --error=/path/to/job/relion_run.err
#SBATCH --output=/path/to/job/relion_run.out
#SBATCH --mail-type=ALL
#SBATCH --mail-user=[log in to unmask]
module load gnu/5.4.0
module load mvapich2/2.2
module load cuda/9.2.148.1
module load relion/3.0_mvapich_enabled_mpi
module load pmix
srun --mpi=pmi2 `which relion_refine_mpi` (refinement command from relion)
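To make the memory reasoning behind the directives above explicit, here is the arithmetic, using the values from this script (each rank gets cpus-per-task × mem-per-cpu of RAM, and each node hosts ntasks-per-node ranks):

```shell
# Values copied from the #SBATCH directives above
mem_per_cpu_gb=15    # --mem-per-cpu=15g
cpus_per_task=5      # --cpus-per-task=5
ntasks_per_node=4    # --ntasks-per-node=4

ram_per_rank=$((mem_per_cpu_gb * cpus_per_task))    # RAM available to one MPI rank
ram_per_node=$((ram_per_rank * ntasks_per_node))    # total RAM requested per node
echo "RAM per rank: ${ram_per_rank} GB; per node: ${ram_per_node} GB"
# RAM per rank: 75 GB; per node: 300 GB
```

If a rank seg-faults, increasing mem-per-cpu (or reducing ranks per node) is the first knob to try.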
From: Collaborative Computational Project in Electron cryo-Microscopy <[log in to unmask]> on behalf of "Markus,Steven" <[log in to unmask]>
Reply-To: "Markus,Steven" <[log in to unmask]>
Date: Thursday, 11 April 2019 at 10:34 am
To: "[log in to unmask]" <[log in to unmask]>
Subject: [ccpem] MPI job on relion with 3D auto-refine
Hello all,
I’ve made some progress on my MPI problems with RELION on an HPC system. I can run MPI jobs for at least 2D classification and 3D classification. However, when I try to run a 3D auto-refine job using MPI, the job fails after a few seconds, and
the SLURM output looks like this:
RELION version: 3.0
Precision: BASE=double, CUDA-ACC=single
=== RELION MPI setup ===
+ Number of MPI processes = 4
+ Master (0) runs on host = sgpu0401
+ Slave 1 runs on host = sgpu0401
+ Slave 2 runs on host = sgpu0401
+ Slave 3 runs on host = sgpu0401
=================
uniqueHost sgpu0401 has 3 ranks.
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
Thread 0 on slave 1 mapped to device 0
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
Thread 0 on slave 2 mapped to device 1
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
Thread 0 on slave 3 mapped to device 2
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 16810 RUNNING AT sgpu0401
= EXIT CODE: 139
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 16810 RUNNING AT sgpu0401
= EXIT CODE: 11
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
Intel(R) MPI Library troubleshooting guide:
===================================================================================
Here’s the command that was submitted:
#!/bin/bash
#SBATCH --job-name=RELION_GUI_TASK
#SBATCH --time=24:00:00
#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=2
#SBATCH --partition=sgpu
#SBATCH --distribution=arbitrary
cat $0 > /[log in to unmask]/WTdynein/myscript.$SLURM_JOB_ID
module load intel impi cuda/9.1.85
mpirun -n 4 `which relion_refine_mpi` --o Refine3D/job086/run --auto_refine --split_random_halves --i Extract/job080/particles.star --ref Class3D/job041/run_it025_class001_box256.mrc
--firstiter_cc --i$
As usual, any and all help would be greatly appreciated!
Steven