Hi Steven,

My experience with exit code 11 is that it indicates your job is hitting a segmentation fault. There are a few different conventions out there for setting up jobs to run on SLURM, but for RELION jobs I typically specify the number of tasks, CPUs per task, and memory per CPU, so I know exactly how much RAM each MPI rank has to work with. I also tend to specify error and output file locations, as these can be useful for troubleshooting (if you get a seg fault, the error file will say that you ran out of contiguous memory).

Here's a sample submission script I use for 3D refinement, for comparison. Obviously some aspects are specific to the system you're using, but hopefully this framework is helpful to you.

Best,
Gavin

#!/bin/bash
#SBATCH --job-name=Refine3D
#SBATCH --nodes=6
#SBATCH --ntasks=24
#SBATCH --mem-per-cpu=15g
#SBATCH --cpus-per-task=5
#SBATCH --ntasks-per-node=4
#SBATCH --partition=gpu
#SBATCH --gres=gpu:tesla:2
#SBATCH --error=/path/to/job/relion_run.err
#SBATCH --output=/path/to/job/relion_run.out
#SBATCH --mail-type=ALL
#SBATCH [log in to unmask]

module load gnu/5.4.0
module load mvapich2/2.2
module load cuda/9.2.148.1
module load relion/3.0_mvapich_enabled_mpi
module load pmix

srun --mpi=pmi2 `which relion_refine_mpi` (refinement command from relion)
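To make the arithmetic explicit: with --cpus-per-task=5 and --mem-per-cpu=15g, each of the 24 MPI ranks has 5 threads and 5 x 15 GB = 75 GB of RAM to work with, and --ntasks-per-node=4 keeps that to roughly 300 GB per node. As a rough sketch only of how the "(refinement command from relion)" placeholder gets filled in — the job paths below are placeholders, and the --j/--gpu flags are just an illustration of matching RELION's thread count to --cpus-per-task; the actual flags come from the RELION GUI for your particular job — it would look something like:

srun --mpi=pmi2 `which relion_refine_mpi` \
    --o Refine3D/job001/run --auto_refine --split_random_halves \
    --i Extract/job001/particles.star --ref reference_map.mrc \
    --j 5 --gpu ""   # --j matches --cpus-per-task; --gpu "" lets RELION map ranks to the allocated GPUs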
From: Collaborative Computational Project in Electron cryo-Microscopy <[log in to unmask]> on behalf of "Markus,Steven" <[log in to unmask]>
Reply-To: "Markus,Steven" <[log in to unmask]>
Date: Thursday, 11 April 2019 at 10:34 am
To: "[log in to unmask]" <[log in to unmask]>
Subject: [ccpem] MPI job on relion with 3D auto-refine

Hello all,

I've made some progress on my MPI run problems with Relion on an HPC. I can run MPI jobs for at least 2D averages and 3D classification. However, when I try to run a 3D auto-refine job using MPI, the job fails after a few seconds, and the slurm output looks like this:

RELION version: 3.0
Precision: BASE=double, CUDA-ACC=single

=== RELION MPI setup ===
+ Number of MPI processes = 4
+ Master (0) runs on host = sgpu0401
+ Slave 1 runs on host = sgpu0401
+ Slave 2 runs on host = sgpu0401
+ Slave 3 runs on host = sgpu0401
=================

uniqueHost sgpu0401 has 3 ranks.
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
Thread 0 on slave 1 mapped to device 0
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
Thread 0 on slave 2 mapped to device 1
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
Thread 0 on slave 3 mapped to device 2

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 16810 RUNNING AT sgpu0401
= EXIT CODE: 139
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 16810 RUNNING AT sgpu0401
= EXIT CODE: 11
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
Intel(R) MPI Library troubleshooting guide: https://software.intel.com/node/561764
===================================================================================

Here's the command that was submitted:

#!/bin/bash
#SBATCH --job-name=RELION_GUI_TASK
#SBATCH --time=24:00:00
#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=2
#SBATCH --partition=sgpu
#SBATCH --distribution=arbitrary

cat $0 > [log in to unmask]/WTdynein/myscript.$SLURM_JOB_ID

module load intel impi cuda/9.1.85

mpirun -n 4 `which relion_refine_mpi` --o Refine3D/job086/run --auto_refine --split_random_halves --i Extract/job080/particles.star --ref Class3D/job041/run_it025_class001_box256.mrc --firstiter_cc --i$

As usual, any and all help would be greatly appreciated!

Steven