CCPEM Archives

CCPEM@JISCMAIL.AC.UK

CCPEM December 2018
Subject: Re: Multinode RELION jobs via MPI on an LSF cluster
From: Sjors Scheres <[log in to unmask]>
Reply-To: Sjors Scheres <[log in to unmask]>
Date: Thu, 20 Dec 2018 07:29:45 -0000
Content-Type: text/plain
Parts/Attachments: text/plain (363 lines)

Dear Kacper,
We often see these problems when multiple different MPI installations are
mixed. Make sure you run RELION (mpirun) with the same MPI installation
that you compiled RELION with. To avoid problems, we typically recompile
Open MPI specifically for use with RELION and do not depend on system
installations of Open MPI.
HTH,
Sjors
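
A quick way to spot such a mismatch (a rough sketch; paths and module names
will differ from system to system) is to compare the MPI library that
relion_refine_mpi was linked against with the mpirun that is first on the
PATH:

# Which mpirun is first on the PATH, and which MPI does it come from?
which mpirun
mpirun --version

# Which MPI library is the RELION binary actually linked against?
ldd `which relion_refine_mpi` | grep -i mpi

If ldd shows an Open MPI libmpi from one prefix while mpirun --version
reports MPICH or a different Open MPI build, the two installations are
being mixed.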

> Dear Sjors,
>
> Thank you for the useful suggestions!
>
> I have checked, and it turns out that openMPI was simply provided by
> Ubuntu, and not compiled for the LSF cluster architecture. I know that our
> HPC team is planning to migrate to Slurm at some point, so I will make
> sure that the MPI is properly compiled when this happens.
>
> However, your suggestion to bypass LSF and instead go directly with
> specific hosts seems to have almost worked for me. I found that there are
> multiple mpirun instances on my system, and that only one of them starts
> RELION on multiple hosts without failing. However, for some reason,
> instead of respecting the -n 2 flag of mpirun, it starts two instances of
> -n 1 per host, and therefore the whole setup fails with the following
> error:
> MlOptimiserMpi::initialiseWorkLoad: at least 2 MPI processes are required,
> otherwise use the sequential program
>
> The log is below. Any help in getting this to work would be much
> appreciated!
>
> Best wishes,
>
> Kacper
>
> mpirun.mpich --hostfile gpuhosts --np 4 `which relion_refine_mpi` \
>> --o /DATASETS/RELION-BENCHMARK/3D-benchmark_15-MF-r16u29-30/ \
>> --i /DATASETS/RELION-BENCHMARK/Particles/shiny_2sets.star \
>> --j 8 \
>> --gpu \
>> --pool 50 \
>> --no_parallel_disc_io \
>> --scratch_dir /tmp \
>> --dont_combine_weights_via_disc \
>> --ref /DATASETS/RELION-BENCHMARK/emd_2660.map:mrc \
>> --firstiter_cc \
>> --ini_high 60 \
>> --ctf \
>> --ctf_corrected_ref \
>> --iter 25 \
>> --tau2_fudge 4 \
>> --particle_diameter 360 \
>> --K 6 \
>> --flatten_solvent \
>> --zero_mask \
>> --oversampling 1 \
>> --healpix_order 2 \
>> --offset_range 5 \
>> --offset_step 2 \
>> --sym C1 \
>> --norm \
>> --scale \
>> --random_seed 0 \
>> --o class3d
> RELION version: 3.0-beta-2
> Precision: BASE=double, CUDA-ACC=single
>
> === RELION MPI setup ===
>  + Number of MPI processes             = 1
>  + Number of threads per MPI process  = 8
>  + Total number of threads therefore  = 8
>  + Master  (0) runs on host            = it-r16u29
>  =================
> RELION version: 3.0-beta-2
> Precision: BASE=double, CUDA-ACC=single
>
>  === RELION MPI setup ===
>  + Number of MPI processes             = 1
>  + Number of threads per MPI process  = 8
>  + Total number of threads therefore  = 8
>  + Master  (0) runs on host            = it-r16u29
>  =================
> RELION version: 3.0-beta-2
> Precision: BASE=double, CUDA-ACC=single
>
>  === RELION MPI setup ===
>  + Number of MPI processes             = 1
>  + Number of threads per MPI process  = 8
>  + Total number of threads therefore  = 8
>  + Master  (0) runs on host            = it-r16u30
>  =================
> RELION version: 3.0-beta-2
> Precision: BASE=double, CUDA-ACC=single
>
>  === RELION MPI setup ===
>  + Number of MPI processes             = 1
>  + Number of threads per MPI process  = 8
>  + Total number of threads therefore  = 8
>  + Master  (0) runs on host            = it-r16u30
>  =================
>  Running CPU instructions in double precision.
>  Running CPU instructions in double precision.
>  Running CPU instructions in double precision.
>  Running CPU instructions in double precision.
>
> in: /home/rogala/cryo-EM/RELION/relion-3.0_beta/src/ml_optimiser_mpi.cpp,
> line 590
> === Backtrace  ===
> /KACPER/cryo-EM/RELION/relion-3.0_beta/build/bin/relion_refine_mpi(_ZN11RelionErrorC1ERKSsS1_l+0x41)
> [0x447321]
> /KACPER/cryo-EM/RELION/relion-3.0_beta/build/bin/relion_refine_mpi(_ZN14MlOptimiserMpi18initialiseWorkLoadEv+0x2c8)
> [0x466ca8]
> /KACPER/cryo-EM/RELION/relion-3.0_beta/build/bin/relion_refine_mpi(_ZN14MlOptimiserMpi10initialiseEv+0x9c8)
> [0x467d08]
> /KACPER/cryo-EM/RELION/relion-3.0_beta/build/bin/relion_refine_mpi(main+0xb0b)
> [0x433f5b]
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7fcb2392af45]
> /KACPER/cryo-EM/RELION/relion-3.0_beta/build/bin/relion_refine_mpi()
> [0x43729f]
> ==================
> ERROR:
> MlOptimiserMpi::initialiseWorkLoad: at least 2 MPI processes are required,
> otherwise use the sequential program
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> with errorcode 1.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> in: /home/rogala/cryo-EM/RELION/relion-3.0_beta/src/ml_optimiser_mpi.cpp,
> line 590
> === Backtrace  ===
> /KACPER/cryo-EM/RELION/relion-3.0_beta/build/bin/relion_refine_mpi(_ZN11RelionErrorC1ERKSsS1_l+0x41)
> [0x447321]
> /KACPER/cryo-EM/RELION/relion-3.0_beta/build/bin/relion_refine_mpi(_ZN14MlOptimiserMpi18initialiseWorkLoadEv+0x2c8)
> [0x466ca8]
> /KACPER/cryo-EM/RELION/relion-3.0_beta/build/bin/relion_refine_mpi(_ZN14MlOptimiserMpi10initialiseEv+0x9c8)
> [0x467d08]
> /KACPER/cryo-EM/RELION/relion-3.0_beta/build/bin/relion_refine_mpi(main+0xb0b)
> [0x433f5b]
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f608037df45]
> /KACPER/cryo-EM/RELION/relion-3.0_beta/build/bin/relion_refine_mpi()
> [0x43729f]
> ==================
> ERROR:
> MlOptimiserMpi::initialiseWorkLoad: at least 2 MPI processes are required,
> otherwise use the sequential program
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> with errorcode 1.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
>
> ===================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   EXIT CODE: 1
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
> [proxy:0:1@it-r16u30] HYD_pmcd_pmip_control_cmd_cb
> (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
> [proxy:0:1@it-r16u30] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:1@it-r16u30] main (./pm/pmiserv/pmip.c:206): demux engine error
> waiting for event
> [mpiexec@it-r16u29] HYDT_bscu_wait_for_completion
> (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated
> badly; aborting
> [mpiexec@it-r16u29] HYDT_bsci_wait_for_completion
> (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting
> for
> completion
> [mpiexec@it-r16u29] HYD_pmci_wait_for_completion
> (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for
> completion
> [mpiexec@it-r16u29] main (./ui/mpich/mpiexec.c:331): process manager error
> waiting for completion
>
> On Tue, 11 Dec 2018 at 10:13, Sjors Scheres <[log in to unmask]>
> wrote:
>
>> Dear Kacper,
>>
>> Was openMPI compiled with support for this queuing system?
>>
>> Alternatively, does the queueing system provide an environment variable,
>> or temporary file, with the names of the assigned nodes? If so, you
>> could pass that as a -machinefile option to mpirun.
>>
>> HTH,
>>
>> Sjors
>>
>>
>>
>> On 12/10/2018 10:02 PM, Kacper Rogala wrote:
>> > Hello Everyone,
>> >
>> > I have a question about running Relion via OpenMPI on multiple cluster
>> > nodes. The cluster architecture is LSF, and I have been trying to use
>> > the following test submission script (see below) to distribute jobs
>> > across nodes. The number of GPUs is not recorded in the node
>> > description, so I have been listing specific nodes that I know carry
>> > GPU cards (CUDA 8).
>> >
>> > I can see that LSF reserves computational resources on the nodes that
>> > I requested, but Relion ends up running only on the first node, while
>> > the other nodes stay idle (see the output file at the bottom of this
>> > email).
>> >
>> > Perhaps there is something you can recommend that I can add to either
>> > the submission script or the mpirun command to make this work?
>> >
>> > Also, do you know whether it is worth distributing jobs across nodes
>> > to begin with? Have people observed any significant improvement in
>> > calculation times in such cases, or perhaps the gain is only minimal?
>> >
>> > Any help would be much appreciated!
>> >
>> > Best wishes,
>> >
>> > Kacper
>> >
>> > --
>> > Kacper Rogala, DPhil
>> > Postdoctoral Fellow
>> > Whitehead Institute / MIT
>> >
>> ------------------------------------------------------------------------------
>> > #!/bin/sh
>> > #
>> > #BSUB -J benchmark06 # job name
>> > #BSUB -n 4
>> > #BSUB -W 96:00 # Job wall clock limit hh:mm
>> > #BSUB -q gpu
>> > #BSUB -m "it-r16u29 it-r16u30"
>> > #BSUB -R "span[ptile=2]"
>> > #BSUB -x
>> > #BSUB -e errors-%J.log # error file name in which %J is replaced by
>> > the job ID
>> > #BSUB -o output-%J.log # output file name in which %J is replaced by
>> > the job ID
>> > #BSUB -B # email status notifications
>> >
>> > time mpirun -n 4 `which relion_refine_mpi` \
>> > --o /DATASETS/RELION-BENCHMARK/3D-benchmark_06-mpibynode-r16u29-30/ \
>> > --i /DATASETS/RELION-BENCHMARK/Particles/shiny_2sets.star \
>> > --j 6 \
>> > --gpu \
>> > --pool 50 \
>> > --no_parallel_disc_io \
>> > --scratch_dir /tmp \
>> > --dont_combine_weights_via_disc \
>> > --ref /DATASETS/RELION-BENCHMARK/emd_2660.map:mrc \
>> > --firstiter_cc \
>> > --ini_high 60 \
>> > --ctf \
>> > --ctf_corrected_ref \
>> > --iter 25 \
>> > --tau2_fudge 4 \
>> > --particle_diameter 360 \
>> > --K 6 \
>> > --flatten_solvent \
>> > --zero_mask \
>> > --oversampling 1 \
>> > --healpix_order 2 \
>> > --offset_range 5 \
>> > --offset_step 2 \
>> > --sym C1 \
>> > --norm \
>> > --scale \
>> > --random_seed 0 \
>> > --o class3d
>> >
>> > --------------------------------------------------------------
>> > The output log file:
>> >
>> > RELION version: 3.0-beta-2
>> > Precision: BASE=double, CUDA-ACC=single
>> >
>> >  === RELION MPI setup ===
>> >  + Number of MPI processes             = 4
>> >  + Number of threads per MPI process  = 6
>> >  + Total number of threads therefore  = 24
>> >  + Master  (0) runs on host            = it-r16u29
>> >  + Slave     1 runs on host            = it-r16u29
>> >  + Slave     2 runs on host            = it-r16u29
>> >  + Slave     3 runs on host            = it-r16u29
>> >  =================
>> >  uniqueHost it-r16u29 has 3 ranks.
>> > GPU-ids not specified for this rank, threads will automatically be
>> > mapped to available devices.
>> >  Thread 0 on slave 1 mapped to device 0
>> >  Thread 1 on slave 1 mapped to device 0
>> >  Thread 2 on slave 1 mapped to device 0
>> >  Thread 3 on slave 1 mapped to device 0
>> >  Thread 4 on slave 1 mapped to device 0
>> >  Thread 5 on slave 1 mapped to device 0
>> > GPU-ids not specified for this rank, threads will automatically be
>> > mapped to available devices.
>> >  Thread 0 on slave 2 mapped to device 0
>> >  Thread 1 on slave 2 mapped to device 0
>> >  Thread 2 on slave 2 mapped to device 0
>> >  Thread 3 on slave 2 mapped to device 0
>> >  Thread 4 on slave 2 mapped to device 0
>> >  Thread 5 on slave 2 mapped to device 0
>> > GPU-ids not specified for this rank, threads will automatically be
>> > mapped to available devices.
>> >  Thread 0 on slave 3 mapped to device 0
>> >  Thread 1 on slave 3 mapped to device 0
>> >  Thread 2 on slave 3 mapped to device 0
>> >  Thread 3 on slave 3 mapped to device 0
>> >  Thread 4 on slave 3 mapped to device 0
>> >  Thread 5 on slave 3 mapped to device 0
>> > Device 0 on it-r16u29 is split between 3 slaves
>> >  Running CPU instructions in double precision.
>> >  + On host it-r16u29: free scratch space = 66 Gb.
>> >  Copying particles to scratch directory: /tmp/relion_volatile/
>> > ........
>> >
>> >
>>
>> --
>> Sjors Scheres
>> MRC Laboratory of Molecular Biology
>> Francis Crick Avenue, Cambridge Biomedical Campus
>> Cambridge CB2 0QH, U.K.
>> tel: +44 (0)1223 267061
>> http://www2.mrc-lmb.cam.ac.uk/groups/scheres
>>
>>
>
>


-- 
Sjors Scheres
MRC Laboratory of Molecular Biology
Francis Crick Avenue, Cambridge Biomedical Campus
Cambridge CB2 0QH, U.K.
tel: +44 (0)1223 267061
http://www2.mrc-lmb.cam.ac.uk/groups/scheres
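
A rough illustration of the machinefile approach suggested earlier in this
thread (a sketch only; the exact option spelling and the LSF variables
available depend on the local mpirun and LSF versions): under LSF the hosts
assigned to a job are typically listed, one line per slot, in the file named
by LSB_DJOB_HOSTFILE, which can be passed straight to mpirun, provided that
mpirun belongs to the same MPI installation RELION was built against.

#BSUB -n 4
#BSUB -R "span[ptile=2]"

# Hypothetical fragment: hand the LSF-generated host list to mpirun.
# (Open MPI also accepts --hostfile; MPICH uses -machinefile or -f.)
echo "Hosts assigned by LSF:"
cat "$LSB_DJOB_HOSTFILE"
time mpirun -n 4 -machinefile "$LSB_DJOB_HOSTFILE" `which relion_refine_mpi` \
    --i /DATASETS/RELION-BENCHMARK/Particles/shiny_2sets.star --o class3d --j 6 --gpu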

########################################################################

To unsubscribe from the CCPEM list, click the following link:
https://www.jiscmail.ac.uk/cgi-bin/webadmin?SUBED1=CCPEM&A=1
