Hi Chris,
Maybe you could try changing --j 15 to --j 4.
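
A minimal sketch of that change, assuming the command line is edited as a string rather than through the Scipion form. The helper name set_flag is hypothetical and the command is shortened for illustration. The reasoning, as far as I understand it, is that every thread of a GPU slave keeps its own buffers on the card, so fewer threads should mean less memory pressure.

```python
import shlex

def set_flag(command, flag, value):
    """Return `command` with the value that follows `flag` replaced by `value`."""
    tokens = shlex.split(command)
    idx = tokens.index(flag)  # raises ValueError if the flag is absent
    tokens[idx + 1] = str(value)
    return " ".join(shlex.quote(t) for t in tokens)

# Shortened version of the failing command, for illustration only
cmd = "relion_refine_mpi --gpu --pool 3 --K 6 --j 15"
new_cmd = set_flag(cmd, "--j", 4)
print(new_cmd)
```

This only works for flags that take a value (such as --j); a bare flag like --gpu would need separate handling.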
Xiaodi
> On May 31, 2017, at 7:00 AM, Wolfgang Lugmayr <[log in to unmask]> wrote:
>
> hi,
>
> it seems that you are running 3 mpi processes (= 1 master + 2 slaves) with relion_refine_mpi.
> depending on your hardware setup and box size you could try:
> --gpu "0:1" instead of --gpu
> (the colon separates the mpi slaves, so slave 1 would use gpu 0 and slave 2 gpu 1.)
>
> maybe all mpi slaves are packed on 1 gpu.
>
> you can use the command nvidia-smi to see which gpus are in use during the running job.
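
To make the nvidia-smi check easier to repeat while the job runs, here is a rough sketch that polls it once and prints the memory use per card. It assumes nvidia-smi is on the PATH and relies on its CSV query mode; the function names and the sample row values are made up for illustration and are not taken from the failing machine.

```python
import subprocess

QUERY = ["nvidia-smi",
         "--query-gpu=index,name,memory.used,memory.total",
         "--format=csv,noheader,nounits"]

def parse_gpu_csv(text):
    """Parse nvidia-smi CSV rows into (index, name, used_MiB, total_MiB) tuples."""
    rows = []
    for line in text.strip().splitlines():
        index, name, used, total = [field.strip() for field in line.split(",")]
        rows.append((int(index), name, int(used), int(total)))
    return rows

def gpu_usage():
    """Run nvidia-smi once and return the parsed rows (needs an NVIDIA driver)."""
    out = subprocess.check_output(QUERY, universal_newlines=True)
    return parse_gpu_csv(out)

# Invented sample output, only to show the expected shape of the data
sample = "0, TITAN Xp, 11800, 12189\n1, TITAN Xp, 10, 12189\n"
for index, name, used, total in parse_gpu_csv(sample):
    print("GPU %d (%s): %d / %d MiB" % (index, name, used, total))
```

If one card shows nearly all of its memory in use while the others sit idle, that would support the idea that both slaves landed on the same GPU.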
>
> cheers,
> wolfgang
>
>
>
>> On 05/31/2017 12:01 PM, Christopher Browning wrote:
>> Hi,
>>
>> I have an issue with Scipion: when I launch a Relion 3D-classification run, it crashes at the start of the 3rd iteration and complains of a lack of GPU memory. Our system consists of 2x NVIDIA Titan XP and 1x NVIDIA Quadro P5000 GPU cards. This error message has come up sporadically for other runs, but it used to go away when I relaunched the run; this specific C1 3D-classification, however, crashes when I try it. Is there something one can do to manage the memory usage better? I've attached the error message below.
>>
>> Many Thanks,
>>
>> Chris
>>
>>
>> Expectation iteration 3 of 25
>> 00128: 4.92/22.73 min ............~~(,_,">ERROR: CudaCustomAllocator out of memory
>> 00129: [requestedSpace: 60989440 B]
>> 00130: [largestContinuousFreeSpace: 22902272 B]
>> 00131: [totalFreeSpace: 53996032 B]
>> 00132: [512B] (36864B) (165888B) (36864B) (512B) (512B) (512B) (512B) [3584B] (512B) [12800B] (165888B) (36864B) (165888B) (36864B) (165888B) (36864B) (165888B) (36864B) (165888B) [56320B] (110592B) (110592B) (110592B) (110592B) (110592B) [471040B] (108032B) (109056B) (108032B) (109056B) (108032B) (109056B) (108032B) (109056B) (108032B) (109056B) (70753792B) (141507072B) (141507072B) (141507072B) (141507072B) (283014144B) [7581184B] (69500928B) (139001856B) (139001856B) (139001856B) (139001856B) (278003712B) [7446528B] (73374720B) (146748928B) (146748928B) (146748928B) (146748928B) (293497344B) [7861760B] (71493120B) (142985728B) (142985728B) (142985728B) (142985728B) (285970944B) [7660032B] (72887808B) (145775616B) (145775616B) (145775616B) (145775616B) (291551232B) (48721920B) (48721920B) (54879232B) (54879232B) (57516544B) (57516544B) (61488128B) (61488128B) [22902272B] = 5156189696B
>> 00133: [localhost:15970] *** Process received signal ***
>> 00134: [localhost:15970] Signal: Segmentation fault (11)
>> 00135: [localhost:15970] Signal code: Address not mapped (1)
>> 00136: [localhost:15970] Failing at address: 0x28
>> 00137: [localhost:15970] [ 0] /lib64/libpthread.so.0(+0xf370)[0x7f84fa7fc370]
>> 00138: [localhost:15970] [ 1] /usr/local/scipion/software/em/relion-2.0/lib/librelion_lib.so(_ZN13CudaGlobalPtrIfLb1EE11free_if_setEv+0x40)[0x7f850526b0d0]
>> 00139: [localhost:15970] [ 2] /usr/local/scipion/software/em/relion-2.0/lib/librelion_gpu_util.so(_Z37convertAllSquaredDifferencesToWeightsIfEvjR21OptimisationParamtersR18SamplingParametersP11MlOptimiserP15MlOptimiserCudaRSt6vectorI16IndexedDataArraySaIS9_EERS8_IS8_I20IndexedDataArrayMaskSaISD_EESaISF_EER13CudaGlobalPtrIfLb1EEb+0x3267)[0x7f84fb519f67]
>> 00140: [localhost:15970] [ 3] /usr/local/scipion/software/em/relion-2.0/lib/librelion_gpu_util.so(_ZN15MlOptimiserCuda32doThreadExpectationSomeParticlesEi+0x167a)[0x7f84fb4f2eca]
>> 00141: [localhost:15970] [ 4] /usr/local/scipion/software/em/relion-2.0/lib/librelion_lib.so(_Z36globalThreadExpectationSomeParticlesR14ThreadArgument+0x28)[0x7f850526aa78]
>> 00142: [localhost:15970] [ 5] /usr/local/scipion/software/em/relion-2.0/lib/librelion_lib.so(_Z11_threadMainPv+0x1d)[0x7f850529d86d]
>> 00143: [localhost:15970] [ 6] /lib64/libpthread.so.0(+0x7dc5)[0x7f84fa7f4dc5]
>> 00144: [localhost:15970] [ 7] /lib64/libc.so.6(clone+0x6d)[0x7f84fa52373d]
>> 00145: [localhost:15970] *** End of error message ***
>> 00146: --------------------------------------------------------------------------
>> 00147: mpirun noticed that process rank 1 with PID 15970 on node localhost exited on signal 11 (Segmentation fault).
>> 00148: --------------------------------------------------------------------------
>> 00149: Traceback (most recent call last):
>> 00150: File "/usr/local/scipion/pyworkflow/protocol/protocol.py", line 182, in run
>> 00151: self._run()
>> 00152: File "/usr/local/scipion/pyworkflow/protocol/protocol.py", line 228, in _run
>> 00153: resultFiles = self._runFunc()
>> 00154: File "/usr/local/scipion/pyworkflow/protocol/protocol.py", line 224, in _runFunc
>> 00155: return self._func(*self._args)
>> 00156: File "/usr/local/scipion/pyworkflow/em/packages/relion/protocol_base.py", line 741, in runRelionStep
>> 00157: self.runJob(self._getProgram(), params)
>> 00158: File "/usr/local/scipion/pyworkflow/protocol/protocol.py", line 1077, in runJob
>> 00159: self._stepsExecutor.runJob(self._log, program, arguments, **kwargs)
>> 00160: File "/usr/local/scipion/pyworkflow/protocol/executor.py", line 56, in runJob
>> 00161: env=env, cwd=cwd)
>> 00162: File "/usr/local/scipion/pyworkflow/utils/process.py", line 51, in runJob
>> 00163: return runCommand(command, env, cwd)
>> 00164: File "/usr/local/scipion/pyworkflow/utils/process.py", line 65, in runCommand
>> 00165: check_call(command, shell=True, stdout=sys.stdout, stderr=sys.stderr, env=env, cwd=cwd)
>> 00166: File "/usr/local/scipion/software/lib/python2.7/subprocess.py", line 540, in check_call
>> 00167: raise CalledProcessError(retcode, cmd)
>> 00168: CalledProcessError: Command 'mpirun -np 3 -bynode `which relion_refine_mpi` --gpu --pool 3 --angpix 1.35 --dont_combine_weights_via_disc --ref Runs/004791_ProtRelionClassify3D/tmp/output_volume.mrc --scale --offset_range 5.0 --ini_high 30.0 --offset_step 2.0 --healpix_order 2 --tau2_fudge 2 --ctf --oversampling 1 --o Runs/004791_ProtRelionClassify3D/extra/relion --i Runs/004791_ProtRelionClassify3D/input_particles.star --iter 25 --zero_mask --norm --firstiter_cc --sym c1 --K 6 --solvent_mask Runs/004119_ProtRelionCreateMask3D/extra/mask.mrc --flatten_solvent --particle_diameter 190 --j 15' returned non-zero exit status 139
>> 00169: Protocol failed: Command 'mpirun -np 3 -bynode `which relion_refine_mpi` --gpu --pool 3 --angpix 1.35 --dont_combine_weights_via_disc --ref Runs/004791_ProtRelionClassify3D/tmp/output_volume.mrc --scale --offset_range 5.0 --ini_high 30.0 --offset_step 2.0 --healpix_order 2 --tau2_fudge 2 --ctf --oversampling 1 --o Runs/004791_ProtRelionClassify3D/extra/relion --i Runs/004791_ProtRelionClassify3D/input_particles.star --iter 25 --zero_mask --norm --firstiter_cc --sym c1 --K 6 --solvent_mask Runs/004119_ProtRelionCreateMask3D/extra/mask.mrc --flatten_solvent --particle_diameter 190 --j 15' returned non-zero exit status 139
>> 00170: FAILED: runRelionStep, step 2
>> 00171: 2017-05-31 10:48:22.696428
>> 00172: ------------------- PROTOCOL FAILED (DONE 2/3)
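
The three numbers at the top of the allocator error are worth reading closely. The sketch below pulls them out of the log text and compares them; the function name is hypothetical, and the regular expression is only assumed to match the message format shown in this log. For this run the request (60989440 B) is larger than the total free space (53996032 B), so the pool was exhausted and not merely fragmented.

```python
import re

def parse_allocator_error(log_text):
    """Extract the byte counts reported by the CudaCustomAllocator error."""
    fields = {}
    for key in ("requestedSpace", "largestContinuousFreeSpace", "totalFreeSpace"):
        match = re.search(r"\[%s:\s*(\d+)\s*B\]" % key, log_text)
        if match:
            fields[key] = int(match.group(1))
    return fields

# The three lines copied from the error message above
log = """
[requestedSpace: 60989440 B]
[largestContinuousFreeSpace: 22902272 B]
[totalFreeSpace: 53996032 B]
"""
info = parse_allocator_error(log)
shortfall = info["requestedSpace"] - info["totalFreeSpace"]
print("requested %.1f MiB, total free %.1f MiB"
      % (info["requestedSpace"] / 2.0**20, info["totalFreeSpace"] / 2.0**20))
print("short by %d bytes even if the free space were contiguous" % shortfall)
```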
>>
>
>
> --
> Universitätsklinikum Hamburg-Eppendorf (UKE)
> @ Centre for Structural Systems Biology (CSSB)
> @ Institute of Molecular Biotechnology (IMBA)
> Dr. Bohr-Gasse 3-7 (Room 6.14)
> 1030 Vienna, Austria
> Tel.: +43 (1) 790 44-4649
> Email: [log in to unmask]
> http://www.cssb-hamburg.de/
>