Hi,
I have an issue with Scipion: when I launch a Relion 3D classification run, it crashes as it starts the 3rd iteration and complains of a lack of GPU memory. Our system consists of 2x NVIDIA Titan XP and 1x NVIDIA Quadro P5000 GPU cards. This error message has come up sporadically for other runs, but it used to go away when I relaunched the run, whereas this specific C1 3D classification crashes whenever I try it. Is there something one can do to manage the memory usage better? I've attached the error message below.
Many Thanks,
Chris
Expectation iteration 3 of 25
00128: 4.92/22.73 min ............~~(,_,">ERROR: CudaCustomAllocator out of memory
00129: [requestedSpace: 60989440 B]
00130: [largestContinuousFreeSpace: 22902272 B]
00131: [totalFreeSpace: 53996032 B]
00132: [512B] (36864B) (165888B) (36864B) (512B) (512B) (512B) (512B) [3584B] (512B) [12800B] (165888B) (36864B) (165888B) (36864B) (165888B) (36864B) (165888B) (36864B) (165888B) [56320B] (110592B) (110592B) (110592B) (110592B) (110592B) [471040B] (108032B) (109056B) (108032B) (109056B) (108032B) (109056B) (108032B) (109056B) (108032B) (109056B) (70753792B) (141507072B) (141507072B) (141507072B) (141507072B) (283014144B) [7581184B] (69500928B) (139001856B) (139001856B) (139001856B) (139001856B) (278003712B) [7446528B] (73374720B) (146748928B) (146748928B) (146748928B) (146748928B) (293497344B) [7861760B] (71493120B) (142985728B) (142985728B) (142985728B) (142985728B) (285970944B) [7660032B] (72887808B) (145775616B) (145775616B) (145775616B) (145775616B) (291551232B) (48721920B) (48721920B) (54879232B) (54879232B) (57516544B) (57516544B) (61488128B) (61488128B) [22902272B] = 5156189696B
00133: [localhost:15970] *** Process received signal ***
00134: [localhost:15970] Signal: Segmentation fault (11)
00135: [localhost:15970] Signal code: Address not mapped (1)
00136: [localhost:15970] Failing at address: 0x28
00137: [localhost:15970] [ 0] /lib64/libpthread.so.0(+0xf370)[0x7f84fa7fc370]
00138: [localhost:15970] [ 1] /usr/local/scipion/software/em/relion-2.0/lib/librelion_lib.so(_ZN13CudaGlobalPtrIfLb1EE11free_if_setEv+0x40)[0x7f850526b0d0]
00139: [localhost:15970] [ 2] /usr/local/scipion/software/em/relion-2.0/lib/librelion_gpu_util.so(_Z37convertAllSquaredDifferencesToWeightsIfEvjR21OptimisationParamtersR18SamplingParametersP11MlOptimiserP15MlOptimiserCudaRSt6vectorI16IndexedDataArraySaIS9_EERS8_IS8_I20IndexedDataArrayMaskSaISD_EESaISF_EER13CudaGlobalPtrIfLb1EEb+0x3267)[0x7f84fb519f67]
00140: [localhost:15970] [ 3] /usr/local/scipion/software/em/relion-2.0/lib/librelion_gpu_util.so(_ZN15MlOptimiserCuda32doThreadExpectationSomeParticlesEi+0x167a)[0x7f84fb4f2eca]
00141: [localhost:15970] [ 4] /usr/local/scipion/software/em/relion-2.0/lib/librelion_lib.so(_Z36globalThreadExpectationSomeParticlesR14ThreadArgument+0x28)[0x7f850526aa78]
00142: [localhost:15970] [ 5] /usr/local/scipion/software/em/relion-2.0/lib/librelion_lib.so(_Z11_threadMainPv+0x1d)[0x7f850529d86d]
00143: [localhost:15970] [ 6] /lib64/libpthread.so.0(+0x7dc5)[0x7f84fa7f4dc5]
00144: [localhost:15970] [ 7] /lib64/libc.so.6(clone+0x6d)[0x7f84fa52373d]
00145: [localhost:15970] *** End of error message ***
00146: --------------------------------------------------------------------------
00147: mpirun noticed that process rank 1 with PID 15970 on node localhost exited on signal 11 (Segmentation fault).
00148: --------------------------------------------------------------------------
00149: Traceback (most recent call last):
00150: File "/usr/local/scipion/pyworkflow/protocol/protocol.py", line 182, in run
00151: self._run()
00152: File "/usr/local/scipion/pyworkflow/protocol/protocol.py", line 228, in _run
00153: resultFiles = self._runFunc()
00154: File "/usr/local/scipion/pyworkflow/protocol/protocol.py", line 224, in _runFunc
00155: return self._func(*self._args)
00156: File "/usr/local/scipion/pyworkflow/em/packages/relion/protocol_base.py", line 741, in runRelionStep
00157: self.runJob(self._getProgram(), params)
00158: File "/usr/local/scipion/pyworkflow/protocol/protocol.py", line 1077, in runJob
00159: self._stepsExecutor.runJob(self._log, program, arguments, **kwargs)
00160: File "/usr/local/scipion/pyworkflow/protocol/executor.py", line 56, in runJob
00161: env=env, cwd=cwd)
00162: File "/usr/local/scipion/pyworkflow/utils/process.py", line 51, in runJob
00163: return runCommand(command, env, cwd)
00164: File "/usr/local/scipion/pyworkflow/utils/process.py", line 65, in runCommand
00165: check_call(command, shell=True, stdout=sys.stdout, stderr=sys.stderr, env=env, cwd=cwd)
00166: File "/usr/local/scipion/software/lib/python2.7/subprocess.py", line 540, in check_call
00167: raise CalledProcessError(retcode, cmd)
00168: CalledProcessError: Command 'mpirun -np 3 -bynode `which relion_refine_mpi` --gpu --pool 3 --angpix 1.35 --dont_combine_weights_via_disc --ref Runs/004791_ProtRelionClassify3D/tmp/output_volume.mrc --scale --offset_range 5.0 --ini_high 30.0 --offset_step 2.0 --healpix_order 2 --tau2_fudge 2 --ctf --oversampling 1 --o Runs/004791_ProtRelionClassify3D/extra/relion --i Runs/004791_ProtRelionClassify3D/input_particles.star --iter 25 --zero_mask --norm --firstiter_cc --sym c1 --K 6 --solvent_mask Runs/004119_ProtRelionCreateMask3D/extra/mask.mrc --flatten_solvent --particle_diameter 190 --j 15' returned non-zero exit status 139
00169: Protocol failed: Command 'mpirun -np 3 -bynode `which relion_refine_mpi` --gpu --pool 3 --angpix 1.35 --dont_combine_weights_via_disc --ref Runs/004791_ProtRelionClassify3D/tmp/output_volume.mrc --scale --offset_range 5.0 --ini_high 30.0 --offset_step 2.0 --healpix_order 2 --tau2_fudge 2 --ctf --oversampling 1 --o Runs/004791_ProtRelionClassify3D/extra/relion --i Runs/004791_ProtRelionClassify3D/input_particles.star --iter 25 --zero_mask --norm --firstiter_cc --sym c1 --K 6 --solvent_mask Runs/004119_ProtRelionCreateMask3D/extra/mask.mrc --flatten_solvent --particle_diameter 190 --j 15' returned non-zero exit status 139
00170: FAILED: runRelionStep, step 2
00171: 2017-05-31 10:48:22.696428
00172: ------------------- PROTOCOL FAILED (DONE 2/3)
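To put the allocator figures from the error in perspective, here is a minimal Python sketch that parses the three bracketed fields and converts them to MiB. It is not part of Scipion or Relion; the helper name `parse_allocator_report` is hypothetical, and the only input is the text copied from the log above.

```python
import re

# Allocator report copied verbatim from the error message above.
REPORT = """\
[requestedSpace: 60989440 B]
[largestContinuousFreeSpace: 22902272 B]
[totalFreeSpace: 53996032 B]
"""

MIB = 1024 * 1024


def parse_allocator_report(text):
    """Return {field name: size in bytes} for every '[name: N B]' entry.

    Hypothetical helper for illustration only; it just pattern-matches
    the bracketed fields as they appear in the log.
    """
    pairs = re.findall(r"\[(\w+):\s*(\d+)\s*B\]", text)
    return {name: int(value) for name, value in pairs}


fields = parse_allocator_report(REPORT)
for name, size in fields.items():
    print("%s: %.1f MiB" % (name, size / MIB))

# How far the request exceeds everything that is still free.
shortfall = fields["requestedSpace"] - fields["totalFreeSpace"]
print("shortfall vs. total free: %.1f MiB" % (shortfall / MIB))
```

The requested block is larger than the total free space reported by the allocator, not just larger than the largest contiguous free block, so in this report the shortfall cannot be explained by fragmentation alone.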