Hi Takanori,
I think the problem is on our end. The exact same job runs perfectly fine on a different GPU node. Thanks for the help anyway.
Best wishes,
Noor
> On 27 Feb 2019, at 07:54, Takanori Nakane <[log in to unmask]> wrote:
>
> Hi,
>
> Or the file might have got corrupted after being written.
>
> Best regards,
>
> Takanori
>
> On 2019/02/26 21:40, anaa2 wrote:
>> Hi Takanori,
>> I tried to read the volumes using the header command in IMOD and I get the following:
>> ERROR: iiOpen - run_it001_half1_class001.mrc has unknown format. Meanwhile, a parallel refinement on a separate GPU node carried on perfectly fine. I wonder how the refinement could have continued with a corrupt MRC file?
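[As an aside, the "unknown format" verdict can be reproduced without IMOD by inspecting the file header directly. Below is a minimal, hypothetical Python sketch assuming the standard MRC2014 layout (1024-byte header, little-endian, 'MAP ' identifier at byte 208); it is not RELION or IMOD code, just an illustration:]

```python
# Hypothetical MRC header sanity check; assumes the MRC2014 layout and
# little-endian byte order. Not RELION or IMOD code.
import struct

def check_mrc_header(path):
    """Return (nx, ny, nz) if the header looks sane, else raise ValueError."""
    with open(path, "rb") as f:
        hdr = f.read(1024)                       # fixed-size MRC main header
    if len(hdr) < 1024:
        raise ValueError("file is shorter than an MRC header")
    nx, ny, nz = struct.unpack("<3i", hdr[:12])  # first three int32 words
    if hdr[208:212] not in (b"MAP ", b"MAP\x00"):
        raise ValueError("missing 'MAP ' identifier at byte 208")
    if min(nx, ny, nz) <= 0:
        # a zeroed dimension would make a reader request a zero-size buffer
        raise ValueError("bad dimensions %d x %d x %d" % (nx, ny, nz))
    return nx, ny, nz
```

[A zeroed dimension field in such a header would also lead to a zero-byte data allocation on read, which would be consistent with the askMemory error further down the thread.]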
>> Best wishes,
>> Noor
>> On 26.02.2019 21:34, Takanori Nakane wrote:
>>> Hi,
>>>
>>> That is strange. Does it happen on other jobs as well?
>>> On my test dataset, I can continue a GPU job on CPUs.
>>>
>>> Meanwhile, you might want to re-run Refine3D on GPU
>>> with 'skip padding: Yes'. You cannot continue a pad 2
>>> job with pad 1, but this is probably faster than refining
>>> on a CPU from the beginning.
>>>
>>> Best regards,
>>>
>>> Takanori Nakane
>>>
>>> On 2019/02/26 21:26, anaa2 wrote:
>>>> Hi,
>>>>
>>>> I've tried that, but the same error message keeps appearing.
>>>>
>>>> Best wishes,
>>>> Noor
>>>>
>>>>
>>>>
>>>> On 26.02.2019 21:16, Takanori Nakane wrote:
>>>>> Hi,
>>>>>
>>>>> This happens when some files necessary for continuation are
>>>>> missing because the previous job crashed before writing them all.
>>>>> Try continuing from one of the earlier iterations.
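[The search for a usable earlier iteration can be sketched as a small, hypothetical Python helper; it is not part of RELION, and the file names simply follow the run_itNNN_* pattern visible in the command and error messages in this thread:]

```python
# Hypothetical helper: find the newest iteration whose output files were all
# fully written, so --continue can point at an intact optimiser.star.
# File-name patterns follow the run_itNNN_* convention seen in this thread.
import os

def latest_complete_iteration(job_dir, last_iter):
    """Return the newest iteration <= last_iter whose optimiser and half-map
    files all exist and are non-empty, or None if no iteration qualifies."""
    for it in range(last_iter, 0, -1):
        paths = [os.path.join(job_dir, pat % it) for pat in
                 ("run_it%03d_optimiser.star",
                  "run_it%03d_half1_class001.mrc",
                  "run_it%03d_half2_class001.mrc")]
        if all(os.path.exists(p) and os.path.getsize(p) > 0 for p in paths):
            return it
    return None
```

[The corresponding run_itNNN_optimiser.star for the returned iteration can then be passed to --continue.]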
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Takanori Nakane
>>>>>
>>>>> On 2019/02/26 21:12, anaa2 wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I've performed a Refine3D job after a Bayesian polishing job with the stable RELION-3.0 release. The job crashed due to lack of memory on the GPUs, so I tried to continue it from the last iteration on CPUs, but it immediately crashed with the error message below.
>>>>>>
>>>>>> Best wishes,
>>>>>> Noor
>>>>>>
>>>>>>
>>>>>>
>>>>>> =Execute============================================================
>>>>>> /opt/ohpc/pub/mpi/openmpi-gnu/1.10.7/bin/mpirun -np 20 /usr/mbu/software/relion/relion-3.0-20190218-cpu-gnu-openmpi/bin/relion_refine_mpi --continue Refine3D/job363/run_it019_optimiser.star --o Refine3D/job363/run_ct19 --dont_combine_weights_via_disc --no_parallel_disc_io --pool 3 --pad 2 --particle_diameter 340 --solvent_mask MaskCreate/job294/mask.mrc --solvent_correct_fsc --j 10
>>>>>> RELION version: 3.0
>>>>>> Precision: BASE=double, VECTOR-ACC=single
>>>>>>
>>>>>> Reading in optimiser.star ...
>>>>>> in: /usr/packages/relion/relion-3.0-20190218-cpu-gnu-openmpi/src/memory.cpp, line 27
>>>>>> in: /usr/packages/relion/relion-3.0-20190218-cpu-gnu-openmpi/src/memory.cpp, line 27
>>>>>> in: /usr/packages/relion/relion-3.0-20190218-cpu-gnu-openmpi/src/memory.cpp, line 27
>>>>>> === Backtrace ===
>>>>>> === Backtrace ===
>>>>>> === Backtrace ===
>>>>>> /usr/mbu/software/relion/relion-3.0-20190218-cpu-gnu-openmpi/bin/relion_refine_mpi(_ZN11RelionErrorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x5d) [0x48245d]
>>>>>> /usr/mbu/software/relion/relion-3.0-20190218-cpu-gnu-openmpi/bin/relion_refine_mpi(_Z9askMemorym+0x75) [0x4c3d25]
>>>>>> /usr/mbu/software/relion/relion-3.0-20190218-cpu-gnu-openmpi/bin/relion_refine_mpi(_ZN5ImageIdE8readDataEP8_IO_FILEl8DataTypem+0x3a0) [0x4a13d0]
>>>>>> /usr/mbu/software/relion/relion-3.0-20190218-cpu-gnu-openmpi/bin/relion_refine_mpi(_ZN5ImageIdE7readMRCElbRK8FileName+0x3f6) [0x4a1b36]
>>>>>> /usr/mbu/software/relion/relion-3.0-20190218-cpu-gnu-openmpi/bin/relion_refine_mpi(_ZN5ImageIdE5_readERK8FileNameR13fImageHandlerblbb+0x258) [0x4a3688]
>>>>>> /usr/mbu/software/relion/relion-3.0-20190218-cpu-gnu-openmpi/bin/relion_refine_mpi() [0x4b0ef4]
>>>>>> /usr/mbu/software/relion/relion-3.0-20190218-cpu-gnu-openmpi/bin/relion_refine_mpi(_ZN7MlModel4readE8FileName+0x185f) [0x4b8fff]
>>>>>> /usr/mbu/software/relion/relion-3.0-20190218-cpu-gnu-openmpi/bin/relion_refine_mpi(_ZN11MlOptimiser4readE8FileNamei+0x104a) [0x5d828a]
>>>>>> /usr/mbu/software/relion/relion-3.0-20190218-cpu-gnu-openmpi/bin/relion_refine_mpi(_ZN11MlOptimiser4readEiPPci+0x256) [0x5ea2c6]
>>>>>> /usr/mbu/software/relion/relion-3.0-20190218-cpu-gnu-openmpi/bin/relion_refine_mpi(_ZN14MlOptimiserMpi4readEiPPc+0x68) [0x482c78]
>>>>>> /usr/mbu/software/relion/relion-3.0-20190218-cpu-gnu-openmpi/bin/relion_refine_mpi(main+0x57) [0x432b37]
>>>>>> /usr/lib64/libc.so.6(__libc_start_main+0xf5) [0x2abf117af445]
>>>>>> /usr/mbu/software/relion/relion-3.0-20190218-cpu-gnu-openmpi/bin/relion_refine_mpi() [0x4335ef]
>>>>>> ==================
>>>>>> ERROR:
>>>>>> Error in askMemory: Memory allocation size requested is zero!
>>>>>> [a second MPI rank printed the same backtrace and askMemory error]
>>>>>> --------------------------------------------------------------------------
>>>>>> MPI_ABORT was invoked on rank 8 in communicator MPI_COMM_WORLD
>>>>>> with errorcode 1.
>>>>>>
>>>>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>>>>> You may or may not see output from other processes, depending on
>>>>>> exactly when Open MPI kills them.
>>>>>> --------------------------------------------------------------------------
>>>>>> [a third MPI rank printed the same backtrace and askMemory error]
>>>>>> [sledge01.maas:38049] 2 more processes have sent help message help-mpi-api.txt / mpi-abort
>>>>>> [sledge01.maas:38049] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>>>
>>>>>> ########################################################################
>>>>>>
>>>>>> To unsubscribe from the CCPEM list, click the following link:
>>>>>> https://www.jiscmail.ac.uk/cgi-bin/webadmin?SUBED1=CCPEM&A=1
>>>>
>