Hi,
> the number of MPI processes being run per node is the
> same as the number of cores the node has, which is 40 on the nodes that
> are causing the problem. The job gets spread over 5 nodes with 200 MPI
> processes and 15 nodes with 600.
OK, this is the problem: with 40 MPI processes on a 192 GB node, each
process gets less than 5 GB.
For a 40-core node, run 2 MPI processes with 20 threads each instead.
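As a rough sketch (assuming OpenMPI, and keeping the rest of your existing
relion_refine arguments unchanged), the 5-node job would then be launched
along these lines:

"""
mpirun -np 10 --map-by ppr:2:node relion_refine_mpi --j 20 [your other options]
"""

or, if you submit through Slurm, request --ntasks-per-node=2 and
--cpus-per-task=20. Each node still uses all 40 cores (2 x 20 threads),
but every rank now has ~96 GB (192 GB / 2) to work with instead of ~4.8 GB.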
Best regards,
Takanori Nakane
> Hello,
>
> Thank you for getting back to me so quickly! That makes a lot of
> sense. We have 192 GB per node on the cluster this job is being run
> on. I don't have an exact number of MPI processes per node at the
> moment and will have to dig around in the logs for that, but from
> what I can tell, the number of MPI processes being run per node is the
> same as the number of cores the node has, which is 40 on the nodes that
> are causing the problem. The job gets spread over 5 nodes with 200 MPI
> processes and 15 nodes with 600.
>
> Sincerely,
> Alex Townsend
> ________________________________
> From: Takanori Nakane <[log in to unmask]>
> Sent: Wednesday, October 9, 2019 8:59 AM
> To: Alex Townsend <[log in to unmask]>
> Cc: [log in to unmask] <[log in to unmask]>
> Subject: Re: [ccpem] C++ Bad Allocation Error with a GPFS Cluster
>
> Hi,
>
> Probably you are running out of memory.
>
> How much memory does a node have, and how many MPI processes
> do you run on each node? What is the box size?
>
> Best regards,
>
> Takanori Nakane
>
>> Hello,
>>
>> We are attempting to run RELION on a mostly Intel cluster running
>> GPFS, linked with 10 Gbps InfiniBand fiber. We have been trying to run
>> RELION jobs with 200 and 600 MPI processes on this cluster, yet whenever
>> we run a job, the program crashes within approximately 5 hours and the
>> following error invariably shows up in the logs. It seems that
>> relion_refine_mpi runs into a std::bad_alloc error no matter how many
>> cores it is spread across, even with the --maxsig parameter set to 3000
>> and the IB retry count increased. Have any of you experienced similar
>> issues? If so, can you give us any suggestions on how to fix it?
>>
>> """
>> terminate called after throwing an instance of 'std::bad_alloc'
>> what(): std::bad_alloc
>> [FSUHPC:131639] *** Process received signal ***
>> [FSUHPC:131639] Signal: Aborted (6)
>> [FSUHPC:131639] Signal code: (-6)
>> [FSUHPC:131639] [ 0] /usr/lib64/libpthread.so.0(+0xf5e0)[0x2b315aa3c5e0]
>> [FSUHPC:131639] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2b315b8cc1f7]
>> [FSUHPC:131639] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2b315b8cd8e8]
>> [FSUHPC:131639] [ 3]
>> /usr/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x165)[0x2b315aeb1ac5]
>> [FSUHPC:131639] [ 4] /usr/lib64/libstdc++.so.6(+0x5ea36)[0x2b315aeafa36]
>> [FSUHPC:131639] [ 5] /usr/lib64/libstdc++.so.6(+0x5ea63)[0x2b315aeafa63]
>> [FSUHPC:131639] [ 6] /usr/lib64/libstdc++.so.6(+0x5ec83)[0x2b315aeafc83]
>> [FSUHPC:131639] [ 7]
>> /usr/lib64/libstdc++.so.6(_Znwm+0x7d)[0x2b315aeb021d]
>> [FSUHPC:131639] [ 8]
>> /usr/lib64/libstdc++.so.6(_Znam+0x9)[0x2b315aeb02b9]
>> [FSUHPC:131639] [ 9]
>> relion_refine_mpi(_ZN13MultidimArrayIdE7reshapeEllll+0x17f)[0x4888cf]
>> [FSUHPC:131639] [10]
>> relion_refine_mpi(_ZN9Projector26computeFourierTransformMapER13MultidimArrayIdES2_iibb+0x267)[0x486a27]
>> [FSUHPC:131639] [11]
>> relion_refine_mpi(_ZN7MlModel23setFourierTransformMapsEbib+0x68e)[0x54de0e]
>> [FSUHPC:131639] [12]
>> relion_refine_mpi(_ZN11MlOptimiser16expectationSetupEv+0x48)[0x565278]
>> [FSUHPC:131639] [13]
>> relion_refine_mpi(_ZN14MlOptimiserMpi11expectationEv+0x28f)[0x4585df]
>> [FSUHPC:131639] [14]
>> relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0xaa)[0x46501a]
>> [FSUHPC:131639] [15] relion_refine_mpi(main+0xb83)[0x42eee3]
>> [FSUHPC:131639] [16]
>> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b315b8b8c05]
>> [FSUHPC:131639] [17] relion_refine_mpi[0x4322bf]
>> [FSUHPC:131639] *** End of error message ***
>> --------------------------------------------------------------------------
>> Primary job terminated normally, but 1 process returned
>> a non-zero exit code. Per user-direction, the job has been aborted.
>> --------------------------------------------------------------------------
>> """
>>
>> Thank you!
>> Sincerely,
>> Alex Townsend
>>
>> Florida State University
>> Research Computing Center
>>
>>