Hello,
About two weeks ago, one of my Relion-3.0-beta-2 runs crashed in the middle of 3D classification.
Since then I have re-run jobs that worked fine before (same data and parameters), and they also keep crashing.
Is there something wrong with the GPU?
I have had this system (Ubuntu 18.04 with two ASUS RTX 2080 GPUs) for a year, and it has been working fine. The only crashes were at the very beginning, and I fixed those with a solution I found in the forum: in the relion file /src/macros.h,
change MPI_DOUBLE_COMPLEX to MPI_C_DOUBLE_COMPLEX.
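For reference, the edit described above can be sketched as a small Python helper. This is only an illustration of the rename quoted in this post: the function names `patch_mpi_complex` and `patch_file` are hypothetical, and the path in the usage comment is a placeholder for wherever the RELION source tree lives.

```python
import re

# Datatype names exactly as quoted in the post.
OLD = "MPI_DOUBLE_COMPLEX"
NEW = "MPI_C_DOUBLE_COMPLEX"


def patch_mpi_complex(text: str) -> str:
    """Replace whole-word occurrences of OLD with NEW.

    The word boundaries keep the substitution from touching longer
    identifiers, so running it twice gives the same result as once.
    """
    return re.sub(r"\b" + re.escape(OLD) + r"\b", NEW, text)


def patch_file(path: str) -> bool:
    """Patch one file in place; return True if anything changed."""
    with open(path, "r", encoding="utf-8") as handle:
        original = handle.read()
    patched = patch_mpi_complex(original)
    if patched == original:
        return False
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(patched)
    return True


# Hypothetical usage (placeholder path, adjust to the local source tree):
# patch_file("/path/to/relion/src/macros.h")
```

RELION has to be rebuilt afterwards for the change to take effect.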
Back to the problem.
For example, a 3D classification of data that ran perfectly two weeks ago now crashes at the 2nd iteration and reports the problem below.
When this happens the whole computer freezes and I cannot do anything; the only way out is to restart the computer.
I attached two images of the nvidia-smi terminal, one during the crash and one after the crash. After the crash it shows that GPU 1 has an error. Does that mean there is something physically wrong with this GPU? After restarting the computer, GPU 1 is fine again!
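Since the machine freezes, a text record of the GPU state may be easier to share than screenshots. Below is a minimal sketch that reads the `nvidia-smi` CSV query output and flags any GPU whose temperature or memory field is not a plain number. The helper names are hypothetical, and the exact text `nvidia-smi` prints for a faulted GPU is an assumption: the check simply treats any non-numeric value as suspicious.

```python
import subprocess

# Fields requested from nvidia-smi; all are standard --query-gpu properties.
FIELDS = ["index", "name", "temperature.gpu", "memory.used"]


def parse_gpu_status(csv_text):
    """Parse 'csv,noheader,nounits' output into one dict per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        values = [value.strip() for value in line.split(",")]
        if len(values) != len(FIELDS):
            continue  # skip lines that do not match the expected layout
        row = dict(zip(FIELDS, values))
        # Assumption: a healthy GPU reports integers for these two fields.
        row["suspect"] = not (
            row["temperature.gpu"].isdigit() and row["memory.used"].isdigit()
        )
        rows.append(row)
    return rows


def query_gpus():
    """Run nvidia-smi once and return the parsed status of every GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_status(result.stdout)
```

Calling `query_gpus()` every few seconds and appending the result to a file on disk would leave a trace of the last readings before a freeze.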
3D classification error:
Oversampling= 1 NrHiddenVariableSamplingPoints= 18579456
OrientationalSampling= 7.5 NrOrientations= 36864
TranslationalSampling= 1 NrTranslations= 84
=============================
Expectation iteration 2 of 25
24.12/43.22 min .................................~~(,_,">(9216B) (41472B) (10240B) (45056B) (9728B) (43520B) (9216B) (41472B) (9728B) (42496B) (9216B) (40960B) (369664B) (372224B) (369664B) (372224B) [512B] (1024B) (512B) [512B] (512B) (1024B) (512B) [512B] (369664B) [512B] (512B) (1024B) [3584B] (372224B) (369664B) (512B) (512B) (512B) (512B) <1024B> (512B) [2048B] (372224B) (27648B) (7680B) (5632B) [4608B] (27648B) <9728B> [8704B] (27648B) [7168B] (110592B) (110592B) (369664B) (110592B) (369664B) [961536B] (369664B) (372224B) (2322432B) (592896B) [2903040B] (369664B) (372224B) (8366080B) (8366080B) (6592000B) (6592000B) <8472064B> <6698496B> (4693504B) (4693504B) [34049024B] (6620672B) (13241344B) (13241344B) (13241344B) (13241344B) (26482176B) (9851904B) (19703296B) (19703296B) (19703296B) (19703296B) (39406080B) (8206848B) (16413184B) (16413184B) (16413184B) (16413184B) (32825856B) [6568780800B] = 6990885888B
KERNEL_ERROR: an illegal memory access was encountered in /home/arto/programs/relion3beta/src/acc/utilities.h at line 438 (error-code 700)
(9216B) (41472B) (10240B) (45056B) (9728B) (43520B) (9216B) (41472B) (9728B) (42496B) (9216B) (40960B) (369664B) (372224B) (369664B) (372224B) [512B] <1024B> <512B> [512B] (512B) <1024B> <512B> [512B] (369664B) [512B] (512B) <1024B> [3584B] (372224B) (369664B) (512B) (512B) (512B) (512B) <1024B> <512B> [2048B] (372224B) (27648B) <7680B> <5632B> [4608B] (27648B) <9728B> [8704B] (27648B) [7168B] (110592B) (110592B) <369664B> (110592B) (369664B) [961536B] (369664B) (372224B) <2322432B> <592896B> [2903040B] (369664B) (372224B) (8366080B) (8366080B) (6592000B) (6592000B) <8472064B> <6698496B> (4693504B) (4693504B) [34049024B] (6620672B) (13241344B) (13241344B) (13241344B) (13241344B) (26482176B) (9851904B) (19703296B) (19703296B) (19703296B) (19703296B) (39406080B) (8206848B) (16413184B) (16413184B) (16413184B) (16413184B) (32825856B) [6568780800B] = 6990885888B
(9216B) (41472B) (10240B) (45056B) (9728B) (43520B) (9216B) (41472B) (9728B) (42496B) (9216B) (40960B) (369664B) (372224B) (369664B) (372224B) [512B] <1024B> <512B> [512B] (512B) <1024B> <512B> [512B] (369664B) [512B] (512B) <1024B> [3584B] (372224B) (369664B) (512B) (512B) (512B) (512B) <1024B> <512B> [2048B] (372224B) (27648B) <7680B> <5632B> [4608B] (27648B) <9728B> [8704B] (27648B) [7168B] (110592B) (110592B) <369664B> (110592B) <369664B> [961536B] (369664B) (372224B) <2322432B> <592896B> [2903040B] (369664B) (372224B) (8366080B) (8366080B) (6592000B) (6592000B) <8472064B> <6698496B> (4693504B) (4693504B) [34049024B] (6620672B) (13241344B) (13241344B) (13241344B) (13241344B) (26482176B) (9851904B) (19703296B) (19703296B) (19703296B) (19703296B) (39406080B) (8206848B) (16413184B) (16413184B) (16413184B) (16413184B) (32825856B) [6568780800B] = 6990885888B
2.74/4.73 hrs ..................................~~(,_,">--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node klup exited on signal 11 (Segmentation fault).
or
ERROR: an illegal memory access was encountered in /home/arto/programs/relion3beta/src/acc/acc_helper_functions_impl.h at line 1519 (error-code 700)
ERROR: an illegal memory access was encountered in /home/arto/programs/relion3beta/src/acc/cuda/custom_allocator.cuh at line 176 (error-code 700)
in: /home/arto/programs/relion3beta/src/acc/cuda/cuda_settings.h, line 81
in: /home/arto/programs/relion3beta/src/acc/cuda/cuda_settings.hin: /home/arto/programs/relion3beta/src/acc/cuda/cuda_settings.h, line 67, line 67
in: /home/arto/programs/relion3beta/src/acc/cuda/cuda_settings.h, line 81
[klup:30018] *** Process received signal ***
[klup:30018] Signal: Segmentation fault (11)
[klup:30018] Signal code: Address not mapped (1)
[klup:30018] Failing at address: 0x28
ERROR: an illegal memory access was encountered in /home/arto/programs/relion3beta/src/acc/cuda/custom_allocator.cuh at line 176 (error-code 700)
[klup:30018] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7f73c033b890]
[klup:30018] [ 1] in: /home/arto/programs/relion3beta/src/acc/cuda/cuda_settings.hin: /home/arto/programs/relion3beta/build/bin/relion_refine_mpi(_ZN6AccPtrImE9freeIfSetEv+0x48)[0x56181ff43e58]
/home/arto/programs/relion3beta/src/acc/cuda/cuda_settings.h[klup:30018] , line [ 2] 67, line 67
/home/arto/programs/relion3beta/build/bin/relion_refine_mpi(_Z31findThresholdIdxInCumulativeSumIfEmR6AccPtrIT_ES1_+0x48a)[0x56181ff47f2a]
[klup:30018] [ 3]
/home/arto/programs/relion3beta/build/bin/relion_refine_mpi(_Z37convertAllSquaredDifferencesToWeightsI15MlOptimiserCudaEvjR21OptimisationParamtersR18SamplingParametersP11MlOptimiserPT_RSt6vectorI16IndexedDataArraySaISA_EERS9_IS9_I20IndexedDataArrayMaskSaISE_EESaISG_EER6AccPtrIfE13AccPtrFactoryi+0x3560)[0x56181ff72fb0]
[klup:30018] [ 4] /home/arto/programs/relion3beta/build/bin/relion_refine_mpi(+0x207d08)[0x56181ff3cd08]
[klup:30018] [ 5] /home/arto/programs/relion3beta/build/bin/relion_refine_mpi(_ZN15MlOptimiserCuda32doThreadExpectationSomeParticlesEi+0xeb)[0x56181ff3f19b]
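The long lines of block sizes printed before the KERNEL_ERROR above can be totalled with a short script, to see how much memory sits under each bracket style and to compare it against the figure after the `=` sign. This is a sketch only: it makes no assumption about what `( )`, `[ ]` and `< >` mean inside RELION, and the function name `summarize_dump` is hypothetical.

```python
import re
from collections import Counter

# One block looks like (512B), [512B] or <512B>; the total looks like "= 123B".
BLOCK = re.compile(r"([(\[<])(\d+)B[)\]>]")
TOTAL = re.compile(r"=\s*(\d+)B")


def summarize_dump(line):
    """Return (bytes per opening bracket, reported total or None)."""
    per_style = Counter()
    for opener, size in BLOCK.findall(line):
        per_style[opener] += int(size)
    match = TOTAL.search(line)
    reported = int(match.group(1)) if match else None
    return dict(per_style), reported


# Usage: paste one dump line from the log as a string.
# per_style, reported = summarize_dump(dump_line)
# print(per_style, sum(per_style.values()), reported)
```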
Arto