As a quick follow-up: if all 4 MPI processes really wanted 52 GB of
VM simultaneously, that's more than we could support, and we'd be
limited to a maximum of 2 ranks. Fortunately, the intervals where VM
usage shoots up to 52 GB are sporadic and brief. Still, depending on
how things behave, I may find it prudent to cut back to 2 or 3
processes instead of 4.
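
For the record, a quick way to check the aggregate demand (assuming
the processes show up to ps under the command name MotionCor2, as
they do here) is to sum the VSZ column:

  # Total virtual memory (GB) currently requested by all MotionCor2
  # processes; ps reports VSZ in KB, and "vsz=" suppresses the header.
  ps -C MotionCor2 -o vsz= | awk '{s+=$1} END {printf "%.1f GB\n", s/1048576}'

At 4 ranks that total would briefly approach 4 x 52 = 208 GB, hence
the concern on a 128 GB machine.
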
Regards,
-jh-
On 1/9/19 4:19 PM, Heumann John wrote:
> Hi Takanori,
>
> I'd already checked this... the motionCor2 processes are distributed 1
> per gpu as you'd hope, and memory usage on each gpu looks well within
> limits. Also, it wouldn't seem to make sense that going to 1 mpi rank
> would change anything if gpu memory was really the limit.
>
> Based on this and the stack trace I'm inclined to think it must be a
> system rather than gpu memory issue. This is a 128 GB system, and I've
> run 4 mpi / 4 gpu motion correction with super-res movies on it quite
> a few times previously with no problems. The motionCor2 processes
> seem to use ~ 25 - 52 GB of VM each, and I see resident set sizes of
> <= 10 GB. My best guess at the moment is that somebody else must have
> run some process(es) which consumed enough memory that we ran out of
> swap space or physical RAM. This seems consistent with all the
> symptoms, including the sporadic behavior, and doesn't require
> assuming a problem in either the wrapper or motionCor2.
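>
> Next time this happens, a couple of quick checks along these lines
> (both standard Linux tools) should show whether RAM or swap is in
> fact exhausted at that moment:
>
>   # Total / used / free physical RAM and swap, in GB:
>   free -g
>
>   # Memory and swap activity, 1 s samples; the si/so columns show
>   # pages swapped in / out per second:
>   vmstat 1 5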
>
> This system is currently heavily loaded, but I'll check again when
> it's idle and let you know if the issue goes away.
> If that turns out to be the case, I apologize for the false alarm!
>
> Regards,
> -jh-
>
> On 1/9/19 12:24 PM, Takanori Nakane wrote:
>> Hi,
>>
>> Please check which MotionCor2 process uses which GPU by
>> nvidia-smi.
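>>
>> For example, either plain nvidia-smi (the process table at the
>> bottom lists GPU index, PID, process name, and per-process GPU
>> memory) or, in a form that is easier to log:
>>
>>   nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv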
>>
>> > if I use 1 mpi rank and manually assign 4 gpus with all the other
>> options the same the run completes successfully.
>>
>> Do you have enough memory to run four MotionCor2 processes
>> simultaneously? Super-res K2 movies are very big.
>>
>> Best regards,
>>
>> Takanori Nakane
>>
>> On 2019/01/09 19:06, John Heumann wrote:
>>> Hi Takanori (et al),
>>>
>>> I'm afraid it looks to me like an issue with the motionCor2 mpi
>>> wrapper may have reemerged or been reintroduced. Using yesterday's
>>> commit (0ad5bd01eb2d1d3b01f65fad1f5c8cbbefe94704) on a particular
>>> compressed tiff dataset with super-res K2 movies, if I use 4 mpi
>>> ranks and 4 gpus, I get a signal 6 failure (SIGABRT, apparently
>>> from a failed allocation),
>>> regardless of whether I assign the gpus manually with the colon
>>> operator or leave that field blank in the gui and let Relion do the
>>> assignment.
>>>
>>> Conversely, if I use 1 mpi rank and manually assign 4 gpus with all
>>> the other options the same, the run completes successfully. (I used
>>> blanks as delimiters out of habit, so I haven't verified Takanori's
>>> recent fix to the wrapper; I assume that would have worked as well.)
>>>
>>> The 2 failing runstrings were
>>>
>>> `which relion_run_motioncorr_mpi` --i Import/job001/movies.star --o
>>> MotionCorr/job004/ --first_frame_sum 1 --last_frame_sum -1
>>> --use_motioncor2 --motioncor2_exe /usr/bin/MotionCor2 --gpu ""
>>> --bin_factor 2 --bfactor 150 --angpix 0.692 --voltage 300
>>> --dose_per_frame 0.732 --preexposure 0 --patch_x 5 --patch_y 5
>>> --gainref Movies/gainRef.mrc --gain_rot 0 --gain_flip 0
>>> --dose_weighting --only_do_unfinished
>>>
>>> and
>>>
>>> `which relion_run_motioncorr_mpi` --i Import/job001/movies.star --o
>>> MotionCorr/job006/ --first_frame_sum 1 --last_frame_sum -1
>>> --use_motioncor2 --motioncor2_exe /usr/bin/MotionCor2 --gpu
>>> "0:1:2:3" --bin_factor 2 --bfactor 150 --angpix 0.692 --voltage 300
>>> --dose_per_frame 0.732 --preexposure 0 --patch_x 5 --patch_y 5
>>> --gainref Movies/gainRef.mrc --gain_rot 0 --gain_flip 0
>>> --dose_weighting --only_do_unfinished
>>>
>>> The successful one was
>>>
>>> `which relion_run_motioncorr` --i Import/job001/movies.star --o
>>> MotionCorr/job003/ --first_frame_sum 1 --last_frame_sum -1
>>> --use_motioncor2 --motioncor2_exe /usr/bin/MotionCor2 --gpu "0 1 2
>>> 3" --bin_factor 2 --bfactor 150 --angpix 0.692 --voltage 300
>>> --dose_per_frame 0.732 --preexposure 0 --patch_x 5 --patch_y 5
>>> --gainref Movies/gainRef.mrc --gain_rot 0 --gain_flip 0
>>> --dose_weighting --only_do_unfinished
>>>
>>> I've attached one of the run.err files showing the stack trace.
>>> These are 8 GB GTX 1080s. During at least the early part of the
>>> processing, only 4-5 GB per gpu seems to be used. Also, a variable
>>> number (~20 - 80) of movies get corrected successfully before the
>>> failure occurs, so perhaps this reflects a memory leak of some kind?
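>>>
>>> If it would help, one way to test the leak idea is to log process
>>> sizes over the course of a run and see whether VSZ creeps upward
>>> from movie to movie. A minimal sketch (mc2_mem.log is just an
>>> arbitrary file name, and I'm assuming the binary appears to ps
>>> under the command name MotionCor2):
>>>
>>>   # Append a timestamped snapshot of each MotionCor2 process's
>>>   # virtual (VSZ) and resident (RSS) size, in KB, every 10 s:
>>>   while true; do
>>>       { date; ps -C MotionCor2 -o pid,vsz,rss; } >> mc2_mem.log
>>>       sleep 10
>>>   done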
>>>
>>> Of course, running with 1 mpi rank provides an adequate workaround.
>>>
>>> Thanks in advance!
>>>
>>> Regards,
>>> -jh-
>>>
>>
>
--
John M. Heumann
Department of Molecular, Cellular, and Developmental Biology
347 UCB, University of Colorado
Boulder, CO 80309-0347