Dear all,
I ran into an out-of-memory issue when executing the parallel-processing part of bedpostx_gpu on a very small dataset, on a cluster with four K80 GPUs per node and SLURM as the resource manager.
The submission command and the error are as follows:
srun .../FSL/5.0.9/fsl/bin/xfibres_gpu --data=../FSL/bedpostx-gpu/develgpus_smalldata/PD_20150115.bedpostX/data_0 --mask=../FSL/bedpostx-gpu/develgpus_smalldata/PD_20150115.bedpostX/nodif_brain_mask -b ../FSL/bedpostx-gpu/develgpus_smalldata/PD_20150115.bedpostX/bvals -r ../FSL/bedpostx-gpu/develgpus_smalldata/PD_20150115.bedpostX/bvecs --forcedir --logdir=../FSL/bedpostx-gpu/develgpus_smalldata/PD_20150115.bedpostX/diff_parts/data_part_0000 --nf=3 --fudge=1 --bi=1000 --nj=1250 --se=25 --model=2 --cnonlinear ../FSL/bedpostx-gpu/develgpus_smalldata/PD_20150115 0 1 8000
terminate called after throwing an instance of 'thrust::system::detail::bad_alloc'
what(): std::bad_alloc: out of memory
srun: error: jrc0002: task 15: Aborted
Here I set the variable njobs=1 instead of the default value of 4.
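In case it is relevant: as shown above, the task is launched without an explicit GPU request. A minimal sketch of how one could instead bind the task to a dedicated GPU under SLURM (assuming GRES "gpu" is configured on the cluster; --gres is a standard SLURM option, but our site configuration may differ) would be:

srun --ntasks=1 --gres=gpu:1 .../FSL/5.0.9/fsl/bin/xfibres_gpu [same arguments as above]

I have not verified whether an explicit GPU request changes the behaviour here.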
Each node has 128 GB of RAM, and each K80 (a dual-GPU card) offers 24 GB of graphics memory. We use CUDA version 7.5.18.
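For completeness, this is the standard nvidia-smi query one can use to check the per-device memory on the node before the run (the query fields are standard nvidia-smi options; the exact output may vary with driver version):

nvidia-smi --query-gpu=index,name,memory.total,memory.free --format=csv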
Could anyone give me a hint on solving this issue?
Many thanks and regards,
Lekshmi