Dear Rob,

Yes, the way eddy gets a device used to work well, but in more recent CUDA environments there have been problems. I need to get my head around what works (and what doesn’t) nowadays and see if I can fix it. It is on the todo-list.

Jesper

On 29 Jun 2018, at 23:42, Reid, Robert I. (Rob) <[log in to unmask]> wrote:

Hi,
 
I am using eddy_cuda8.0-5.0.11prerelease with Open Grid Engine in a queue where the servers have either 2 or 4 GPUs. By default the jobs all run on the first GPU of each server, and if that GPU runs out of memory it dumps core*. I can spread them out better by hand using CUDA_VISIBLE_DEVICES, e.g.
 
CUDA_VISIBLE_DEVICES=3 /home/apps/packages/fsl/eddy_cuda8.0-5.0.11prerelease --imain…
 
gets it to run on the 3rd (4th if you’re a MATLAB programmer) GPU. It still prints “...................Allocated GPU # 0...................” no matter which GPU it is actually running on, so I guess CUDA_VISIBLE_DEVICES is literally restricting the pool of GPUs eddy can see (and presumably renumbering the visible devices from 0), but the fact that eddy prints the message at all makes me wonder whether more multi-GPU awareness is in the works for eddy_cuda.
 
It looks like using https://pypi.org/project/nvidia-ml-py/ in my eddy wrapper to pick the GPU with the fewest running jobs will be easy, but going beyond that to check how much RAM is left could be tough.  Does anybody have any experience with this?
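Something like this is what I have in mind (a minimal, untested sketch using the pynvml module that nvidia-ml-py installs; “fewest running jobs” here just means fewest compute processes reported by NVML, with free memory as a tie-breaker):

    import os
    import pynvml  # installed by the nvidia-ml-py package

    def least_loaded_gpu():
        # Index of the GPU with the fewest running compute processes,
        # breaking ties by free memory.
        pynvml.nvmlInit()
        try:
            best_key, best_idx = None, 0
            for i in range(pynvml.nvmlDeviceGetCount()):
                h = pynvml.nvmlDeviceGetHandleByIndex(i)
                njobs = len(pynvml.nvmlDeviceGetComputeRunningProcesses(h))
                free = pynvml.nvmlDeviceGetMemoryInfo(h).free
                key = (njobs, -free)  # fewest jobs first, then most free memory
                if best_key is None or key < best_key:
                    best_key, best_idx = key, i
            return best_idx
        finally:
            pynvml.nvmlShutdown()

    # In the wrapper, before launching eddy; the chosen GPU will then
    # appear to eddy as device 0, hence the "Allocated GPU # 0" message.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(least_loaded_gpu())

nvmlDeviceGetMemoryInfo at least makes reading the free memory easy; the part I don’t see how to do is knowing how much a given eddy job is going to need.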
 
* the error is
terminate called after throwing an instance of 'thrust::system::detail::bad_alloc'
  what():  std::bad_alloc: out of memory
 
Thanks,
 
     Rob
 
--
Robert I. Reid, Ph.D., Sr. Analyst/Programmer, Information Technology
Aging and Dementia Imaging Research, Mayo Clinic
200 First Street SW, Rochester, MN 55905, mayoclinic.org


To unsubscribe from the FSL list, click the following link:
https://www.jiscmail.ac.uk/cgi-bin/webadmin?SUBED1=FSL&A=1



