Dear Rob,

Yes, the way eddy gets a device used to work well, but in more recent CUDA environments there have been problems. I need to get my head around what works (and what doesn’t) nowadays and see if I can fix it. It is on the todo-list.

Jesper

On 29 Jun 2018, at 23:42, Reid, Robert I. (Rob) <[log in to unmask]> wrote:

Hi,
 
I am using eddy_cuda8.0-5.0.11prerelease with Open Grid Engine in a queue where the servers have either 2 or 4 GPUs. By default the jobs all run on the first GPU of each server, and if that GPU runs out of memory it dumps core*. I can spread them out better by hand using CUDA_VISIBLE_DEVICES, e.g.
 
CUDA_VISIBLE_DEVICES=3 /home/apps/packages/fsl/eddy_cuda8.0-5.0.11prerelease --imain…
 
gets it to run on the 3rd (4th if you’re a MATLAB programmer) GPU. It still prints “...................Allocated GPU # 0...................” no matter which GPU it is actually running on, so I guess CUDA_VISIBLE_DEVICES is literally restricting the pool of GPUs eddy can see (and presumably renumbering the visible devices from 0), but the fact that eddy prints the message at all makes me wonder whether more multi-GPU awareness is in the works for eddy_cuda.
 
It looks like using https://pypi.org/project/nvidia-ml-py/ in my eddy wrapper to pick the GPU with the fewest running jobs will be easy, but going beyond that to check how much RAM is left could be tough.  Does anybody have any experience with this?
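Something like this is what I have in mind (a minimal, untested sketch using the pynvml module that nvidia-ml-py installs; “fewest running jobs” here just means fewest compute processes reported by NVML, with free memory as a tie-breaker):

    import os
    import pynvml  # installed by the nvidia-ml-py package

    def least_loaded_gpu():
        # Index of the GPU with the fewest running compute processes,
        # breaking ties by free memory.
        pynvml.nvmlInit()
        try:
            best_key, best_idx = None, 0
            for i in range(pynvml.nvmlDeviceGetCount()):
                h = pynvml.nvmlDeviceGetHandleByIndex(i)
                njobs = len(pynvml.nvmlDeviceGetComputeRunningProcesses(h))
                free = pynvml.nvmlDeviceGetMemoryInfo(h).free
                key = (njobs, -free)  # fewest jobs first, then most free memory
                if best_key is None or key < best_key:
                    best_key, best_idx = key, i
            return best_idx
        finally:
            pynvml.nvmlShutdown()

    # In the wrapper, before launching eddy; the chosen GPU will then
    # appear to eddy as device 0, hence the "Allocated GPU # 0" message.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(least_loaded_gpu())

nvmlDeviceGetMemoryInfo at least makes reading the free memory easy; the part I don’t see how to do is knowing how much a given eddy job is going to need.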
 
* the error is
terminate called after throwing an instance of 'thrust::system::detail::bad_alloc'
  what():  std::bad_alloc: out of memory
 
Thanks,
 
     Rob
 
--
Robert I. Reid, Ph.D., Sr. Analyst/Programmer, Information Technology
Aging and Dementia Imaging Research, Mayo Clinic
200 First Street SW, Rochester, MN 55905, mayoclinic.org


To unsubscribe from the FSL list, click the following link:
https://www.jiscmail.ac.uk/cgi-bin/webadmin?SUBED1=FSL&A=1



