JISCMail - FSL Archives

Here is some hardware/driver info from one of our nodes:

==============NVSMI LOG==============

Timestamp : Thu Apr 10 09:51:10 2014

Driver Version : 319.82

Attached GPUs : 3

GPU 0000:08:00.0

Product Name : Tesla K20m

Display Mode : Disabled

Display Active : Disabled

Persistence Mode : Enabled

Accounting Mode : Disabled

Accounting Mode Buffer Size : 128

Driver Model

Current : N/A

Pending : N/A

VBIOS Version : 80.10.39.00.04

Inforom Version

Image Version : 2081.0208.01.09

OEM Object : 1.1

ECC Object : 3.0

Power Management Object : N/A

GPU Operation Mode

Current : Compute

Pending : Compute

----

cheers,

satra

On Thu, Apr 10, 2014 at 9:19 AM, gong jinnan <[log in to unmask]> wrote:

Hi Jonathan,
I have run bedpostx_gpu successfully only once, on DWI data with only 21 gradient directions. Unfortunately, it has never finished without an error on DWI data with 60 or more gradient directions.

Interestingly, I found that it is very likely to crash if I interact with the computer while bedpostx_gpu is running. Has that happened in your situation? Moises helped me check the logs and found that the GPU had enough RAM for my job, so I am also wondering whether CentOS or the CPU is the cause.

Jinnan.

On April 10, 2014, at 20:57, Jonathan Berrebi <[log in to unmask]> wrote:

Hi,
I get the exact same error when running bedpostx_gpu on an NVIDIA Tesla card with CUDA 5.5, although I installed it on a CentOS 6.5 computer. Previously I had tested bedpostx_gpu successfully on Debian wheezy after downloading it from NeuroDebian. It took a while and a simple modification to make it work, but it worked on a laptop with Debian. I am starting to wonder whether it has to do with CentOS. Unfortunately I have to use CentOS because of some other hardware.

I can run the CUDA samples successfully on the CentOS machine. For instance, "deviceQuery" works. A Tesla card has no graphics output, so I would have expected it not to conflict with the graphics driver. But as far as I understand, the NVIDIA graphics driver has to be installed in order to use CUDA. I therefore assume (but I am not sure, since I am no expert in graphics devices) that we should not declare the NVIDIA driver in xorg.conf. Is that correct? (Or should I run nvidia-xconfig?)

Anyway, the way to install CUDA has changed slightly: you no longer need to download the driver beforehand, because the CUDA toolkit installer asks whether you want to install it. Maybe something goes wrong there.

I am sorry if I bring more confusion than solutions, but I got the exact same error this week when I received the Tesla card, and I have had many thoughts since then about what could have gone wrong.

Thank you,

Jonathan

________________________________________
From: FSL - FMRIB's Software Library [[log in to unmask]] on behalf of Moises Hernandez Fernandez [[log in to unmask]]
Sent: Wednesday, April 9, 2014 15:09
To: [log in to unmask]
Subject: Re: [FSL] Bedpostx_gpu couldn't run.

It should not be a temperature problem unless your GPU has a hardware problem. Any video game increases the temperature of the GPU more than bedpostX.

Have you tried running some of the CUDA samples from the toolkit?

You can check the temperature and the memory in use every second by running:
nvidia-smi -l 1
(you can do this in a new terminal while bedpostX is running).

Then, you can see if the process is close to the memory limit.
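
If the full nvidia-smi report is too long to watch by eye, the same check could be scripted. The following is only a rough sketch in Python, assuming a driver whose nvidia-smi supports the --query-gpu option with the index, temperature.gpu, memory.used and memory.total fields; the function names (parse_gpu_rows, query_gpus, watch) and the 90% warning threshold are made up for illustration and are not part of FSL or the NVIDIA tools.

```python
import csv
import io
import subprocess
import time

# Fields requested from nvidia-smi, one CSV row per GPU.
QUERY_FIELDS = "index,temperature.gpu,memory.used,memory.total"


def _to_int(text):
    """Return an int, or None when the driver reports e.g. N/A."""
    try:
        return int(text.strip())
    except ValueError:
        return None


def parse_gpu_rows(csv_text, warn_fraction=0.9):
    """Parse rows of: index, temperature (C), used MiB, total MiB."""
    rows = []
    for rec in csv.reader(io.StringIO(csv_text)):
        if len(rec) != 4:
            continue  # skip blank or unexpected lines
        index, temp, used, total = (_to_int(field) for field in rec)
        # Hypothetical rule: flag a GPU once usage reaches warn_fraction.
        near = (used is not None and total is not None
                and used >= warn_fraction * total)
        rows.append({"index": index, "temp_c": temp, "used_mib": used,
                     "total_mib": total, "near_limit": near})
    return rows


def query_gpus():
    """Run nvidia-smi once and return the parsed rows."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=" + QUERY_FIELDS,
         "--format=csv,noheader,nounits"],
        universal_newlines=True,
    )
    return parse_gpu_rows(out)


def watch(interval=1.0):
    """Print one line per GPU every `interval` seconds until Ctrl-C."""
    while True:
        for gpu in query_gpus():
            flag = "  <-- close to memory limit" if gpu["near_limit"] else ""
            print("GPU {index}: {temp_c} C, {used_mib}/{total_mib} MiB".format(**gpu) + flag)
        time.sleep(interval)
```

Calling watch() from a second terminal while bedpostX is running would print one line per GPU each second, with a marker once memory use crosses the chosen threshold.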

Could you share the output directory?

Moises.