On Mon, 18 Mar 2019 at 06:55, Paul Wright wrote:

Dear Moises

An update on my problem: I have got probtrackx2_gpu to run on the SGE by explicitly selecting the right version of CUDA and by increasing the RAM allocated to the job to 32 GB.
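
For anyone hitting the same error, here is a minimal sketch of the kind of job script this implies. The queue name and the h_vmem resource come from the qacct output quoted below, and CUDA 9.2 is the version Moises suggested. The install path /usr/local/cuda-9.2 and the variable name CUDA_ROOT are assumptions and will differ between sites.

```shell
#!/bin/bash
#$ -q cuda
#$ -l h_vmem=32G

# ASSUMPTION: CUDA 9.2 lives under /usr/local/cuda-9.2 on the cuda node.
# Adjust CUDA_ROOT to wherever your sysadmin installed it.
CUDA_ROOT=/usr/local/cuda-9.2

# Put the chosen toolkit first so it wins over any other installed version.
export PATH="$CUDA_ROOT/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_ROOT/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Record which toolkit the job actually sees, if nvcc is available.
if command -v nvcc >/dev/null 2>&1; then
    nvcc --version | tail -n 1
fi

# The probtrackx2_gpu command, with its usual options, would follow here.
```

Prepending to PATH and LD_LIBRARY_PATH inside the job script keeps the choice of CUDA version local to the job, so other users of the node are not affected.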

A couple of follow-up questions:
1) GPU RAM use is close to the limit, at 9456 MiB of the 11441 MiB available on our card. Should allowing more system RAM for the job take pressure off the GPU RAM?
2) My job took 40 minutes to complete, vs 20 minutes with the non-gpu version of probtrackx2. Is there something I can change to improve this, since it is expected to run faster?
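
For reference, the headroom implied by the figures in question 1 can be worked out with a few lines of shell arithmetic. The variable names are only illustrative; the two numbers are the ones reported for our card.

```shell
#!/bin/bash
# GPU memory figures reported for the job, in MiB.
used_mib=9456
total_mib=11441

# Remaining device memory and integer percentage in use.
free_mib=$(( total_mib - used_mib ))
used_pct=$(( used_mib * 100 / total_mib ))

echo "GPU memory used: ${used_mib} / ${total_mib} MiB (${used_pct}%), headroom: ${free_mib} MiB"
```

This reports 82% in use, with 1985 MiB of headroom on the card.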

Some info about the job:
I am running in network mode with 78 seed masks, plus waypoints of all white matter and avoid masks of the ventricles. I am not saving fdt_paths images, just the fdt_network_matrix. The DWI data are in 2 mm voxels, with seed masks resampled to DWI space, so no transformations are applied on the fly.

It may be that, in this case, my batch will run faster on CPU than on GPU, since we only have a single cuda machine on the grid, but if you can think of anything I can look at that might speed up the GPU job, I'll try it out.
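
To make that trade-off concrete, here is a rough sketch of the batch arithmetic. The per-job times are the ones measured above. The number of jobs and the number of CPU slots are hypothetical placeholders, and the sketch assumes the single cuda machine runs one GPU job at a time.

```shell
#!/bin/bash
# Measured per-job wall-clock times, in minutes.
gpu_min_per_job=40
cpu_min_per_job=20

# HYPOTHETICAL values for illustration only; substitute your own.
n_jobs=100      # size of the batch
cpu_slots=10    # CPU jobs the grid can run at once

# One cuda machine: GPU jobs are assumed to run strictly one after another.
gpu_total_min=$(( n_jobs * gpu_min_per_job ))

# CPU jobs run in waves of cpu_slots; round the number of waves up.
cpu_waves=$(( (n_jobs + cpu_slots - 1) / cpu_slots ))
cpu_total_min=$(( cpu_waves * cpu_min_per_job ))

echo "GPU batch: ${gpu_total_min} min, CPU batch: ${cpu_total_min} min"
```

With these placeholder numbers the CPU batch finishes far sooner, so the GPU only wins if the per-job time drops well below the CPU time divided by the number of CPU slots.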

Best wishes

Paul

On Sat, 2 Mar 2019 21:11:23 -0800, Moises Hernandez wrote:

>Hi Paul,
>I think the jobs are using CUDA 10.0:
>>> Cuda compilation tools, release 10.0, V10.0.130
>but the latest released versions of the tool were CUDA 9.2 & CUDA 9.1 (
>https://users.fmrib.ox.ac.uk/~moisesf/Probtrackx_GPU/Installation.html)
>so what I would try is to install CUDA 9.2 on that machine and make the
>jobs use that version.
>You can have different versions of CUDA on the same machine.
>
>
>On Sat, 2 Mar 2019 at 10:02, Paul Wright wrote:
>
>> Hi Moises
>>
>> Our sysadmin installed the version of probtrackx2_gpu that was appropriate
>> for our cuda machine's version. I will check with him that the versions are
>> still in sync (i.e. no cuda update). Assuming versioning is correct, is there
>> anything else I can do to diagnose? It's a mysterious error, as there seems
>> to be plenty of memory free, and I sent it a job with just two seed masks,
>> which shouldn't take up much memory.
>>
>> Thanks
>> Paul
>>
>>
>> On Thu, 28 Feb 2019 12:05:48 -0500, Moises Hernandez wrote:
>>
>> >Hi Paul,
>> >It sounds to me like a problem related to the CUDA version of the binary
>> >and the architecture of the GPUs.
>> >Are the GPUs different on the SGE machine?
>> >If so, you may need a different CUDA version of probtrackx2_gpu. Maybe
>> >the one you have does not support the GPUs of the SGE machine.
>> >
>> >Moises
>> >
>> >On Thu, 28 Feb 2019 at 07:30, Paul Wright wrote:
>> >
>> >> Dear Moises et al.
>> >>
>> >> I'm using probtrackx2_gpu to run lots of small tracking jobs. My jobs run
>> >> fine on my local Ubuntu machine, with cuda etc. set up, and speed up the
>> >> process noticeably compared with probtrackx2. I want to parallelize the
>> >> batch by sending it to our Sun Grid Engine, which has a cuda machine
>> >> configured, but I'm getting out of memory errors. I allocated up to 16 GB
>> >> to each job, which should be plenty given that my local machine runs them
>> >> with 16 GB RAM, and the grid machine has 125 GB total. Our admin checked
>> >> the logs, and nvidia-smi reports that the job barely used any RAM (copy
>> >> below), so we're trying to figure out what is triggering the error on the
>> >> grid but not on the local machine. (The same job runs OK using the regular,
>> >> non-gpu version of probtrackx2.)
>> >>
>> >> Please let me know if you can help diagnose the problem. I'm happy to
>> >> produce whatever logging you need if you tell me how.
>> >>
>> >> Best wishes
>> >>
>> >> Paul Wright
>> >>
>> >> Command:
>> >> /software/system/fsl/fsl-6.0.0/bin/probtrackx2_gpu -s
>> >> /data/stcog05.bedpostX/merged -m /data/stcog05.bedpostX/nodif_brain_mask
>> >> -x /data/stcog05.probtrack/masksSeed.txt -V 2 --dir=/data/stcog05.probtrack
>> >> --forcedir --network --waypoints=/data/stcog05.probtrack/masksWaypoint.txt
>> >> --waycond=OR --onewaycondition
>> >> --avoid=/data/stcog05.probtrack/masks/ventricles --opd -l
>> >>
>> >> stdout:
>> >> PROBTRACKX2 VERSION GPU
>> >> Log directory is: /data/stcog05.probtrackx
>> >> Running in network mode
>> >> Number of Seeds: 2640
>> >> Dimensions Network Matrix: 2 x 2
>> >>
>> >> Time Loading Data: 22 seconds
>> >>
>> >>
>> >> ...................Allocated GPU 0...................
>> >> Free memory at the beginning: 11911102464 ---- Total memory: 11996954624
>> >> Free memory after copying masks: 11465326592 ---- Total memory: 11996954624
>> >> Running 476136 streamlines in parallel using 2 STREAMS
>> >> Total number of streamlines: 13200000
>> >>
>> >> stderr:
>> >> CUDA Runtime Error: out of memory
>> >>
>> >> uname -a
>> >> Linux nanlnx16.iop.kcl.ac.uk 3.10.0-957.1.3.el7.x86_64 #1 SMP Mon Nov 26 12:36:06 CST 2018 x86_64 x86_64 x86_64 GNU/Linux
>> >>
>> >> hostnamectl
>> >> Static hostname: nanlnx16.iop.kcl.ac.uk
>> >> Icon name: computer
>> >> Machine ID: 183fb3179d0349ed8c4bdc57ca5297ff
>> >> Boot ID: 886d6ba0fd054eb9a3efd995f67fa6a3
>> >> Operating System: Scientific Linux 7.6 (Nitrogen)
>> >> CPE OS Name: cpe:/o:scientificlinux:scientificlinux:7.6:GA
>> >> Kernel: Linux 3.10.0-957.1.3.el7.x86_64
>> >> Architecture: x86-64
>> >>
>> >> modinfo nvidia
>> >> filename: /lib/modules/3.10.0-957.1.3.el7.x86_64/kernel/drivers/video/nvidia.ko
>> >> alias: char-major-195-*
>> >> version: 410.79
>> >> supported: external
>> >> license: NVIDIA
>> >> retpoline: Y
>> >> rhelversion: 7.6
>> >> srcversion: 1283EC37DF82D5A8A902589
>> >> alias: pci:v000010DEd00000E00sv*sd*bc04sc80i00*
>> >> alias: pci:v000010DEd*sv*sd*bc03sc02i00*
>> >> alias: pci:v000010DEd*sv*sd*bc03sc00i00*
>> >> depends: ipmi_msghandler
>> >> vermagic: 3.10.0-957.1.3.el7.x86_64 SMP mod_unload modversions
>> >> parm: NvSwitchRegDwords:NvSwitch regkey (charp)
>> >> parm: NVreg_Mobile:int
>> >> parm: NVreg_ResmanDebugLevel:int
>> >> parm: NVreg_RmLogonRC:int
>> >> parm: NVreg_ModifyDeviceFiles:int
>> >> parm: NVreg_DeviceFileUID:int
>> >> parm: NVreg_DeviceFileGID:int
>> >> parm: NVreg_DeviceFileMode:int
>> >> parm: NVreg_UpdateMemoryTypes:int
>> >> parm: NVreg_InitializeSystemMemoryAllocations:int
>> >> parm: NVreg_UsePageAttributeTable:int
>> >> parm: NVreg_MapRegistersEarly:int
>> >> parm: NVreg_RegisterForACPIEvents:int
>> >> parm: NVreg_CheckPCIConfigSpace:int
>> >> parm: NVreg_EnablePCIeGen3:int
>> >> parm: NVreg_EnableMSI:int
>> >> parm: NVreg_TCEBypassMode:int
>> >> parm: NVreg_UseThreadedInterrupts:int
>> >> parm: NVreg_EnableStreamMemOPs:int
>> >> parm: NVreg_EnableBacklightHandler:int
>> >> parm: NVreg_EnableUserNUMAManagement:int
>> >> parm: NVreg_MemoryPoolSize:int
>> >> parm: NVreg_KMallocHeapMaxSize:int
>> >> parm: NVreg_VMallocHeapMaxSize:int
>> >> parm: NVreg_IgnoreMMIOCheck:int
>> >> parm: NVreg_RegistryDwords:charp
>> >> parm: NVreg_RegistryDwordsPerDevice:charp
>> >> parm: NVreg_RmMsg:charp
>> >> parm: NVreg_GpuBlacklist:charp
>> >> parm: NVreg_AssignGpus:charp
>> >>
>> >> nvcc --version
>> >> nvcc: NVIDIA (R) Cuda compiler driver
>> >> Copyright (c) 2005-2018 NVIDIA Corporation
>> >> Built on Sat_Aug_25_21:08:01_CDT_2018
>> >> Cuda compilation tools, release 10.0, V10.0.130
>> >>
>> >> qacct -u k1347787 -j \* -b 201902221200 -q cuda
>> >> ==============================================================
>> >> qname cuda
>> >> hostname nanlnx16.iop.kcl.ac.uk
>> >> group image
>> >> owner k1347787
>> >> project NONE
>> >> department defaultdepartment
>> >> jobname fscon3vprobtrackx_gpu.job
>> >> jobnumber 4422736
>> >> taskid 1
>> >> account sge
>> >> priority 0
>> >> qsub_time Fri Feb 22 13:17:10 2019
>> >> start_time Fri Feb 22 13:17:16 2019
>> >> end_time Fri Feb 22 13:17:50 2019
>> >> granted_pe NONE
>> >> slots 1
>> >> failed 0
>> >> exit_status 0
>> >> ru_wallclock 34s
>> >> ru_utime 23.006s
>> >> ru_stime 5.679s
>> >> ru_maxrss 5.473MB
>> >> ru_ixrss 0.000B
>> >> ru_ismrss 0.000B
>> >> ru_idrss 0.000B
>> >> ru_isrss 0.000B
>> >> ru_minflt 1541199
>> >> ru_majflt 103
>> >> ru_nswap 0
>> >> ru_inblock 665408
>> >> ru_oublock 19016
>> >> ru_msgsnd 0
>> >> ru_msgrcv 0
>> >> ru_nsignals 0
>> >> ru_nvcsw 13074
>> >> ru_nivcsw 1725
>> >> cpu 28.685s
>> >> mem 26.492GBs
>> >> io 332.312MB
>> >> iow 0.000s
>> >> maxvmem 4.477GB
>> >> arid undefined
>> >> ar_sub_time undefined
>> >> category -u k1347787 -q cuda -l h_vmem=16G
>> >>
>> >> ########################################################################
>> >>
>> >> To unsubscribe from the FSL list, click the following link:
>> >> https://www.jiscmail.ac.uk/cgi-bin/webadmin?SUBED1=FSL&A=1
>> >>