On Sat, 2 Mar 2019 at 10:02, Paul Wright wrote:

Hi Moises

Our sysadmin installed the version of probtrackx2_gpu that matched the CUDA version on our cuda machine. I will check with him that the versions are still in sync (i.e. no CUDA update since). Assuming the versions are correct, is there anything else I can do to diagnose? It's a mysterious error, as there seems to be plenty of memory free, and I sent it a job with just two seed masks, which shouldn't take up much memory.
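
For what it's worth, the "plenty of memory free" point can be checked from the stdout quoted further down with a few lines of Python. This is only a sketch: parse_memory_report and mib are names made up for illustration, and the regex simply matches the two "Free memory" lines as they appear in that output, not any documented log format.

```python
import re

# Memory lines copied from the probtrackx2_gpu stdout quoted later in this thread.
STDOUT = """\
Free memory at the beginning: 11911102464 ---- Total memory: 11996954624
Free memory after copying masks: 11465326592 ---- Total memory: 11996954624
"""

# Hypothetical pattern for this helper; it only matches the two line shapes above.
LINE = re.compile(r"Free memory (.+?): (\d+) -+ Total memory: (\d+)")


def parse_memory_report(text):
    """Return a list of (stage, free_bytes, total_bytes) tuples."""
    return [(m.group(1), int(m.group(2)), int(m.group(3)))
            for m in LINE.finditer(text)]


def mib(n_bytes):
    """Convert a byte count to MiB."""
    return n_bytes / 2**20


report = parse_memory_report(STDOUT)
(_, free_start, total), (_, free_after_masks, _) = report

# GPU memory consumed between the two reports, i.e. by copying the masks.
used_by_masks = free_start - free_after_masks

print(f"masks: {mib(used_by_masks):.1f} MiB")
print(f"free after masks: {100 * free_after_masks / total:.1f}% of the card")
```

By this arithmetic the masks take about 425 MiB and most of the card is still free when the streamlines are launched. The log does not say how much the streamlines themselves request, so this only shows that the mask copy is not what fills the card.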

Thanks
Paul

On Thu, 28 Feb 2019 12:05:48 -0500, Moises Hernandez wrote:

>Hi Paul,
>It sounds to me like a problem related to CUDA binary version and the
>architecture of the GPUs.
>Are the GPUs different on the SGE machine?
>If so, you may need a different CUDA version of probtrackx2_gpu. The one
>installed may not support the GPUs of the SGE machine.
>
>Moises
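
One part of the version question can be checked from the output already in this thread: whether the installed driver is new enough for the installed toolkit. A rough sketch, assuming the minimum Linux driver of 410.48 that NVIDIA's release notes give for CUDA 10.0; MIN_DRIVER, as_tuple and driver_supports are illustrative names only.

```python
# Versions copied from the `modinfo nvidia` and `nvcc --version` output in this thread.
DRIVER_VERSION = "410.79"
TOOLKIT_RELEASE = "10.0"

# Minimum Linux x86_64 driver per toolkit release, taken from NVIDIA's CUDA
# release notes. Only the entry needed here is listed (assumption: extend this
# table by hand for other toolkit releases).
MIN_DRIVER = {"10.0": "410.48"}


def as_tuple(version):
    """Turn '410.79' into (410, 79) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def driver_supports(driver, toolkit):
    """Return True if the driver meets the recorded minimum for the toolkit."""
    minimum = MIN_DRIVER.get(toolkit)
    if minimum is None:
        raise KeyError(f"no minimum driver recorded for CUDA {toolkit}")
    return as_tuple(driver) >= as_tuple(minimum)


print(driver_supports(DRIVER_VERSION, TOOLKIT_RELEASE))
```

This says nothing about which CUDA version or GPU architectures the probtrackx2_gpu binary itself was built for, which is the point raised above, so that still has to be checked against the FSL install.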
>
>On Thu, 28 Feb 2019 at 07:30, Paul Wright wrote:
>
>> Dear Moises et al.
>>
>> I'm using probtrackx2_gpu to run lots of small tracking jobs. My jobs run
>> fine on my local Ubuntu machine, with cuda etc. set up, and speed up the
>> process noticeably compared with probtrackx2. I want to parallelize the
>> batch by sending to our Sun Grid Engine, which has a cuda machine
>> configured, but I'm getting out of memory errors. I allocated up to 16 GB
>> to each job, which should be plenty given that my local machine runs them
>> with 16 GB RAM, and the grid machine has 125 GB total. Our admin checked
>> the logs, and nvidia-smi reports that the job barely used any RAM (copy
>> below), so we're trying to figure out what is triggering the error on the
>> grid but not on the local machine. (The same job runs OK using the regular,
>> non-gpu version of probtrackx2).
>>
>> Please let me know if you can help diagnose the problem. I'm happy to
>> produce whatever logging you need if you tell me how.
>>
>> Best wishes
>>
>> Paul Wright
>>
>> Command:
>> /software/system/fsl/fsl-6.0.0/bin/probtrackx2_gpu -s
>> /data/stcog05.bedpostX/merged -m /data/stcog05.bedpostX/nodif_brain_mask -x
>> /data/stcog05.probtrack/masksSeed.txt -V 2 --dir=/data/stcog05.probtrack
>> --forcedir --network --waypoints=/data/stcog05.probtrack/masksWaypoint.txt
>> --waycond=OR --onewaycondition
>> --avoid=/data/stcog05.probtrack/masks/ventricles --opd -l
>>
>> stdout:
>> PROBTRACKX2 VERSION GPU
>> Log directory is: /data/stcog05.probtrackx
>> Running in network mode
>> Number of Seeds: 2640
>> Dimensions Network Matrix: 2 x 2
>>
>> Time Loading Data: 22 seconds
>>
>>
>> ...................Allocated GPU 0...................
>> Free memory at the beginning: 11911102464 ---- Total memory: 11996954624
>> Free memory after copying masks: 11465326592 ---- Total memory: 11996954624
>> Running 476136 streamlines in parallel using 2 STREAMS
>> Total number of streamlines: 13200000
>>
>> stderr:
>> CUDA Runtime Error: out of memory
>>
>> uname -a
>> Linux nanlnx16.iop.kcl.ac.uk 3.10.0-957.1.3.el7.x86_64 #1 SMP Mon Nov 26
>> 12:36:06 CST 2018 x86_64 x86_64 x86_64 GNU/Linux
>>
>> hostnamectl
>> Static hostname: nanlnx16.iop.kcl.ac.uk
>> Icon name: computer
>> Machine ID: 183fb3179d0349ed8c4bdc57ca5297ff
>> Boot ID: 886d6ba0fd054eb9a3efd995f67fa6a3
>> Operating System: Scientific Linux 7.6 (Nitrogen)
>> CPE OS Name: cpe:/o:scientificlinux:scientificlinux:7.6:GA
>> Kernel: Linux 3.10.0-957.1.3.el7.x86_64
>> Architecture: x86-64
>>
>> modinfo nvidia
>> filename:
>> /lib/modules/3.10.0-957.1.3.el7.x86_64/kernel/drivers/video/nvidia.ko
>> alias: char-major-195-*
>> version: 410.79
>> supported: external
>> license: NVIDIA
>> retpoline: Y
>> rhelversion: 7.6
>> srcversion: 1283EC37DF82D5A8A902589
>> alias: pci:v000010DEd00000E00sv*sd*bc04sc80i00*
>> alias: pci:v000010DEd*sv*sd*bc03sc02i00*
>> alias: pci:v000010DEd*sv*sd*bc03sc00i00*
>> depends: ipmi_msghandler
>> vermagic: 3.10.0-957.1.3.el7.x86_64 SMP mod_unload modversions
>> parm: NvSwitchRegDwords:NvSwitch regkey (charp)
>> parm: NVreg_Mobile:int
>> parm: NVreg_ResmanDebugLevel:int
>> parm: NVreg_RmLogonRC:int
>> parm: NVreg_ModifyDeviceFiles:int
>> parm: NVreg_DeviceFileUID:int
>> parm: NVreg_DeviceFileGID:int
>> parm: NVreg_DeviceFileMode:int
>> parm: NVreg_UpdateMemoryTypes:int
>> parm: NVreg_InitializeSystemMemoryAllocations:int
>> parm: NVreg_UsePageAttributeTable:int
>> parm: NVreg_MapRegistersEarly:int
>> parm: NVreg_RegisterForACPIEvents:int
>> parm: NVreg_CheckPCIConfigSpace:int
>> parm: NVreg_EnablePCIeGen3:int
>> parm: NVreg_EnableMSI:int
>> parm: NVreg_TCEBypassMode:int
>> parm: NVreg_UseThreadedInterrupts:int
>> parm: NVreg_EnableStreamMemOPs:int
>> parm: NVreg_EnableBacklightHandler:int
>> parm: NVreg_EnableUserNUMAManagement:int
>> parm: NVreg_MemoryPoolSize:int
>> parm: NVreg_KMallocHeapMaxSize:int
>> parm: NVreg_VMallocHeapMaxSize:int
>> parm: NVreg_IgnoreMMIOCheck:int
>> parm: NVreg_RegistryDwords:charp
>> parm: NVreg_RegistryDwordsPerDevice:charp
>> parm: NVreg_RmMsg:charp
>> parm: NVreg_GpuBlacklist:charp
>> parm: NVreg_AssignGpus:charp
>>
>> nvcc --version
>> nvcc: NVIDIA (R) Cuda compiler driver
>> Copyright (c) 2005-2018 NVIDIA Corporation
>> Built on Sat_Aug_25_21:08:01_CDT_2018
>> Cuda compilation tools, release 10.0, V10.0.130
>>
>> qacct -u k1347787 -j \* -b 201902221200 -q cuda
>> ==============================================================
>> qname cuda
>> hostname nanlnx16.iop.kcl.ac.uk
>> group image
>> owner k1347787
>> project NONE
>> department defaultdepartment
>> jobname fscon3vprobtrackx_gpu.job
>> jobnumber 4422736
>> taskid 1
>> account sge
>> priority 0
>> qsub_time Fri Feb 22 13:17:10 2019
>> start_time Fri Feb 22 13:17:16 2019
>> end_time Fri Feb 22 13:17:50 2019
>> granted_pe NONE
>> slots 1
>> failed 0
>> exit_status 0
>> ru_wallclock 34s
>> ru_utime 23.006s
>> ru_stime 5.679s
>> ru_maxrss 5.473MB
>> ru_ixrss 0.000B
>> ru_ismrss 0.000B
>> ru_idrss 0.000B
>> ru_isrss 0.000B
>> ru_minflt 1541199
>> ru_majflt 103
>> ru_nswap 0
>> ru_inblock 665408
>> ru_oublock 19016
>> ru_msgsnd 0
>> ru_msgrcv 0
>> ru_nsignals 0
>> ru_nvcsw 13074
>> ru_nivcsw 1725
>> cpu 28.685s
>> mem 26.492GBs
>> io 332.312MB
>> iow 0.000s
>> maxvmem 4.477GB
>> arid undefined
>> ar_sub_time undefined
>> category -u k1347787 -q cuda -l h_vmem=16G
>>
>> ########################################################################
>>
>> To unsubscribe from the FSL list, click the following link:
>> https://www.jiscmail.ac.uk/cgi-bin/webadmin?SUBED1=FSL&A=1
>>
>
