Hi Moises

Our sysadmin installed the build of probtrackx2_gpu that matches the CUDA version on our GPU machine. I will check with him that the versions are still in sync (i.e. that there has been no CUDA update since). Assuming the versions match, is there anything else I can do to diagnose the problem? It's a mysterious error, as there seems to be plenty of memory free, and I sent it a job with just two seed masks, which shouldn't take up much memory.
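
To put numbers on "plenty of memory free", here is a rough sketch that converts the two "Free memory" lines from the stdout quoted below into GiB. The byte counts are copied from that log; the function names (parse_gpu_memory, to_gib) are my own, made up for illustration, and are not part of FSL.

```python
import re

# The two memory lines, copied verbatim from the probtrackx2_gpu stdout below.
LOG = """\
Free memory at the beginning: 11911102464 ---- Total memory: 11996954624
Free memory after copying masks: 11465326592 ---- Total memory: 11996954624
"""

PATTERN = re.compile(r"Free memory (.+?): (\d+) ---- Total memory: (\d+)")


def parse_gpu_memory(log):
    """Return a list of (stage, free_bytes, total_bytes) tuples."""
    return [(m.group(1), int(m.group(2)), int(m.group(3)))
            for m in PATTERN.finditer(log)]


def to_gib(n_bytes):
    """Convert a byte count to GiB (2**30 bytes)."""
    return n_bytes / 2**30


stages = parse_gpu_memory(LOG)
for stage, free, total in stages:
    print(f"{stage}: {to_gib(free):.2f} GiB free of {to_gib(total):.2f} GiB")

# GPU memory consumed by copying the masks, in bytes.
mask_bytes = stages[0][1] - stages[1][1]
print(f"masks used {mask_bytes / 2**20:.1f} MiB")
```

By this arithmetic the masks take roughly 425 MiB, and well over 10 GiB of GPU memory is still reported free at the last line printed before the error.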

Thanks
Paul


On Thu, 28 Feb 2019 12:05:48 -0500, Moises Hernandez <[log in to unmask]> wrote:

>Hi Paul,
>It sounds to me like a problem related to CUDA binary version and the
>architecture of the GPUs.
>Are the GPUs different on the SGE machine?
>If yes, you may need a different CUDA version of probtrackx2_gpu. Maybe
>that one does not support the GPUs of the SGE machine
>
>Moises
>
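
One way to test this suggestion might be to compare the GPU's compute capability with the architectures embedded in the probtrackx2_gpu binary. The sketch below assumes that a tool such as cuobjdump (options --list-elf and --list-ptx) names each embedded image with an sm_NN tag. The listing text and the device value 61 are hypothetical placeholders, not output from either of our machines. The rule it encodes is the general CUDA one: a compiled cubin runs only on devices with the same major version and an equal or higher minor version, while embedded PTX can be JIT-compiled for any equal or newer device.

```python
import re


def parse_archs(listing):
    """Collect numeric targets from names such as 'foo.1.sm_35.cubin'."""
    return sorted({int(n) for n in re.findall(r"sm_(\d+)", listing)})


def cubin_runs_on(cubin_arch, device_cc):
    """A cubin needs the same major version and minor <= the device's."""
    return cubin_arch // 10 == device_cc // 10 and cubin_arch <= device_cc


def ptx_runs_on(ptx_arch, device_cc):
    """PTX can be JIT-compiled for any device at least as new as its target."""
    return ptx_arch <= device_cc


def binary_supports(device_cc, cubin_archs, ptx_archs):
    """True if at least one embedded image can run on the device."""
    return (any(cubin_runs_on(a, device_cc) for a in cubin_archs)
            or any(ptx_runs_on(a, device_cc) for a in ptx_archs))


# Hypothetical listing, for illustration only (not real output).
listing = """\
ELF file    1: probtrackx2_gpu.1.sm_30.cubin
ELF file    2: probtrackx2_gpu.2.sm_35.cubin
"""
cubins = parse_archs(listing)

# 61 stands for a hypothetical device with compute capability 6.1.
print(binary_supports(61, cubins, ptx_archs=[]))    # no usable image
print(binary_supports(61, cubins, ptx_archs=[35]))  # PTX fallback available
```

If the check came back negative for the grid GPU but positive for the local one, that would point towards needing a different build, as suggested above.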
>On Thu, 28 Feb 2019 at 07:30, Paul Wright <
>[log in to unmask]> wrote:
>
>> Dear Moises et al.
>>
>> I'm using probtrackx2_gpu to run lots of small tracking jobs. My jobs run
>> fine on my local Ubuntu machine, with cuda etc. set up, and speed up the
>> process noticeably compared with probtrackx2. I want to parallelize the
>> batch by sending to our Sun Grid Engine, which has a cuda machine
>> configured, but I'm getting out of memory errors. I allocated up to 16 GB
>> to each job, which should be plenty given that my local machine runs them
>> with 16 GB RAM, and the grid machine has 125 GB total. Our admin checked
>> the logs, and nvidia-smi reports that the job barely used any RAM (copy
>> below), so we're trying to figure out what is triggering the error on the
>> grid but not on the local machine. (The same job runs OK using the regular,
>> non-gpu version of probtrackx2).
>>
>> Please let me know if you can help diagnose the problem. I'm happy to
>> produce whatever logging you need if you tell me how.
>>
>> Best wishes
>>
>> Paul Wright
>>
>> Command:
>> /software/system/fsl/fsl-6.0.0/bin/probtrackx2_gpu -s
>> /data/stcog05.bedpostX/merged -m /data/stcog05.bedpostX/nodif_brain_mask -x
>> /data/stcog05.probtrack/masksSeed.txt -V 2 --dir=/data/stcog05.probtrack
>> --forcedir --network --waypoints=/data/stcog05.probtrack/masksWaypoint.txt
>> --waycond=OR --onewaycondition
>> --avoid=/data/stcog05.probtrack/masks/ventricles --opd -l
>>
>> stdout:
>> PROBTRACKX2 VERSION GPU
>> Log directory is: /data/stcog05.probtrackx
>> Running in network mode
>> Number of Seeds: 2640
>> Dimensions Network Matrix: 2 x 2
>>
>> Time Loading Data: 22 seconds
>>
>>
>> ...................Allocated GPU 0...................
>> Free memory at the beginning: 11911102464 ---- Total memory: 11996954624
>> Free memory after copying masks: 11465326592 ---- Total memory: 11996954624
>> Running 476136 streamlines in parallel using 2 STREAMS
>> Total number of streamlines: 13200000
>>
>> stderr:
>> CUDA Runtime Error: out of memory
>>
>> uname -a
>> Linux nanlnx16.iop.kcl.ac.uk 3.10.0-957.1.3.el7.x86_64 #1 SMP Mon Nov 26
>> 12:36:06 CST 2018 x86_64 x86_64 x86_64 GNU/Linux
>>
>> hostnamectl
>>    Static hostname: nanlnx16.iop.kcl.ac.uk
>>          Icon name: computer
>>         Machine ID: 183fb3179d0349ed8c4bdc57ca5297ff
>>            Boot ID: 886d6ba0fd054eb9a3efd995f67fa6a3
>>   Operating System: Scientific Linux 7.6 (Nitrogen)
>>        CPE OS Name: cpe:/o:scientificlinux:scientificlinux:7.6:GA
>>             Kernel: Linux 3.10.0-957.1.3.el7.x86_64
>>       Architecture: x86-64
>>
>> modinfo nvidia
>> filename:
>>  /lib/modules/3.10.0-957.1.3.el7.x86_64/kernel/drivers/video/nvidia.ko
>> alias:          char-major-195-*
>> version:        410.79
>> supported:      external
>> license:        NVIDIA
>> retpoline:      Y
>> rhelversion:    7.6
>> srcversion:     1283EC37DF82D5A8A902589
>> alias:          pci:v000010DEd00000E00sv*sd*bc04sc80i00*
>> alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
>> alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
>> depends:        ipmi_msghandler
>> vermagic:       3.10.0-957.1.3.el7.x86_64 SMP mod_unload modversions
>> parm:           NvSwitchRegDwords:NvSwitch regkey (charp)
>> parm:           NVreg_Mobile:int
>> parm:           NVreg_ResmanDebugLevel:int
>> parm:           NVreg_RmLogonRC:int
>> parm:           NVreg_ModifyDeviceFiles:int
>> parm:           NVreg_DeviceFileUID:int
>> parm:           NVreg_DeviceFileGID:int
>> parm:           NVreg_DeviceFileMode:int
>> parm:           NVreg_UpdateMemoryTypes:int
>> parm:           NVreg_InitializeSystemMemoryAllocations:int
>> parm:           NVreg_UsePageAttributeTable:int
>> parm:           NVreg_MapRegistersEarly:int
>> parm:           NVreg_RegisterForACPIEvents:int
>> parm:           NVreg_CheckPCIConfigSpace:int
>> parm:           NVreg_EnablePCIeGen3:int
>> parm:           NVreg_EnableMSI:int
>> parm:           NVreg_TCEBypassMode:int
>> parm:           NVreg_UseThreadedInterrupts:int
>> parm:           NVreg_EnableStreamMemOPs:int
>> parm:           NVreg_EnableBacklightHandler:int
>> parm:           NVreg_EnableUserNUMAManagement:int
>> parm:           NVreg_MemoryPoolSize:int
>> parm:           NVreg_KMallocHeapMaxSize:int
>> parm:           NVreg_VMallocHeapMaxSize:int
>> parm:           NVreg_IgnoreMMIOCheck:int
>> parm:           NVreg_RegistryDwords:charp
>> parm:           NVreg_RegistryDwordsPerDevice:charp
>> parm:           NVreg_RmMsg:charp
>> parm:           NVreg_GpuBlacklist:charp
>> parm:           NVreg_AssignGpus:charp
>>
>> nvcc --version
>> nvcc: NVIDIA (R) Cuda compiler driver
>> Copyright (c) 2005-2018 NVIDIA Corporation
>> Built on Sat_Aug_25_21:08:01_CDT_2018
>> Cuda compilation tools, release 10.0, V10.0.130
>>
>> qacct -u k1347787 -j \* -b 201902221200 -q cuda
>> ==============================================================
>> qname        cuda
>> hostname     nanlnx16.iop.kcl.ac.uk
>> group        image
>> owner        k1347787
>> project      NONE
>> department   defaultdepartment
>> jobname      fscon3vprobtrackx_gpu.job
>> jobnumber    4422736
>> taskid       1
>> account      sge
>> priority     0
>> qsub_time    Fri Feb 22 13:17:10 2019
>> start_time   Fri Feb 22 13:17:16 2019
>> end_time     Fri Feb 22 13:17:50 2019
>> granted_pe   NONE
>> slots        1
>> failed       0
>> exit_status  0
>> ru_wallclock 34s
>> ru_utime     23.006s
>> ru_stime     5.679s
>> ru_maxrss    5.473MB
>> ru_ixrss     0.000B
>> ru_ismrss    0.000B
>> ru_idrss     0.000B
>> ru_isrss     0.000B
>> ru_minflt    1541199
>> ru_majflt    103
>> ru_nswap     0
>> ru_inblock   665408
>> ru_oublock   19016
>> ru_msgsnd    0
>> ru_msgrcv    0
>> ru_nsignals  0
>> ru_nvcsw     13074
>> ru_nivcsw    1725
>> cpu          28.685s
>> mem          26.492GBs
>> io           332.312MB
>> iow          0.000s
>> maxvmem      4.477GB
>> arid         undefined
>> ar_sub_time  undefined
>> category     -u k1347787 -q cuda -l h_vmem=16G
>>
>> ########################################################################
>>
>> To unsubscribe from the FSL list, click the following link:
>> https://www.jiscmail.ac.uk/cgi-bin/webadmin?SUBED1=FSL&A=1
>>