Hi Paul,
I think the jobs are using CUDA 10.0:
>> Cuda compilation tools, release 10.0, V10.0.130
but the latest released versions of the tool were for CUDA 9.2 and CUDA 9.1 (https://users.fmrib.ox.ac.uk/~moisesf/Probtrackx_GPU/Installation.html),
so what I would try is to install CUDA 9.2 on that machine and make the jobs use that version.
You can have different versions of CUDA on the same machine.
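
In case it is useful, here is a rough Python sketch of both steps: comparing the release that `nvcc --version` reports with the releases listed on the Installation page, and building a job environment that puts a side-by-side CUDA 9.2 install first on the search paths. The helper names `nvcc_release` and `cuda_env` and the install path `/usr/local/cuda-9.2` are assumptions for illustration, not part of FSL. It also assumes probtrackx2_gpu loads the CUDA runtime libraries dynamically, which I have not verified for your build. Note that `nvcc` only reports the toolkit found on PATH, not the version a binary was built against.

```python
import os
import re

# CUDA releases listed on the Probtrackx_GPU Installation page linked above.
SUPPORTED_RELEASES = {"9.1", "9.2"}


def nvcc_release(nvcc_output):
    """Return the toolkit release (e.g. '10.0') from `nvcc --version` text, or None."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def cuda_env(cuda_home, base_env=None):
    """Copy the environment and put cuda_home's bin and lib64 directories first."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_HOME"] = cuda_home
    for var, subdir in (("PATH", "bin"), ("LD_LIBRARY_PATH", "lib64")):
        parts = [os.path.join(cuda_home, subdir)]
        if env.get(var):  # keep whatever was already on the search path
            parts.append(env[var])
        env[var] = os.pathsep.join(parts)
    return env


if __name__ == "__main__":
    reported = nvcc_release("Cuda compilation tools, release 10.0, V10.0.130")
    if reported not in SUPPORTED_RELEASES:
        print(f"nvcc reports CUDA {reported}; the released binaries target "
              f"{sorted(SUPPORTED_RELEASES)}")
    # Assumed install location; adjust to wherever CUDA 9.2 is installed.
    env = cuda_env("/usr/local/cuda-9.2")
    print(env["PATH"].split(os.pathsep)[0])
```

The returned dictionary can be passed as `env=` to `subprocess.run` when launching the job, or the same two variables can be exported at the top of the SGE job script instead.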
 

On Sat, 2 Mar 2019 at 10:02, Paul Wright <[log in to unmask]> wrote:
Hi Moises

Our sysadmin installed the version of probtrackx2_gpu that was appropriate for our CUDA machine's version. I will check with him that the versions are still in sync (i.e. no CUDA update). Assuming the versioning is correct, is there anything else I can do to diagnose? It's a mysterious error, as there seems to be plenty of memory free, and I sent it a job with just two seed masks, which shouldn't take up much memory.

Thanks
Paul


On Thu, 28 Feb 2019 12:05:48 -0500, Moises Hernandez <[log in to unmask]> wrote:

>Hi Paul,
>It sounds to me like a problem related to CUDA binary version and the
>architecture of the GPUs.
>Are the GPUs different on the SGE machine?
>If yes, you may need a different CUDA version of probtrackx2_gpu. Maybe
>that one does not support the GPUs of the SGE machine
>
>Moises
>
>On Thu, 28 Feb 2019 at 07:30, Paul Wright <[log in to unmask]> wrote:
>
>> Dear Moises et al.
>>
>> I'm using probtrackx2_gpu to run lots of small tracking jobs. My jobs run
>> fine on my local Ubuntu machine, with cuda etc. set up, and speed up the
>> process noticeably compared with probtrackx2. I want to parallelize the
>> batch by sending to our Sun Grid Engine, which has a cuda machine
>> configured, but I'm getting out of memory errors. I allocated up to 16 GB
>> to each job, which should be plenty given that my local machine runs them
>> with 16 GB RAM, and the grid machine has 125 GB total. Our admin checked
>> the logs, and nvidia-smi reports that the job barely used any RAM (copy
>> below), so we're trying to figure out what is triggering the error on the
>> grid but not on the local machine. (The same job runs OK using the regular,
>> non-gpu version of probtrackx2).
>>
>> Please let me know if you can help diagnose the problem. I'm happy to
>> produce whatever logging you need if you tell me how.
>>
>> Best wishes
>>
>> Paul Wright
>>
>> Command:
>> /software/system/fsl/fsl-6.0.0/bin/probtrackx2_gpu -s
>> /data/stcog05.bedpostX/merged -m /data/stcog05.bedpostX/nodif_brain_mask -x
>> /data/stcog05.probtrack/masksSeed.txt -V 2 --dir=/data/stcog05.probtrack
>> --forcedir --network --waypoints=/data/stcog05.probtrack/masksWaypoint.txt
>> --waycond=OR --onewaycondition
>> --avoid=/data/stcog05.probtrack/masks/ventricles --opd -l
>>
>> stdout:
>> PROBTRACKX2 VERSION GPU
>> Log directory is: /data/stcog05.probtrackx
>> Running in network mode
>> Number of Seeds: 2640
>> Dimensions Network Matrix: 2 x 2
>>
>> Time Loading Data: 22 seconds
>>
>>
>> ...................Allocated GPU 0...................
>> Free memory at the beginning: 11911102464 ---- Total memory: 11996954624
>> Free memory after copying masks: 11465326592 ---- Total memory: 11996954624
>> Running 476136 streamlines in parallel using 2 STREAMS
>> Total number of streamlines: 13200000
>>
>> stderr:
>> CUDA Runtime Error: out of memory
>>
>> uname -a
>> Linux nanlnx16.iop.kcl.ac.uk 3.10.0-957.1.3.el7.x86_64 #1 SMP Mon Nov 26
>> 12:36:06 CST 2018 x86_64 x86_64 x86_64 GNU/Linux
>>
>> hostnamectl
>>    Static hostname: nanlnx16.iop.kcl.ac.uk
>>          Icon name: computer
>>         Machine ID: 183fb3179d0349ed8c4bdc57ca5297ff
>>            Boot ID: 886d6ba0fd054eb9a3efd995f67fa6a3
>>   Operating System: Scientific Linux 7.6 (Nitrogen)
>>        CPE OS Name: cpe:/o:scientificlinux:scientificlinux:7.6:GA
>>             Kernel: Linux 3.10.0-957.1.3.el7.x86_64
>>       Architecture: x86-64
>>
>> modinfo nvidia
>> filename:       /lib/modules/3.10.0-957.1.3.el7.x86_64/kernel/drivers/video/nvidia.ko
>> alias:          char-major-195-*
>> version:        410.79
>> supported:      external
>> license:        NVIDIA
>> retpoline:      Y
>> rhelversion:    7.6
>> srcversion:     1283EC37DF82D5A8A902589
>> alias:          pci:v000010DEd00000E00sv*sd*bc04sc80i00*
>> alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
>> alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
>> depends:        ipmi_msghandler
>> vermagic:       3.10.0-957.1.3.el7.x86_64 SMP mod_unload modversions
>> parm:           NvSwitchRegDwords:NvSwitch regkey (charp)
>> parm:           NVreg_Mobile:int
>> parm:           NVreg_ResmanDebugLevel:int
>> parm:           NVreg_RmLogonRC:int
>> parm:           NVreg_ModifyDeviceFiles:int
>> parm:           NVreg_DeviceFileUID:int
>> parm:           NVreg_DeviceFileGID:int
>> parm:           NVreg_DeviceFileMode:int
>> parm:           NVreg_UpdateMemoryTypes:int
>> parm:           NVreg_InitializeSystemMemoryAllocations:int
>> parm:           NVreg_UsePageAttributeTable:int
>> parm:           NVreg_MapRegistersEarly:int
>> parm:           NVreg_RegisterForACPIEvents:int
>> parm:           NVreg_CheckPCIConfigSpace:int
>> parm:           NVreg_EnablePCIeGen3:int
>> parm:           NVreg_EnableMSI:int
>> parm:           NVreg_TCEBypassMode:int
>> parm:           NVreg_UseThreadedInterrupts:int
>> parm:           NVreg_EnableStreamMemOPs:int
>> parm:           NVreg_EnableBacklightHandler:int
>> parm:           NVreg_EnableUserNUMAManagement:int
>> parm:           NVreg_MemoryPoolSize:int
>> parm:           NVreg_KMallocHeapMaxSize:int
>> parm:           NVreg_VMallocHeapMaxSize:int
>> parm:           NVreg_IgnoreMMIOCheck:int
>> parm:           NVreg_RegistryDwords:charp
>> parm:           NVreg_RegistryDwordsPerDevice:charp
>> parm:           NVreg_RmMsg:charp
>> parm:           NVreg_GpuBlacklist:charp
>> parm:           NVreg_AssignGpus:charp
>>
>> nvcc --version
>> nvcc: NVIDIA (R) Cuda compiler driver
>> Copyright (c) 2005-2018 NVIDIA Corporation
>> Built on Sat_Aug_25_21:08:01_CDT_2018
>> Cuda compilation tools, release 10.0, V10.0.130
>>
>> qacct -u k1347787 -j \* -b 201902221200 -q cuda
>> ==============================================================
>> qname        cuda
>> hostname     nanlnx16.iop.kcl.ac.uk
>> group        image
>> owner        k1347787
>> project      NONE
>> department   defaultdepartment
>> jobname      fscon3vprobtrackx_gpu.job
>> jobnumber    4422736
>> taskid       1
>> account      sge
>> priority     0
>> qsub_time    Fri Feb 22 13:17:10 2019
>> start_time   Fri Feb 22 13:17:16 2019
>> end_time     Fri Feb 22 13:17:50 2019
>> granted_pe   NONE
>> slots        1
>> failed       0
>> exit_status  0
>> ru_wallclock 34s
>> ru_utime     23.006s
>> ru_stime     5.679s
>> ru_maxrss    5.473MB
>> ru_ixrss     0.000B
>> ru_ismrss    0.000B
>> ru_idrss     0.000B
>> ru_isrss     0.000B
>> ru_minflt    1541199
>> ru_majflt    103
>> ru_nswap     0
>> ru_inblock   665408
>> ru_oublock   19016
>> ru_msgsnd    0
>> ru_msgrcv    0
>> ru_nsignals  0
>> ru_nvcsw     13074
>> ru_nivcsw    1725
>> cpu          28.685s
>> mem          26.492GBs
>> io           332.312MB
>> iow          0.000s
>> maxvmem      4.477GB
>> arid         undefined
>> ar_sub_time  undefined
>> category     -u k1347787 -q cuda -l h_vmem=16G
>>
>> ########################################################################
>>
>> To unsubscribe from the FSL list, click the following link:
>> https://www.jiscmail.ac.uk/cgi-bin/webadmin?SUBED1=FSL&A=1
>>
>
