Hi Moises,

Our sysadmin installed the version of probtrackx2_gpu that was appropriate for the CUDA version on our cuda machine. I will check with him that the versions are still in sync (i.e. no CUDA update since then). Assuming the versions match, is there anything else I can do to diagnose this? It's a mysterious error, as there seems to be plenty of memory free, and I sent it a job with just two seed masks, which shouldn't take up much memory.

Thanks
Paul

On Thu, 28 Feb 2019 12:05:48 -0500, Moises Hernandez <[log in to unmask]> wrote:

>Hi Paul,
>It sounds to me like a problem related to the CUDA binary version and the
>architecture of the GPUs.
>Are the GPUs different on the SGE machine?
>If so, you may need a different CUDA version of probtrackx2_gpu. Maybe
>that one does not support the GPUs of the SGE machine.
>
>Moises
>
>On Thu, 28 Feb 2019 at 07:30, Paul Wright <
>[log in to unmask]> wrote:
>
>> Dear Moises et al.
>>
>> I'm using probtrackx2_gpu to run lots of small tracking jobs. My jobs run
>> fine on my local Ubuntu machine, with cuda etc. set up, and speed up the
>> process noticeably compared with probtrackx2. I want to parallelize the
>> batch by sending it to our Sun Grid Engine, which has a cuda machine
>> configured, but I'm getting out of memory errors. I allocated up to 16 GB
>> to each job, which should be plenty given that my local machine runs them
>> with 16 GB RAM, and the grid machine has 125 GB total. Our admin checked
>> the logs, and nvidia-smi reports that the job barely used any RAM (copy
>> below), so we're trying to figure out what is triggering the error on the
>> grid but not on the local machine. (The same job runs OK using the regular,
>> non-gpu version of probtrackx2.)
>>
>> Please let me know if you can help diagnose the problem. I'm happy to
>> produce whatever logging you need if you tell me how.
>>
>> Best wishes
>>
>> Paul Wright
>>
>> Command:
>> /software/system/fsl/fsl-6.0.0/bin/probtrackx2_gpu -s
>> /data/stcog05.bedpostX/merged -m /data/stcog05.bedpostX/nodif_brain_mask -x
>> /data/stcog05.probtrack/masksSeed.txt -V 2 --dir=/data/stcog05.probtrack
>> --forcedir --network --waypoints=/data/stcog05.probtrack/masksWaypoint.txt
>> --waycond=OR --onewaycondition
>> --avoid=/data/stcog05.probtrack/masks/ventricles --opd -l
>>
>> stdout:
>> PROBTRACKX2 VERSION GPU
>> Log directory is: /data/stcog05.probtrackx
>> Running in network mode
>> Number of Seeds: 2640
>> Dimensions Network Matrix: 2 x 2
>>
>> Time Loading Data: 22 seconds
>>
>>
>> ...................Allocated GPU 0...................
>> Free memory at the beginning: 11911102464 ---- Total memory: 11996954624
>> Free memory after copying masks: 11465326592 ---- Total memory: 11996954624
>> Running 476136 streamlines in parallel using 2 STREAMS
>> Total number of streamlines: 13200000
>>
>> stderr:
>> CUDA Runtime Error: out of memory
>>
>> uname -a
>> Linux nanlnx16.iop.kcl.ac.uk 3.10.0-957.1.3.el7.x86_64 #1 SMP Mon Nov 26
>> 12:36:06 CST 2018 x86_64 x86_64 x86_64 GNU/Linux
>>
>> hostnamectl
>> Static hostname: nanlnx16.iop.kcl.ac.uk
>> Icon name: computer
>> Machine ID: 183fb3179d0349ed8c4bdc57ca5297ff
>> Boot ID: 886d6ba0fd054eb9a3efd995f67fa6a3
>> Operating System: Scientific Linux 7.6 (Nitrogen)
>> CPE OS Name: cpe:/o:scientificlinux:scientificlinux:7.6:GA
>> Kernel: Linux 3.10.0-957.1.3.el7.x86_64
>> Architecture: x86-64
>>
>> modinfo nvidia
>> filename:
>> /lib/modules/3.10.0-957.1.3.el7.x86_64/kernel/drivers/video/nvidia.ko
>> alias: char-major-195-*
>> version: 410.79
>> supported: external
>> license: NVIDIA
>> retpoline: Y
>> rhelversion: 7.6
>> srcversion: 1283EC37DF82D5A8A902589
>> alias: pci:v000010DEd00000E00sv*sd*bc04sc80i00*
>> alias: pci:v000010DEd*sv*sd*bc03sc02i00*
>> alias: pci:v000010DEd*sv*sd*bc03sc00i00*
>> depends: ipmi_msghandler
>> vermagic: 3.10.0-957.1.3.el7.x86_64 SMP
>> mod_unload modversions
>> parm: NvSwitchRegDwords:NvSwitch regkey (charp)
>> parm: NVreg_Mobile:int
>> parm: NVreg_ResmanDebugLevel:int
>> parm: NVreg_RmLogonRC:int
>> parm: NVreg_ModifyDeviceFiles:int
>> parm: NVreg_DeviceFileUID:int
>> parm: NVreg_DeviceFileGID:int
>> parm: NVreg_DeviceFileMode:int
>> parm: NVreg_UpdateMemoryTypes:int
>> parm: NVreg_InitializeSystemMemoryAllocations:int
>> parm: NVreg_UsePageAttributeTable:int
>> parm: NVreg_MapRegistersEarly:int
>> parm: NVreg_RegisterForACPIEvents:int
>> parm: NVreg_CheckPCIConfigSpace:int
>> parm: NVreg_EnablePCIeGen3:int
>> parm: NVreg_EnableMSI:int
>> parm: NVreg_TCEBypassMode:int
>> parm: NVreg_UseThreadedInterrupts:int
>> parm: NVreg_EnableStreamMemOPs:int
>> parm: NVreg_EnableBacklightHandler:int
>> parm: NVreg_EnableUserNUMAManagement:int
>> parm: NVreg_MemoryPoolSize:int
>> parm: NVreg_KMallocHeapMaxSize:int
>> parm: NVreg_VMallocHeapMaxSize:int
>> parm: NVreg_IgnoreMMIOCheck:int
>> parm: NVreg_RegistryDwords:charp
>> parm: NVreg_RegistryDwordsPerDevice:charp
>> parm: NVreg_RmMsg:charp
>> parm: NVreg_GpuBlacklist:charp
>> parm: NVreg_AssignGpus:charp
>>
>> nvcc --version
>> nvcc: NVIDIA (R) Cuda compiler driver
>> Copyright (c) 2005-2018 NVIDIA Corporation
>> Built on Sat_Aug_25_21:08:01_CDT_2018
>> Cuda compilation tools, release 10.0, V10.0.130
>>
>> qacct -u k1347787 -j \* -b 201902221200 -q cuda
>> ==============================================================
>> qname cuda
>> hostname nanlnx16.iop.kcl.ac.uk
>> group image
>> owner k1347787
>> project NONE
>> department defaultdepartment
>> jobname fscon3vprobtrackx_gpu.job
>> jobnumber 4422736
>> taskid 1
>> account sge
>> priority 0
>> qsub_time Fri Feb 22 13:17:10 2019
>> start_time Fri Feb 22 13:17:16 2019
>> end_time Fri Feb 22 13:17:50 2019
>> granted_pe NONE
>> slots 1
>> failed 0
>> exit_status 0
>> ru_wallclock 34s
>> ru_utime 23.006s
>> ru_stime 5.679s
>> ru_maxrss 5.473MB
>> ru_ixrss 0.000B
>> ru_ismrss 0.000B
>> ru_idrss 0.000B
>> ru_isrss 0.000B
>> ru_minflt 1541199
>> ru_majflt 103
>> ru_nswap 0
>> ru_inblock 665408
>> ru_oublock 19016
>> ru_msgsnd 0
>> ru_msgrcv 0
>> ru_nsignals 0
>> ru_nvcsw 13074
>> ru_nivcsw 1725
>> cpu 28.685s
>> mem 26.492GBs
>> io 332.312MB
>> iow 0.000s
>> maxvmem 4.477GB
>> arid undefined
>> ar_sub_time undefined
>> category -u k1347787 -q cuda -l h_vmem=16G
>>
>> ########################################################################
>>
>> To unsubscribe from the FSL list, click the following link:
>> https://www.jiscmail.ac.uk/cgi-bin/webadmin?SUBED1=FSL&A=1
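
For anyone following the thread, the memory figures in the quoted stdout can be checked with a few lines of Python. This is only a rough sketch: the `summarize` helper and the `LOG` constant are made up for illustration and are not part of FSL, and the regular expressions assume the log lines look exactly like the three copied from the stdout above. It does not explain the error; it only confirms how much GPU memory was reported free just before the failure.

```python
import re

# Three lines copied verbatim from the stdout quoted in this thread.
LOG = """\
Free memory at the beginning: 11911102464 ---- Total memory: 11996954624
Free memory after copying masks: 11465326592 ---- Total memory: 11996954624
Running 476136 streamlines in parallel using 2 STREAMS
"""


def summarize(log):
    """Hypothetical helper: derive a few memory figures from the log text."""
    # Every "Free memory ...: <bytes>" value, in order of appearance.
    free = [int(v) for v in re.findall(r"Free memory[^:]*:\s*(\d+)", log)]
    total = int(re.search(r"Total memory:\s*(\d+)", log).group(1))
    streamlines = int(re.search(r"Running (\d+) streamlines", log).group(1))
    return {
        # GPU memory consumed between the first and last report (the masks).
        "mask_bytes": free[0] - free[-1],
        # Share of the card still free after the masks were copied.
        "free_fraction": free[-1] / total,
        # Free bytes divided by the number of parallel streamlines.
        "bytes_per_streamline": free[-1] / streamlines,
    }


summary = summarize(LOG)
for key, value in summary.items():
    print(f"{key}: {value:,.2f}")
```

On these figures the masks take roughly 425 MiB and more than 95% of the card is still reported free when the streamlines are launched, which fits Paul's observation that there seems to be plenty of memory free.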