Hello FSL experts!
We have been running ICA+FIX on subjects from the ABCD 2.0.1 Release on our HPC with some success, but are encountering a variety of unpredictable errors which we have been unable to resolve.
For context, we are parallelizing our ICA+FIX commands on the HPC using the SLURM scheduler, with each subject's ICA+FIX command in the following format:
export MCR_CACHE_ROOT=/lscratch/$SLURM_JOB_ID && module load R fsl connectome-workbench && cd /data/ABCD_MBDU/abcd_bids/bids/derivatives/dcan_reproc/sub-NDARINV3CVRZ501/ses-baselineYear1Arm1/files/MNINonLinear/Results && /data/ABCD_MBDU/goyaln2/fix/fix_multi_run.sh [log in to unmask]@[log in to unmask] 2000 fix_proc/task-rest_concat TRUE /data/ABCD_MBDU/goyaln2/fix/training_files/HCP_Style_Single_Multirun_Dedrift.RData
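For reference, our submission looks roughly like the sketch below: a SLURM array job with one task per subject. The subject list file, sbatch directives, and argument handling here are illustrative placeholders, not our exact script; only the module names and data paths are taken from the command above.

```shell
#!/bin/bash
# Hypothetical SLURM array wrapper: one array task per subject listed in
# subjects.txt. The #SBATCH directives and file names are placeholders.
#SBATCH --job-name=icafix
#SBATCH --time=12:00:00

pick_subject() {
    # $1 = 1-based array task ID, $2 = subject list file (one ID per line)
    sed -n "${1}p" "$2"
}

if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
    subj=$(pick_subject "$SLURM_ARRAY_TASK_ID" subjects.txt)

    # Node-local MCR cache so parallel compiled-MATLAB jobs do not collide.
    export MCR_CACHE_ROOT=/lscratch/$SLURM_JOB_ID

    module load R fsl connectome-workbench
    cd "/data/ABCD_MBDU/abcd_bids/bids/derivatives/dcan_reproc/${subj}/ses-baselineYear1Arm1/files/MNINonLinear/Results"
    # Per-subject FIX arguments are passed through from sbatch.
    /data/ABCD_MBDU/goyaln2/fix/fix_multi_run.sh "$@"
fi
```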
We believe that the errors we are encountering are due to parallelization (the command above worked successfully when run locally, but not in our SLURM job) and to multiple jobs attempting to access the same ICA+FIX and MATLAB MCR resources at once.
For the above command, we are only using the subject's rsfMRI runs 2, 3, 4, and 5, because run 1 was too short (we previously found that runs which are too short cause ICA+FIX to crash in the Compiled_functionhighpassandvariancenormalize stage, though we are unsure why; the previous discussion can be found here: https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind2006&L=FSL&O=D&X=65CC289CECFB55D8A8&Y=dmoracze%40nih.gov&P=112349). The "export MCR_CACHE_ROOT=/lscratch/$SLURM_JOB_ID" was added in an attempt to keep parallel jobs from conflicting with each other.
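To screen out short runs automatically rather than by hand, we have been considering a pre-check along these lines. The MIN_VOLS threshold is our own guess, not a documented FIX limit; fslnvols is the standard FSL tool for counting volumes in a 4D image.

```shell
# Hedged sketch: drop runs that are too short before building the
# fix_multi_run.sh input list. MIN_VOLS is an assumed threshold, not a
# value from the FIX documentation.
MIN_VOLS=100

run_is_long_enough() {
    # $1 = number of volumes in the run
    [ "$1" -ge "$MIN_VOLS" ]
}

# Against real data this would look like (not executed here):
#   nvols=$(fslnvols task-rest01/task-rest01.nii.gz)
#   run_is_long_enough "$nvols" && runs="${runs}@task-rest01/task-rest01.nii.gz"
```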
We are using ICA+FIX version 1.06.15 with compiled MATLAB scripts (MATLAB Compiler Runtime v93, as indicated by the ICA+FIX release). There is a single copy of ICA+FIX and the compiled MATLAB scripts that all the parallel processes are calling. Is this okay? Or is it possible that errors can arise from having many runs (up to 1000 at once) attempting to use the same resources?
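In case two tasks ever end up sharing a cache path, one defensive tweak we have been considering (untested at scale) is making the MCR cache directory unique per process with mktemp, rather than relying on the job ID alone. The scratch location below is a generic placeholder; on our cluster it would be /lscratch/$SLURM_JOB_ID.

```shell
# Hedged sketch: per-process MCR component cache. mktemp -d guarantees a
# unique directory even if two FIX invocations share a SLURM job ID.
scratch="${TMPDIR:-/tmp}"   # on our cluster: /lscratch/$SLURM_JOB_ID
export MCR_CACHE_ROOT=$(mktemp -d "${scratch}/mcr_cache.XXXXXX")
# Remove the cache when the job ends so scratch space is not exhausted.
trap 'rm -rf "$MCR_CACHE_ROOT"' EXIT
```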
So far, 4775 out of 5016 subjects have completed ICA+FIX. However, to get to this point, it was necessary to cancel our batch jobs and resubmit multiple times, each time with a smaller subset of the subjects that failed in the previous round. Some commands fail very quickly, some hang for much longer than expected (e.g., > 6 hours), and some seem to finish without issue.
Some errors we are seeing include:
1. Some runs fail very quickly in the ICA+FIX Compiled_functionhighpassandvariancenormalize stage. Here is an example of the error (we are unsure what return code 1 means):
Mon Jun 29 10:56:45 EDT 2020:fix_multi_run.sh: ERROR: '/data/ABCD_MBDU/goyaln2/fix/call_matlab.sh' command failed with return code: 1
===> ERROR: Command returned with nonzero exit code
---------------------------------------------------
script: fix_multi_run.sh
stopped at line: 468
call: ${matlab_cmd}
expanded call: /data/ABCD_MBDU/goyaln2/fix/call_matlab.sh -c /data/ABCD_MBDU/goyaln2/MCR/v93 -b /data/ABCD_MBDU/goyaln2/hcp_pipeline/HCPpipelines-4.1.3/ICAFIX/scripts/Compiled_functionhighpassandvariancenormalize -f functionhighpassandvariancenormalize 0.800000 2000 task-rest01 /usr/local/apps/connectome-workbench/1.4.2/wb_command
exit code: 1
--------------------------------------------------
This subject's folder looks like this when the ICA+FIX run fails:
/data/ABCD_MBDU/abcd_bids/bids/derivatives/dcan_reproc/sub-NDARINV3CVRZ501/ses-baselineYear1Arm1/files/MNINonLinear/
├── brainmask_fs.nii.gz
├── Results
│   ├── task-rest01
│   │   ├── Movement_Regressors.txt
│   │   ├── task-rest01_Atlas.dtseries.nii
│   │   ├── task-rest01.nii.gz
│   │   └── task-rest01_SBRef.nii.gz
│   ├── task-rest02
│   │   ├── Atlas.nii.gz
│   │   ├── Movement_Regressors_demean.txt
│   │   ├── Movement_Regressors.txt
│   │   ├── task-rest02_Atlas_demean.dtseries.nii
│   │   ├── task-rest02_Atlas.dtseries.nii
│   │   ├── task-rest02_Atlas_mean.dscalar.nii
│   │   ├── task-rest02_demean.nii.gz
│   │   ├── task-rest02_hp2000.ica
│   │   │   └── mc
│   │   │       ├── prefiltered_func_data_mcf_conf_hp.nii.gz
│   │   │       ├── prefiltered_func_data_mcf_conf.nii.gz
│   │   │       └── prefiltered_func_data_mcf.par
│   │   ├── task-rest02_hp2000.nii.gz
│   │   ├── task-rest02_mean.nii.gz
│   │   ├── task-rest02.nii.gz
│   │   └── task-rest02_SBRef.nii.gz
│   ├── task-rest03
│   │   ├── Movement_Regressors.txt
│   │   ├── task-rest03_Atlas.dtseries.nii
│   │   ├── task-rest03.nii.gz
│   │   └── task-rest03_SBRef.nii.gz
│   ├── task-rest04
│   │   ├── Movement_Regressors.txt
│   │   ├── task-rest04_Atlas.dtseries.nii
│   │   ├── task-rest04.nii.gz
│   │   └── task-rest04_SBRef.nii.gz
│   └── task-rest05
│       ├── Movement_Regressors.txt
│       ├── task-rest05_Atlas.dtseries.nii
│       ├── task-rest05.nii.gz
│       └── task-rest05_SBRef.nii.gz
├── T1w.nii.gz
├── T1w_restore_brain.nii.gz
├── T2w.nii.gz
└── wmparc.nii.gz
This subject does run successfully when we run ICA+FIX on their data locally, which again makes us think the errors we are encountering are most likely due to parallelization.
2. When we submit large batch jobs, the error "Maximum number of clients reachedPostVMInit failed to initialize com.mathworks.mwswing.MJStartupForDesktop PostVMInit failed to initialize com.mathworks.mwswing.MJStartup" occurs many times.
3. Some subjects have the error "No convergence after 500 steps" in their logs. We think this is occurring in the MELODIC stage. We found this error in the log for a subject that was supposedly "successful" - is this cause for concern?
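Regarding error 2 above: "Maximum number of clients reached" looks like an X-server message, so we wonder whether each compiled MATLAB process is opening an X connection at startup. One possible (untested) mitigation would be to force the jobs to run fully headless before calling FIX:

```shell
# Untested guess: make sure no X display is inherited by the compiled
# MATLAB runtime, so it cannot try to open X clients at startup.
unset DISPLAY
```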
I have log files from successful and failed runs available for reference if needed.
Any guidance would be greatly appreciated!
Best,
Nik