  1. Looking more carefully, this seems to be related to the scripts not having been compiled with -nojvm: https://www.mathworks.com/matlabcentral/answers/411702-maximum-number-of-clients-reached-error-while-running-large-number-of-compiled-matlab-scripts-on-h. We may be able to fix this in the future, but for now you will likely be limited in how many jobs can run at the same time on the same machine (perhaps 256 or 512, depending on configuration); one way to cap concurrency is sketched below. I am surprised you are hitting this error before running out of cores or RAM, though.
  2. Yes, this is a harmless error.
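
If it helps, a SLURM job-array throttle is one way to cap how many of these run at once. A minimal sketch, assuming a per-subject wrapper script and subject list that are placeholders here, not part of your actual setup:

    #!/bin/bash
    #SBATCH --job-name=icafix
    #SBATCH --array=1-5016%256    # "%256" caps concurrent array tasks at 256
    # Hypothetical wrapper: run ICA+FIX for the Nth subject in a plain-text list.
    subject=$(sed -n "${SLURM_ARRAY_TASK_ID}p" subjects.txt)
    bash run_one_subject_icafix.sh "$subject"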

 

Matt.   

 

From: FSL - FMRIB's Software Library <[log in to unmask]> on behalf of Nikhil Goyal <[log in to unmask]>
Reply-To: FSL - FMRIB's Software Library <[log in to unmask]>
Date: Monday, June 29, 2020 at 1:29 PM
To: "[log in to unmask]" <[log in to unmask]>
Subject: Re: [FSL] ICA+FIX Jobs failing when ran in parallel using SLURM scheduler

 

 

Hi Matt,

 

Thank you for the quick response. We are using compiled MATLAB (MCR v93) and the scripts included in ICA+FIX version 1.06.15 and HCP Pipelines v4.1.3.

 

In regard to your second point, does that mean that runs showing the "no convergence" error were ultimately processed successfully by automatically using fewer components, and can we assume the appearance of this error is harmless?

 

Thanks!

 

Nik

 

On Mon, Jun 29, 2020 at 2:15 PM Glasser, Matthew <[log in to unmask]> wrote:

Is this compiled or interpreted MATLAB? Some of those errors might relate to exceeding your MATLAB licenses.

If there is no ICA convergence, it will try with fewer ICA components and should eventually succeed.

Matt.

On 6/29/20, 12:11 PM, "FSL - FMRIB's Software Library on behalf of Nikhil Goyal" <[log in to unmask] on behalf of [log in to unmask]> wrote:


    Hello FSL experts!

    We have been running ICA+FIX on subjects from the ABCD 2.0.1 release on our HPC with some success, but are encountering a variety of unpredictable errors which we have been unable to resolve.

    For context, we are parallelizing our ICA+FIX commands on the HPC using the SLURM scheduler, with each subject's ICA+FIX command in the following format:

    export MCR_CACHE_ROOT=/lscratch/$SLURM_JOB_ID && module load R fsl connectome-workbench && cd /data/ABCD_MBDU/abcd_bids/bids/derivatives/dcan_reproc/sub-NDARINV3CVRZ501/ses-baselineYear1Arm1/files/MNINonLinear/Results && /data/ABCD_MBDU/goyaln2/fix/fix_multi_run.sh [log in to unmask]@[log in to unmask] 2000 fix_proc/task-rest_concat TRUE /data/ABCD_MBDU/goyaln2/fix/training_files/HCP_Style_Single_Multirun_Dedrift.RData
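
    Each command string is submitted as its own job, roughly as follows (a sketch only; the resource requests and lscratch allocation shown are illustrative, not our exact values):

        # Hypothetical submission of one subject's command; CMD holds the full
        # command string shown above. The --mem/--time values are guesses, and
        # --gres=lscratch is specific to our cluster's local-scratch setup.
        CMD='export MCR_CACHE_ROOT=/lscratch/$SLURM_JOB_ID && module load R fsl connectome-workbench && ...'
        sbatch --cpus-per-task=2 --mem=32g --time=24:00:00 --gres=lscratch:20 --wrap "$CMD"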

    We believe the errors we are encountering are due to parallelization (the command above worked successfully when run locally, but not in our SLURM job) and to multiple jobs attempting to access the same ICA+FIX and MATLAB MCR resources at once.

    For the above command, we are only using the subject's rsfMRI runs 2, 3, 4, and 5 because run 1 was too short (we previously found that runs which are too short cause ICA+FIX to crash in the Compiled_functionhighpassandvariancenormalize stage, but we are unsure why; the previous discussion can be found here: https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind2006&L=FSL&O=D&X=65CC289CECFB55D8A8&Y=dmoracze%40nih.gov&P=112349). The "export MCR_CACHE_ROOT=/lscratch/$SLURM_JOB_ID" was added in an attempt to keep parallel jobs from conflicting over the MCR cache.
    We are using ICA+FIX version 1.06.15 and compiled MATLAB scripts (MATLAB Compiler Runtime v93, as indicated by the ICA+FIX release). There is a single copy of ICA+FIX and the compiled MATLAB scripts that all the parallel processes call. Is this okay? Or can errors arise from having many runs (up to 1000 at once) attempting to use the same resources?

    So far, 4775 out of 5016 subjects have completed ICA+FIX. However, to get to this point we had to cancel our batch jobs and resubmit multiple times, each time on the smaller subset of subjects that failed in the previous round. Some commands fail very quickly, some hang far longer than expected (e.g., > 6 hours), and some finish without issue.
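
    Between rounds we rebuild the failure list roughly as follows (a sketch; the cleaned-output filename we test for is an assumption about what a finished multi-run FIX leaves behind):

        # Sketch: collect subjects whose expected cleaned output is missing.
        base=/data/ABCD_MBDU/abcd_bids/bids/derivatives/dcan_reproc
        > failed_subjects.txt
        while read -r subj; do
            out=$base/$subj/ses-baselineYear1Arm1/files/MNINonLinear/Results/fix_proc/task-rest_concat_hp2000_clean.nii.gz
            [ -e "$out" ] || echo "$subj" >> failed_subjects.txt   # missing output => treat as failed
        done < all_subjects.txt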

    Some errors we are seeing include:
    1. Some runs fail very quickly in the ICA+FIX Compiled_functionhighpassandvariancenormalize stage. Here is an example of the error (we are unsure what return code 1 means):
        Mon Jun 29 10:56:45 EDT 2020:fix_multi_run.sh: ERROR: '/data/ABCD_MBDU/goyaln2/fix/call_matlab.sh' command failed with return code: 1
        ===> ERROR: Command returned with nonzero exit code
        ---------------------------------------------------
                script: fix_multi_run.sh
        stopped at line: 468
                call: ${matlab_cmd}
        expanded call: /data/ABCD_MBDU/goyaln2/fix/call_matlab.sh -c /data/ABCD_MBDU/goyaln2/MCR/v93 -b /data/ABCD_MBDU/goyaln2/hcp_pipeline/HCPpipelines-4.1.3/ICAFIX/scripts/Compiled_functionhighpassandvariancenormalize -f functionhighpassandvariancenormalize 0.800000 2000 task-rest01 /usr/local/apps/connectome-workbench/1.4.2/wb_command
            exit code: 1
        --------------------------------------------------
    The folder for this subject looks like this when the ICA+FIX run fails:
        /data/ABCD_MBDU/abcd_bids/bids/derivatives/dcan_reproc/sub-NDARINV3CVRZ501/ses-baselineYear1Arm1/files/MNINonLinear/
        ├── brainmask_fs.nii.gz
        ├── Results
        │   ├── task-rest01
        │   │   ├── Movement_Regressors.txt
        │   │   ├── task-rest01_Atlas.dtseries.nii
        │   │   ├── task-rest01.nii.gz
        │   │   └── task-rest01_SBRef.nii.gz
        │   ├── task-rest02
        │   │   ├── Atlas.nii.gz
        │   │   ├── Movement_Regressors_demean.txt
        │   │   ├── Movement_Regressors.txt
        │   │   ├── task-rest02_Atlas_demean.dtseries.nii
        │   │   ├── task-rest02_Atlas.dtseries.nii
        │   │   ├── task-rest02_Atlas_mean.dscalar.nii
        │   │   ├── task-rest02_demean.nii.gz
        │   │   ├── task-rest02_hp2000.ica
        │   │   │   └── mc
        │   │   │       ├── prefiltered_func_data_mcf_conf_hp.nii.gz
        │   │   │       ├── prefiltered_func_data_mcf_conf.nii.gz
        │   │   │       └── prefiltered_func_data_mcf.par
        │   │   ├── task-rest02_hp2000.nii.gz
        │   │   ├── task-rest02_mean.nii.gz
        │   │   ├── task-rest02.nii.gz
        │   │   └── task-rest02_SBRef.nii.gz
        │   ├── task-rest03
        │   │   ├── Movement_Regressors.txt
        │   │   ├── task-rest03_Atlas.dtseries.nii
        │   │   ├── task-rest03.nii.gz
        │   │   └── task-rest03_SBRef.nii.gz
        │   ├── task-rest04
        │   │   ├── Movement_Regressors.txt
        │   │   ├── task-rest04_Atlas.dtseries.nii
        │   │   ├── task-rest04.nii.gz
        │   │   └── task-rest04_SBRef.nii.gz
        │   └── task-rest05
        │       ├── Movement_Regressors.txt
        │       ├── task-rest05_Atlas.dtseries.nii
        │       ├── task-rest05.nii.gz
        │       └── task-rest05_SBRef.nii.gz
        ├── T1w.nii.gz
        ├── T1w_restore_brain.nii.gz
        ├── T2w.nii.gz
        └── wmparc.nii.gz

    This subject does run successfully when we run ICA+FIX on their data locally, making us think that the errors we are encountering are most likely due to parallelization.


    2. When we submit large batch jobs, the error "Maximum number of clients reachedPostVMInit failed to initialize com.mathworks.mwswing.MJStartupForDesktop PostVMInit failed to initialize com.mathworks.mwswing.MJStartup" occurs many times.

    3. Some subjects have the error "No convergence after 500 steps" in their logs; we think this occurs in the MELODIC stage. We found this error in the log of a subject that was supposedly "successful" - is this cause for concern?
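
    We find these with a quick scan of the logs, along these lines (the log directory is a placeholder):

        # Sketch: list log files mentioning the MELODIC convergence warning.
        grep -rl "No convergence after 500 steps" /path/to/icafix_logs/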

    I have log files from successful and failed runs available for reference if needed.

    Any guidance would be greatly appreciated!
    Best,
    Nik



 

--

Nikhil Goyal

IRTA Postbac Fellow

National Institute of Mental Health

BS | University of Maryland, College Park 2019

 

