Hello Matthew,
I am a colleague of Diederick. We have just found the reason for the crashes. I would actually say we have found a bug in FSL, or at the very least that the documentation of the fsl_sub script is inaccurate. According to the usage synopsis of fsl_sub (which we did not call directly, but through feat), the -T argument specifies the "estimated job length in minutes, used to let SGE select an appropriate queue". In fact, that time estimate is passed on to qsub as a hard run-time limit (the h_rt option, around line 246 of fsl_sub), which is a rather different thing. Since the task duration estimates are too low for 7T data, this was leading to an early death of either the susan or the melodic step, depending on the server load.

Is there any reason or benefit to using hard time limits? We edited fsl_sub so that the -T option is ignored, and everything runs smoothly now. Another solution would be to increase the task duration estimates that feat produces, but estimates can always be wrong. What if one day someone decides to analyze 15T data with FSL?
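For reference, the relevant part of fsl_sub looks roughly like the sketch below (paraphrased from memory; the variable names are illustrative, not the actual source), together with the change we made:

    # Sketch of what fsl_sub does with the -T estimate (illustrative names):
    # the minutes value ends up as a hard run-time limit passed to qsub.
    if [ -n "$JobTime" ] ; then
        qsub_args="$qsub_args -l h_rt=$(( JobTime * 60 ))"   # minutes -> seconds
    fi

    # Our workaround: comment the block above out, so -T is effectively
    # ignored and the job runs without a hard wall-clock limit.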
In my opinion, setting h_rt should be left to the SGE administrators (i.e. the users of FSL) rather than done by FSL directly. Otherwise you risk finding lots of dead jobs at different stages with no clue as to what happened, especially when you are sure that none of your available queues has hard limits and that your grid nodes are powerful enough to run the jobs, which was our case.
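If hard limits are wanted, the administrators can attach them to the queues themselves, for example (illustrative only; the queue name and value are hypothetical):

    # Set a 24-hour hard run-time limit on a queue named long.q
    qconf -mattr queue h_rt 24:00:00 long.q

That way the limit reflects the actual cluster policy rather than feat's estimate of the job length.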
Best regards,
Germán