Hi-
I am encountering two seemingly unrelated problems submitting tbss_2_reg jobs to SGE.
1. The first is a very large job (-n) which generates 16900 fits. After the *.msf files are generated, SGE returns the rather curious message: “Unable to run job: job rejected: You try to submit a job with more than 75000 tasks. Exiting.” The configured task limit is in fact the default of 75000, but I am not submitting anywhere near that limit. I managed to find one somewhat offhand comment about problems when submitting more than 10K tasks, but the behavior reported there was that the queue would stall or stop. I believe I have combed through all of SGE and FSL to determine what is generating this error, but I am at a loss. To complicate matters, I also encountered this error when submitting a much smaller job (~200) in an attempt to troubleshoot. I believe this was caused by some file or spool directory preserving information from the larger job: after a complete clean-up and reinstall of SGE, the error no longer occurred with the smaller job.
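For anyone wanting to reproduce the comparison, here is a minimal sketch of how the number of tasks in a command file could be checked against the limit. It assumes the commands are in a plain text file with one command per line (the format `fsl_sub -t` takes) and that the limit in question is SGE's `max_aj_tasks` setting; the function names are hypothetical helpers, not part of FSL or SGE.

```shell
#!/bin/sh
# Hypothetical helpers; not part of FSL or SGE.

# Count the non-empty lines in a task file (one command per line).
count_tasks() {
    grep -c -v '^[[:space:]]*$' "$1"
}

# Report whether a task file fits under a given array-task limit.
# $1 = task file, $2 = limit (assumed to be the max_aj_tasks value)
check_task_limit() {
    n=$(count_tasks "$1")
    if [ "$n" -gt "$2" ]; then
        echo "over: $n tasks > limit $2"
        return 1
    fi
    echo "ok: $n tasks <= limit $2"
}

# On a host with SGE installed, the configured limit can be read with:
#   qconf -sconf | grep max_aj_tasks
```

If the count reported here is far below the limit while SGE still rejects the job, that would support the suspicion that the number SGE sees is not the number of lines in the command file.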
2. However, with the smaller job, there is another problem which seems to be more FSL-related. I have explicitly set the variable "queue" in 'fsl_sub' to "verylong.q". When I run the smaller job, it is submitted and assigned a sequential job ID; however, the "queue" variable is ignored: the job runs in 'short.q' and is processed serially (one command at a time), using 1 of the 12 processors.
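One way to tell whether the queue request ever reaches SGE is to look at the job's own record. The sketch below is only an illustration: it assumes that `qstat -j <job_id>` prints a `hard_queue_list:` line when a queue was requested with `-q`, and the function name and job ID are placeholders.

```shell
#!/bin/sh
# Hypothetical helper; not part of FSL or SGE.

# Print the hard queue request found in `qstat -j <job_id>` text on stdin.
# Prints nothing if the job carries no explicit queue request.
requested_queue() {
    sed -n 's/^hard_queue_list: *//p'
}

# Usage on a host with SGE installed (12345 is a placeholder job ID):
#   qstat -j 12345 | requested_queue
```

An empty result would suggest that fsl_sub never passed the queue to qsub; a result of verylong.q would point instead at the scheduler or the queue configuration.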
The cluster consists of 3 nodes each with 6 x 2 cores and 24 GB RAM, so I am confident that hardware limitations can be ruled out. I have also confirmed ('qping') that all the nodes are actively communicating with the master using the correct ports.
The only configuration change I have made was to "load_thresholds np_load_avg". The default was 1.75 and I currently have it set to 32. My understanding of this parameter is that it should cause all cores to be maxed out 100% of the time. Regardless of whether the value is 'small' or 'large', there is no change in the behavior described above.
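To confirm which threshold a queue is actually using, the value can be pulled out of the queue definition. This is a rough sketch, assuming `qconf -sq <queue>` prints one `name value` pair per line with a `load_thresholds` entry; the function name is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper; not part of SGE.

# Print the load_thresholds value found in `qconf -sq <queue>` text on stdin.
queue_load_thresholds() {
    sed -n 's/^load_thresholds  *//p'
}

# Usage on a host with SGE installed:
#   qconf -sq verylong.q | queue_load_thresholds
#   qconf -sq short.q    | queue_load_thresholds
```

Running it against both queues shows whether the change was applied to the queue the job really ran in.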
Any thoughts or suggestions (or confirmation of similar problems) are much appreciated.
Regards,
Wil