John,
I think your configuration of SGE is probably okay. It looks more like
the command strings to fsl_sub are getting corrupted somehow. Could
you send me your new fsl_sub so I can see what you've changed?
Are you able to use fsl_sub to submit simple command lines, e.g.,
"fsl_sub ls"?
Did you make any changes other than those to fsl_sub?
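One rough way to narrow this down from a shell on a compute node, offered as a sketch rather than anything official: take the first word of the argument list from your log and see whether it resolves to a program at all. In the log, the job name feat5_stop sits where the command should be, followed by what look like qsub options (-m, -o, -e, -hold_jid), which is why I suspect the submit line is being mangled. The helper names resolves_on_path and first_word are made up for illustration, and the logged string is shortened from your message.

```shell
# Succeed if NAME can be found by the shell. This approximates the PATH
# search that execvp() performs, but note that "command -v" also matches
# shell builtins and functions, so it is only a rough stand-in.
resolves_on_path() {
    command -v "$1" >/dev/null 2>&1
}

# Print the first word of a command string, i.e. what would be executed.
first_word() {
    set -- $1            # deliberate word splitting on whitespace
    printf '%s\n' "$1"
}

# Argument list copied from the SGE log entry (trailing arguments omitted).
logged='feat5_stop -m n -o logs -e logs -hold_jid 107,108,111,110 /usr/local/fsl/bin/feat'

prog=$(first_word "$logged")
if resolves_on_path "$prog"; then
    echo "$prog: found"
else
    echo "$prog: not found, so an execvp() of it would fail"
fi
```

Running the same check with ls in place of feat5_stop should report it as found, which would confirm that the node's PATH is fine and that the problem lies in the command that gets handed to SGE.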
On 21 Oct 2008, at 07:59, John Bushnell wrote:
> FSL users,
>
> I've set up a Linux cluster (CentOS 5, x86_64) with SGE (version 6.1u5) and
> installed FSL (version fsl-4.1.0-centos5_64 with patch
> fsl-centos5_64-patch-4.1.1_from_4.1.0). I am able to launch simple test
> jobs on the SGE queue (configured as a single queue "all.q"). I have edited
> the file /usr/local/fsl/bin/fsl_sub to use this single queue as I had
> previously done on a Rocks 4.3 cluster where we were running parallel
> FSL/bedpostx jobs. And I have added the file fsl.sh to /etc/profile.d on
> the head node and all compute nodes which contains:
>
> FSLDIR=/usr/local/fsl
> . ${FSLDIR}/etc/fslconf/fsl.sh
> PATH=${FSLDIR}/bin:${PATH}
> export FSLDIR PATH
>
> since all users are using the bash shell. (/usr/local is NFS mounted from
> the head node across all of the compute nodes.)
>
> Now, I have a user who is trying to run FSL/feat jobs as the first user of
> this cluster, and the jobs all seem to get submitted and then die without
> any output. However, I am seeing this in the SGE logs for every job:
>
> 10/20/2008 17:20:11|qmaster|hydra|W|job 112.1 failed on host node02.bic.ucsb.edu general searching requested shell because: 10/20/2008 17:20:10 [506:7430]: execvp(feat5_stop, "feat5_stop" "-m" "n" "-o" "logs" "-e" "logs" "-hold_jid" "107,108,111,110" "/usr/local/fsl/bin/feat" "/home/nwymbs/Chunk_fMRI/subjects/sg_004/test++.feat/design.fsf" "-D" "/home/nwymbs/Chunk_fMRI/subjects/sg_004/test++.feat" "-stop") failed: No such file or directory
> 10/20/2008 17:20:11|qmaster|hydra|W|rescheduling job 112.1
>
> (I see similar errors for feat5_reg, feat4_post, etc.)
>
> And this leaves me puzzled. Where is this strange call to "execvp" coming
> from? Why is the first argument "feat5_stop" which doesn't exist as a file
> anywhere on the system? I see references to feat5_stop in
> /usr/local/fsl/bin/feat, but the first argument to execvp is supposed to be
> a loadable program file (from my understanding after reading the man page
> for execvp). Is FSL somehow not configured correctly to talk to SGE? Or do
> I have to configure SGE somehow?
>
> Hopefully I have just missed something simple, but I'm rather stumped right
> now. Any suggestions for figuring this out would be greatly appreciated.
>
> Thanks for any ideas! - John
>
Cheers, Dave
--
Dave Flitney, IT Manager
Oxford Centre for Functional MRI of the Brain
W:+44-1865-222713 F:+44-1865-222717
URL: http://www.fmrib.ox.ac.uk/~flitney