Ah, so it says I/O error, but this means it's out of memory. Right...
I'll now chalk up the hours Mark and I spent trying to debug file
transfers to the CE as gaining a deeper understanding of the Zen of
Globus error messages. Or is that actually Alice in Wonderland?
`When I use an error message,' Humpty Dumpty said, in rather a
scornful tone, `it means just what I choose it to mean -- neither
more nor less.'
`The question is,' said Alice, `whether you can make error messages
mean so many different things.'
`The question is,' said Humpty Dumpty, `which is to be master --
that's all.'
Alice was too much puzzled to say anything; so after a minute Humpty
Dumpty began again. `They've a temper, some of them -- particularly
Globus errors: they're the proudest -- batch system errors you can do
anything with, but not Globus errors -- however, I can manage the
whole lot of them! Impenetrability! That's what I say!'
`Would you tell me please,' said Alice, `what that means?'
Graeme
PS. https://savannah.cern.ch/bugs/index.php?25048 offers a path to
atonement, if not enlightenment...
On 26 Mar 2007, at 16:17, Maarten Litmaath wrote:
> Maarten Litmaath wrote:
>
>> Mark Nelson wrote:
>>> Hello
>>>
>>> I have a problem with my site, the site has been working
>>> perfectly for months, this morning at ~ 1am it suddenly stopped
>>> accepting jobs, the following have been checked -
>>>
>>> 1. /tmp and /scratch (TMP directory for grid jobs) are not full
>>> 2. Home directories are not full and there are no issues with
>>> quotas
>>> 3. Can submit to the batch system via qsub on the CE.
>>> 4. edg-job-submit submits job.
>>> 5. edg-job-status returns the following error -
>>> "cannot plan: BrokerHelper: no compatible resource."
>> For the "ops" VO SAM reports this:
>> Globus error 3: an I/O operation failed
>> Did you look at this Wiki entry:
>> http://goc.grid.sinica.edu.tw/gocwiki/Globus_error_3
>> In particular, what is the current memory usage on your CE?
>
> Indeed, you have a huge number of globus-job-manager processes
> stuck like this:
>
> ----------------------------------------------------------------------------------
> atlas032 9591 0.1 0.1 5008 2844 ? S 15:58 0:00 globus-job-manager
>     -conf /opt/globus/etc/globus-job-manager.conf -type lcgpbs
>     -rdn jobmanager-lcgpbs -machine-type unknown -publish-jobs
> atlas032 9615 0.5 0.2 7612 6060 ? S 15:59 0:00 \_ /usr/bin/perl
>     /opt/globus/libexec/globus-job-manager-script.pl -m lcgpbs
>     -f /tmp/gram_EHLVFj -c cache_cleanup
> ----------------------------------------------------------------------------------
>
> Your home directories are NFS-automounted: were there any problems
> with NFS or the automounter recently?
>
> Can you do the following:
>
> lsof -p 9615 > lsof.out
> strace -p 9615 -o strace.out
>
> Interrupt the strace after some time and send me the output of both
> commands.
>
> Please leave a few of those stuck processes around (e.g. one per
> grid account) and kill the rest.
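
A rough sketch of the "leave one per grid account and kill the rest"
step, in case it saves anyone some typing. It only works out which PIDs
would be killed; it does not kill anything. The function names, the
match pattern and the `ps -eo user,pid,args` column layout are
assumptions for illustration, not anything taken from the Globus or LCG
tooling.

```python
import subprocess
from collections import defaultdict


def list_processes():
    """Return lines of `ps -eo user,pid,args` (assumes a procps-style ps)."""
    result = subprocess.run(
        ["ps", "-eo", "user,pid,args"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.splitlines()


def pids_to_kill(ps_lines, pattern="globus-job-manager-script.pl",
                 keep_per_user=1):
    """Pick PIDs whose command line contains `pattern`, sparing
    the `keep_per_user` lowest PIDs for each account."""
    by_user = defaultdict(list)
    for line in ps_lines:
        parts = line.split(None, 2)  # user, pid, full command line
        if len(parts) < 3 or not parts[1].isdigit():
            continue  # header line or something unparseable
        user, pid, args = parts
        if pattern in args:
            by_user[user].append(int(pid))

    doomed = []
    for pids in by_user.values():
        pids.sort()
        doomed.extend(pids[keep_per_user:])  # keep the first few, list the rest
    return sorted(doomed)


# Hypothetical usage: print the candidates and review them before
# running kill by hand.
#     for pid in pids_to_kill(list_processes()):
#         print(pid)
```

Printing the list rather than killing directly is deliberate: the
processes left behind are the ones wanted for lsof and strace, so it is
worth eyeballing the output first.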
--
Dr Graeme Stewart - http://wiki.gridpp.ac.uk/wiki/User:Graeme_stewart
GridPP DM Wiki - http://wiki.gridpp.ac.uk/wiki/Data_Management
ScotGrid - http://www.scotgrid.ac.uk/ http://scotgrid.blogspot.com/