Epilogue:
http://scotgrid.blogspot.com/2007/06/bad-worker-node-bad-bad.html
Checking external networking has to be another basic test ;-)
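
A basic test of that kind might look something like the sketch below. It is only a rough illustration, not the check we actually run: the function names are made up for this example, and which hosts and ports a worker node really needs to reach is an assumption you would have to fill in for your own site.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) opens within timeout."""
    try:
        # create_connection resolves the name and connects in one call;
        # DNS failures, refusals and timeouts all surface as OSError.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def failed_targets(targets, timeout=5.0):
    """Return the (host, port) pairs from targets that could not be reached."""
    return [(h, p) for h, p in targets if not can_connect(h, p, timeout)]
```

Run on each worker node with a list of external endpoints, e.g. failed_targets([("lcg-sam.cern.ch", 8443)]), a non-empty result would flag the node as suspect before it starts eating jobs. That endpoint is just the one mentioned further down this mail, used here as a placeholder.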
g
On 7 Jun 2007, at 12:34, Graeme Stewart wrote:
> Hi All
>
> We seem to be having problems with our gatekeeper at Glasgow, which
> I'm having great trouble getting to the bottom of.
>
> Jobs are being mapped into the batch system correctly, are running
> on functional worker nodes, have exit status 0 from PBS, yet we
> have ended up blacklisted by ATLAS in the FCR, because somewhere
> between the batch system and the RB the job is failing[1]. Note the
> CE is not excessively loaded and does not have full disks or
> blocked NFS mounts.
>
> There are no obvious problems in the PBS server logs. There's no
> correlated undelivered mail lying around on the CE or the worker
> nodes.
>
> (Details here: http://scotgrid.blogspot.com/2007/06/glasgow-ce-flaky.html)
>
> The gatekeeper logs contain all the correct mapping information and
> give no indication of errors.
>
> Where else can I look? I'm stumped.
>
> Cheers
>
> Graeme
>
> PS. I had noticed we seemed to get a ~2-3% failure on Steve's tests
> because of this kind of problem - a sudden unexplained abort, which
> then went away, but it just seems to have got a whole lot worse now.
>
> [1] See the results 'blackhole' from midnight to 1030:
> https://lcg-sam.cern.ch:8443/sam/sam.py?funct=ShowHistory&sensors=CE&vo=atlas&nodename=svr016.gla.scotgrid.ac.uk
>
> --
> Dr Graeme Stewart - http://wiki.gridpp.ac.uk/wiki/User:Graeme_stewart
> ScotGrid - http://www.scotgrid.ac.uk/ http://scotgrid.blogspot.com/
--
Dr Graeme Stewart - http://wiki.gridpp.ac.uk/wiki/User:Graeme_stewart
ScotGrid - http://www.scotgrid.ac.uk/ http://scotgrid.blogspot.com/