Hi all,
We seem to be having problems with our gatekeeper at Glasgow, which
I'm having great trouble getting to the bottom of.
Jobs are being mapped into the batch system correctly, are running on
functional worker nodes, have exit status 0 from PBS, yet we have
ended up blacklisted by ATLAS in the FCR, because somewhere between
the batch system and the RB the job is failing[1]. Note that the CE is
not excessively loaded and does not have full disks or blocked NFS mounts.
There are no obvious problems in the PBS server logs. There's no
correlated undelivered mail lying around on the CE or the worker nodes.
(Details here:
http://scotgrid.blogspot.com/2007/06/glasgow-ce-flaky.html)
The gatekeeper logs contain all the correct mapping information and
give no indication of errors.
Where else can I look? I'm stumped.
Cheers
Graeme
PS. I had noticed we seemed to get a ~2-3% failure rate on Steve's tests
because of this kind of problem - a sudden unexplained abort, which
then went away - but it just seems to have got a whole lot worse now.
[1] See the results 'blackhole' from midnight to 10:30:
https://lcg-sam.cern.ch:8443/sam/sam.py?funct=ShowHistory&sensors=CE&vo=atlas&nodename=svr016.gla.scotgrid.ac.uk
--
Dr Graeme Stewart - http://wiki.gridpp.ac.uk/wiki/User:Graeme_stewart
ScotGrid - http://www.scotgrid.ac.uk/ http://scotgrid.blogspot.com/