Hello everybody,

I'm scratching my head trying to figure out the cause of some 
hammercloud failures at our site and I thought I'd turn to my comrades 
across the nation for help.

We're regularly failing hammercloud tests with the same type of failure 
- athena fails to open a conditions file and then crashes.
Take job:
http://panda.cern.ch/server/pandamon/query?job=*&jobsetID=6138851&user=Johannes%20Elmsheuser&days=0

The oft-misleading log extracts contain these interesting lines:
SysError in : file 
rfio:////dpm/lancs.ac.uk/home/atlas/atlashotdisk/cond10_data/000019/gen/cond10_data.000019.gen.COND/cond10_data.000019.gen.COND._0003.pool.root 
can not be opened for reading (System error)
Error in : Cannot build a TreeIndex with a Tree having no entries

Delving deeper into athena_stdout.txt, I see something that could be interesting:
*snip*
Domain[ROOT_All] Info >   Access   DbDomain     READ      [ROOT_All]
Domain[ROOT_All] Info ->  Access   DbDatabase   READ      [ROOT_All] 
540D53BB-0DCE-DF11-B2D0-000423D2B9E8
Domain[ROOT_All] Info 
rfio:/dpm/lancs.ac.uk/home/atlas/atlashotdisk/cond10_data/000019/gen/cond10_data.000019.gen.COND/cond10_data.000019.gen.COND._0003.pool.root
RootDBase.open Error You cannot open the ROOT file 
[rfio:/dpm/lancs.ac.uk/home/atlas/atlashotdisk/cond10_data/000019/gen/cond10_data.000019.gen.COND/cond10_data.000019.gen.COND._0003.pool.root] 
in mode READ if it does not exists.
StorageSvc Error Cannot connect to Database: 
FID=540D53BB-0DCE-DF11-B2D0-000423D2B9E8 
PFN=rfio:/dpm/lancs.ac.uk/home/atlas/atlashotdisk/cond10_data/000019/gen/cond10_data.000019.gen.COND/cond10_data.000019.gen.COND._0003.pool.root
Domain[ROOT_All] Info >   Deaccess DbDomain     READ      [ROOT_All]
AthenaPoolConverter                                 ERROR poolToObject: 
caught error: Could not connect to the file ( POOL : 
"PersistencySvc::UserDatabase::connectForRead" from "PersistencySvc" )
PixelDetectorManager                                ERROR Cannot find 
AlignableTransformContainer for key /Indet/Align - no misalignment
PixelDetectorManager                                FATAL Unable to 
apply Inner Detector alignments
*snip*
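
For what it's worth, I'd expect to be able to reproduce the open failure 
outside of Athena with something along these lines (just a rough sketch, 
assuming PyROOT with the RFIO plugin is available on the worker node, 
using the PFN straight from the log):

import ROOT

# PFN copied from the failing job's athena_stdout.txt
pfn = ("rfio:/dpm/lancs.ac.uk/home/atlas/atlashotdisk/cond10_data/000019/gen/"
       "cond10_data.000019.gen.COND/cond10_data.000019.gen.COND._0003.pool.root")

# TFile.Open goes through the same RFIO plugin that Athena/POOL uses
f = ROOT.TFile.Open(pfn, "READ")
if not f or f.IsZombie():
    print("open FAILED - same symptom as the hammercloud jobs")
else:
    print("opened OK: %d keys, %d bytes" % (f.GetNkeys(), f.GetSize()))
    f.Close()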

Looked like a bad file to me, or maybe a bad pool node. Except I've 
checked every replica of that file, and they all exist. The pool nodes 
they are on seem healthy. The checksums on all the replicas match (the 
adler32 is fa7e888b), and the rfio logs on all the pools show no errors. 
There are no persistent offenders - hammercloud jobs accessing other 
files work fine on the same nodes these jobs fail on. Routing between 
all the workers and the pool nodes seems fine, and the network is 
uncongested at the moment.
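
(In case anyone wants to repeat the checksum comparison: I'm taking the 
standard zlib-style adler32 of a local copy of each replica - e.g. rfcp'd 
off the pool node to a scratch path, which is just a placeholder below - 
and comparing it to the catalogue value, roughly like this:)

import zlib

def adler32_of(path):
    # Standard adler32, seeded with 1, computed in 1 MB chunks
    checksum = 1
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            checksum = zlib.adler32(chunk, checksum)
    return "%08x" % (checksum & 0xffffffff)

# e.g. after: rfcp .../cond10_data.000019.gen.COND._0003.pool.root /tmp/cond.pool.root
print(adler32_of("/tmp/cond.pool.root"))   # expect fa7e888b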

And now I've run out of ideas - have I fallen victim to a red herring, 
or have I left something out that I can't see for staring too hard?

Much obliged,
Matt