Matt 

I don't have the answer for you, but dq2 agrees that this is the correct checksum for the file:
[ ]	cond10_data.000019.gen.COND._0003.pool.root	540D53BB-0DCE-DF11-B2D0-000423D2B9E8	ad:fa7e888b	570085
and I can copy it and it looks sound.
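
For anyone wanting to repeat that check on a local copy, here is a minimal Python sketch. It assumes the "ad:" prefix in the dq2 listing means adler32 (which matches the value Matt quotes) and that the file has already been copied somewhere readable; the function names are my own and not part of dq2.

```python
import zlib


def adler32_of_file(path, chunk_size=1024 * 1024):
    """Return the Adler-32 of a file as 8 lowercase hex digits."""
    value = 1  # Adler-32 is defined to start from 1
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            # Feed the running value back in so large files are read in pieces
            value = zlib.adler32(chunk, value)
    return format(value & 0xFFFFFFFF, "08x")


def matches_catalogue(path, catalogue_checksum):
    """Compare a local file against a catalogue string such as 'ad:fa7e888b'.

    Assumption: the 'ad:' prefix denotes adler32, as in the listing above.
    """
    algorithm, _, expected = catalogue_checksum.partition(":")
    if algorithm != "ad":
        raise ValueError("only adler32 ('ad:') checksums are handled here")
    return adler32_of_file(path) == expected.lower()
```

Calling matches_catalogue on the local copy with "ad:fa7e888b" should return True if the copy is intact. It only tells you the bytes are right, not that the job can actually open the file over rfio.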

So I conclude there is nothing wrong with the file. But perhaps it shouldn't be looking for it at all. 
It looks to me a bit as though all the test jobs running on "real" data (i.e. data* rather than mc* datasets) fail at Lancs.
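
If you want to test that hunch rather than eyeball it, something like the sketch below would do. It assumes you can get a list of (input dataset name, succeeded) pairs out of the hammercloud or panda results by some means; that input format and the function name are hypothetical, not anything those tools provide.

```python
from collections import defaultdict


def failure_rate_by_prefix(jobs):
    """Group jobs by dataset kind and return the fraction that failed.

    `jobs` is an iterable of (dataset_name, succeeded) pairs; this input
    format is an assumption made for illustration.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for dataset, succeeded in jobs:
        if dataset.startswith("data"):
            kind = "data"
        elif dataset.startswith("mc"):
            kind = "mc"
        else:
            kind = "other"
        totals[kind] += 1
        if not succeeded:
            failures[kind] += 1
    return {kind: failures[kind] / totals[kind] for kind in totals}
```

A failure rate near 1.0 for "data" alongside a low one for "mc" would support the idea that the problem is specific to how real-data jobs fetch their conditions.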

So perhaps (and this is a total guess) it should be getting this from the squid.

Actually, I no longer know how to check squid health, as it seems to have been removed from the SAM tests ...
This link _doesn't_ show the squid test
https://lcg-sam.cern.ch:8443/sam/sam.py?CE_atlas_disp_tests=CE-ATLAS-sft-Frontier-Squid&order=SiteName&funct=ShowSensorTests&disp_status=na&disp_status=ok&disp_status=info&disp_status=note&disp_status=warn&disp_status=error&disp_status=crit&disp_status=maint

Is there actually a Nagios for ATLAS SAM tests now (a side question)?

Sorry, that's not actually helpful... and may send you down an even blinder alley.

Wahid


On 31 May 2011, at 13:12, Matt Doidge wrote:

> Hello everybody,
> 
> I'm scratching my head trying to figure out the cause of some hammercloud failures at our site and I thought I'd turn to my comrades across the nation for help.
> 
> We're regularly failing hammercloud tests with the same type of failure - athena fails to open a conditions file then crashes.
> Take job:
> http://panda.cern.ch/server/pandamon/query?job=*&jobsetID=6138851&user=Johannes%20Elmsheuser&days=0
> 
> The oft misleading log extracts have the interesting lines:
> SysError in : file rfio:////dpm/lancs.ac.uk/home/atlas/atlashotdisk/cond10_data/000019/gen/cond10_data.000019.gen.COND/cond10_data.000019.gen.COND._0003.pool.root can not be opened for reading (System error)
> Error in : Cannot build a TreeIndex with a Tree having no entries
> 
> Delving deeper into the athena_stdout.txt I see what could be interesting:
> *snip*
> Domain[ROOT_All] Info >   Access   DbDomain     READ      [ROOT_All]
> Domain[ROOT_All] Info ->  Access   DbDatabase   READ      [ROOT_All] 540D53BB-0DCE-DF11-B2D0-000423D2B9E8
> Domain[ROOT_All] Info rfio:/dpm/lancs.ac.uk/home/atlas/atlashotdisk/cond10_data/000019/gen/cond10_data.000019.gen.COND/cond10_data.000019.gen.COND._0003.pool.root
> RootDBase.open Error You cannot open the ROOT file [rfio:/dpm/lancs.ac.uk/home/atlas/atlashotdisk/cond10_data/000019/gen/cond10_data.000019.gen.COND/cond10_data.000019.gen.COND._0003.pool.root] in mode READ if it does not exists.
> StorageSvc Error Cannot connect to Database: FID=540D53BB-0DCE-DF11-B2D0-000423D2B9E8 PFN=rfio:/dpm/lancs.ac.uk/home/atlas/atlashotdisk/cond10_data/000019/gen/cond10_data.000019.gen.COND/cond10_data.000019.gen.COND._0003.pool.root
> Domain[ROOT_All] Info >   Deaccess DbDomain     READ      [ROOT_All]
> AthenaPoolConverter                                 ERROR poolToObject: caught error: Could not connect to the file ( POOL : "PersistencySvc::UserDatabase::connectForRead" from "PersistencySvc" )
> PixelDetectorManager                                ERROR Cannot find AlignableTransformContainer for key /Indet/Align - no misalignment
> PixelDetectorManager                                FATAL Unable to apply Inner Detector alignments
> *snip*
> 
> Looked like a bad file to me, or maybe a bad pool node. Except I've checked every replica of that file, and they all exist. The pool nodes that they are on seem healthy. The checksums on all the replicas match (the adler32 is fa7e888b), and the rfio logs on all the pools show no errors. There are no persistent offenders - hammercloud jobs accessing other files work fine on the same nodes these jobs fail on. Routing between all the workers and the pool nodes seems fine, and the network is uncongested at the moment.
> 
> And I've now run out of ideas. Have I fallen victim to a red herring, or have I left something out that I can't see for staring too hard?
> 
> Much obliged,
> Matt
> 


-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.