Thanks Alessandra. I couldn't find any problems with the certs or CRLs,
but you got me thinking about what causes similar errors, so I checked
/etc/shift.conf on the servers. On one of the pools containing this
file it was wrong (yaim had used the internal network name, not the
external one) - I'm not sure how that slipped through the cracks.
Let's see if correcting it makes the problems disappear. In the
meantime I'll scour my pools for similar problems.
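
A sweep like this could be scripted. The following is a minimal sketch
only, assuming shift.conf lines of the form "<daemon> <keyword> host1
host2 ..." where the keyword ends in TRUST. The function name, the sample
hostnames and the domain are hypothetical and are not taken from any real
site configuration.

```python
def find_suspect_hosts(conf_text, external_domain):
    """Return (line_number, hostname) pairs for hosts outside external_domain.

    Assumed layout (not verified against any particular DPM release):
        <DAEMON> <KEYWORD ending in TRUST> host1 host2 ...
    """
    suffix = "." + external_domain.lower()
    suspects = []
    for lineno, line in enumerate(conf_text.splitlines(), start=1):
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        tokens = line.split()
        # Only look at lines whose second field is a *TRUST keyword.
        if len(tokens) < 3 or not tokens[1].endswith("TRUST"):
            continue
        for host in tokens[2:]:
            if not host.lower().endswith(suffix):
                suspects.append((lineno, host))
    return suspects


if __name__ == "__main__":
    # Hypothetical sample; on a real node the text would come from
    # reading /etc/shift.conf.
    sample = (
        "RFIOD TRUST se01.example.ac.uk pool01.example.ac.uk\n"
        "RFIOD WTRUST se01.example.ac.uk pool02.internal.lan\n"
        "DPM PROTOCOLS rfio gsiftp\n"
    )
    for lineno, host in find_suspect_hosts(sample, "example.ac.uk"):
        print(f"line {lineno}: {host} is not under example.ac.uk")
```

Run on each pool node, this would flag any trusted host written with an
internal name rather than the external one; the flagged lines would still
need checking by hand.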
Cheers,
Matt
Alessandra Forti wrote:
> Hi Matt,
>
> the file is healthy; its checksum is the same as the one stored. The
> interesting lines are above the rfio one.
>
> send2nsd: NS002 - send error : client_establish_context: Could not find or use a credential
> send2dpm: DP002 - send error : client_establish_context: Could not find or use a credential
>
> I'd check that all your host certificates and CRLs are in place on all
> the data servers.
>
> cheers
> alessandra
>
> On 31/05/2011 13:12, Matt Doidge wrote:
>> Hello everybody,
>>
>> I'm scratching my head trying to figure out the cause of some
>> hammercloud failures at our site and I thought I'd turn to my comrades
>> across the nation for help.
>>
>> We're regularly failing hammercloud tests with the same type of
>> failure - athena fails to open a conditions file then crashes.
>> Take job:
>> http://panda.cern.ch/server/pandamon/query?job=*&jobsetID=6138851&user=Johannes%20Elmsheuser&days=0
>>
>>
>> The oft misleading log extracts have the interesting lines:
>> SysError in : file
>> rfio:////dpm/lancs.ac.uk/home/atlas/atlashotdisk/cond10_data/000019/gen/cond10_data.000019.gen.COND/cond10_data.000019.gen.COND._0003.pool.root
>> can not be opened for reading (System error)
>> Error in : Cannot build a TreeIndex with a Tree having no entries
>>
>> Delving deeper into the athena_stdout.txt I see what could be
>> interesting:
>> *snip*
>> Domain[ROOT_All] Info > Access DbDomain READ [ROOT_All]
>> Domain[ROOT_All] Info -> Access DbDatabase READ [ROOT_All]
>> 540D53BB-0DCE-DF11-B2D0-000423D2B9E8
>> Domain[ROOT_All] Info
>> rfio:/dpm/lancs.ac.uk/home/atlas/atlashotdisk/cond10_data/000019/gen/cond10_data.000019.gen.COND/cond10_data.000019.gen.COND._0003.pool.root
>> RootDBase.open Error You cannot open the ROOT file
>> [rfio:/dpm/lancs.ac.uk/home/atlas/atlashotdisk/cond10_data/000019/gen/cond10_data.000019.gen.COND/cond10_data.000019.gen.COND._0003.pool.root]
>> in mode READ if it does not exists.
>> StorageSvc Error Cannot connect to Database:
>> FID=540D53BB-0DCE-DF11-B2D0-000423D2B9E8
>> PFN=rfio:/dpm/lancs.ac.uk/home/atlas/atlashotdisk/cond10_data/000019/gen/cond10_data.000019.gen.COND/cond10_data.000019.gen.COND._0003.pool.root
>> Domain[ROOT_All] Info > Deaccess DbDomain READ [ROOT_All]
>> AthenaPoolConverter ERROR
>> poolToObject: caught error: Could not connect to the file ( POOL :
>> "PersistencySvc::UserDatabase::connectForRead" from "PersistencySvc" )
>> PixelDetectorManager ERROR Cannot find
>> AlignableTransformContainer for key /Indet/Align - no misalignment
>> PixelDetectorManager FATAL Unable to
>> apply Inner Detector alignments
>> *snip*
>>
>> Looked like a bad file to me, or maybe a bad pool node. Except I've
>> checked every replica of that file, and they all exist. The pool nodes
>> that they are on seem healthy. The checksums on all the replicas match
>> (the adler32 is fa7e888b), and the rfio logs on all the pools show no
>> errors. There are no persistent offenders - hammercloud jobs accessing
>> other files work fine on the same nodes these jobs fail on. Routing
>> between all the workers and the pool nodes seems fine, and the network
>> is uncongested at the moment.
>>
>> And I've now run out of ideas. Have I fallen victim to a red herring,
>> or have I left something out that I can't see for staring too hard?
>>
>> Much obliged,
>> Matt
>