http://northgrid-tech.blogspot.com/2009/10/squid-cache-32bit.html
The test at the bottom should still be valid.
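
If the blog ever disappears, here is a minimal Python sketch of the same
idea: send a Frontier-style request through the squid and inspect the reply
headers. The squid and Frontier hostnames/ports below are placeholders for
your site's values, and the p1 parameter (the encoded query) is omitted, so
an HTTP error reply is expected - the point is just to see the request pass
through the squid (a Via header, and a line in access.log).

    import urllib.error
    import urllib.request

    SQUID = "http://your-squid.example.ac.uk:3128"   # placeholder squid
    URL = ("http://your-frontier.example.cern.ch:8000/Frontier/Frontier"
           "?type=frontier_request:1:DEFAULT&encoding=BLOBzip")  # p1 omitted

    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": SQUID}))
    try:
        resp = opener.open(URL, timeout=30)
    except urllib.error.HTTPError as err:
        resp = err  # an HTTP error reply still carries the proxy headers
    print("status:", resp.getcode())
    # A Via header naming the squid shows the proxy was in the path; on a
    # repeated identical request X-Cache (if squid sets it) should say HIT.
    for header in ("Via", "X-Cache", "X-Cache-Lookup"):
        print(header + ":", resp.headers.get(header))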
cheers
alessandra
On Tue, 31 May 2011, Matt Doidge wrote:
> Thanks Wahid, turning down a blind alley is still better than smacking your
> head against a brick wall!
>
> Our squid appears to be working, but I don't see worker node IPs or the
> worker nodes' NAT box IP in the access.log - looks like there could be a
> communication problem.
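>
> (A throwaway Python sketch for that check, which just tallies the client
> IPs seen in the access.log - the log path and the client-address-in-third-
> field layout of squid's native log format are assumptions, adjust for your
> install:)
>
>     from collections import Counter
>
>     # Count client IPs that reached the squid; if the worker nodes' NAT
>     # box were talking to it, its address should dominate this list.
>     counts = Counter()
>     with open("/var/log/squid/access.log") as log:  # assumed path
>         for line in log:
>             fields = line.split()
>             if len(fields) >= 3:
>                 counts[fields[2]] += 1  # native format: time elapsed client ...
>
>     for ip, n in counts.most_common(10):
>         print(ip, n)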
>
> Does anyone know if there is a script/tool/simple command that one can use to
> mimic a query to the frontier server?
>
> Thanks again,
> Matt
>
> Wahid Bhimji wrote:
>> Matt, I don't have the answer for you. But dq2 agrees that is the correct
>> checksum for the file:
>> [ ] cond10_data.000019.gen.COND._0003.pool.root 540D53BB-0DCE-DF11-B2D0-000423D2B9E8 ad:fa7e888b 570085
>> and I can copy it and it looks sound.
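>>
>> (If you want to reproduce that checksum on a local copy, a minimal Python
>> sketch - it is just zlib's adler32 run over the file in chunks, printed in
>> the 8-hex-digit form dq2 uses, e.g. fa7e888b:)
>>
>>     import sys
>>     import zlib
>>
>>     # Chunked adler32 so large pool files need not fit in memory.
>>     value = 1  # adler32 seed
>>     with open(sys.argv[1], "rb") as f:
>>         for block in iter(lambda: f.read(1 << 20), b""):
>>             value = zlib.adler32(block, value)
>>     print("%08x" % (value & 0xffffffff))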
>>
>> So I conclude there is nothing wrong with the file. But perhaps it shouldn't
>> be looking for it at all. It looks to me a bit like all the test jobs
>> running on "real" data fail at Lancs (i.e. data* rather than mc* datasets).
>> So perhaps (and this is a total guess) it should be getting this from the
>> squid. Actually I no longer know how to check squid health, as it seems to
>> have been removed from the SAM tests ...
>> This link _doesn't_ show the squid test
>> https://lcg-sam.cern.ch:8443/sam/sam.py?CE_atlas_disp_tests=CE-ATLAS-sft-Frontier-Squid&order=SiteName&funct=ShowSensorTests&disp_status=na&disp_status=ok&disp_status=info&disp_status=note&disp_status=warn&disp_status=error&disp_status=crit&disp_status=maint
>>
>> Is there actually a Nagios for ATLAS SAM tests now (a side question)?
>>
>> Sorry, that's not actually helpful... and may send you down an even blinder
>> alley.
>>
>> Wahid
>>
>>
>> On 31 May 2011, at 13:12, Matt Doidge wrote:
>>
>>> Hello everybody,
>>>
>>> I'm scratching my head trying to figure out the cause of some hammercloud
>>> failures at our site and I thought I'd turn to my comrades across the
>>> nation for help.
>>>
>>> We're regularly failing hammercloud tests with the same type of failure -
>>> athena fails to open a conditions file then crashes.
>>> Take job:
>>> http://panda.cern.ch/server/pandamon/query?job=*&jobsetID=6138851&user=Johannes%20Elmsheuser&days=0
>>>
>>> The oft-misleading log extracts have the interesting lines:
>>> SysError in : file
>>> rfio:////dpm/lancs.ac.uk/home/atlas/atlashotdisk/cond10_data/000019/gen/cond10_data.000019.gen.COND/cond10_data.000019.gen.COND._0003.pool.root
>>> can not be opened for reading (System error)
>>> Error in : Cannot build a TreeIndex with a Tree having no entries
>>>
>>> Delving deeper into the athena_stdout.txt I see something that could be interesting:
>>> *snip*
>>> Domain[ROOT_All] Info -> Access DbDomain READ [ROOT_All]
>>> Domain[ROOT_All] Info -> Access DbDatabase READ [ROOT_All]
>>> 540D53BB-0DCE-DF11-B2D0-000423D2B9E8
>>> Domain[ROOT_All] Info
>>> rfio:/dpm/lancs.ac.uk/home/atlas/atlashotdisk/cond10_data/000019/gen/cond10_data.000019.gen.COND/cond10_data.000019.gen.COND._0003.pool.root
>>> RootDBase.open Error You cannot open the ROOT file
>>> [rfio:/dpm/lancs.ac.uk/home/atlas/atlashotdisk/cond10_data/000019/gen/cond10_data.000019.gen.COND/cond10_data.000019.gen.COND._0003.pool.root]
>>> in mode READ if it does not exists.
>>> StorageSvc Error Cannot connect to Database:
>>> FID=540D53BB-0DCE-DF11-B2D0-000423D2B9E8
>>> PFN=rfio:/dpm/lancs.ac.uk/home/atlas/atlashotdisk/cond10_data/000019/gen/cond10_data.000019.gen.COND/cond10_data.000019.gen.COND._0003.pool.root
>>> Domain[ROOT_All] Info -> Deaccess DbDomain READ [ROOT_All]
>>> AthenaPoolConverter ERROR poolToObject:
>>> caught error: Could not connect to the file ( POOL :
>>> "PersistencySvc::UserDatabase::connectForRead" from "PersistencySvc" )
>>> PixelDetectorManager ERROR Cannot find
>>> AlignableTransformContainer for key /Indet/Align - no misalignment
>>> PixelDetectorManager FATAL Unable to apply
>>> Inner Detector alignments
>>> *snip*
>>>
>>> Looked like a bad file to me, or maybe a bad pool node. Except I've checked
>>> every replica of that file, and they all exist. The pool nodes that they are
>>> on seem healthy. The checksums on all the replicas match (the adler32 is
>>> fa7e888b), and the rfio logs on all the pools show no errors. There are no
>>> persistent offenders - hammercloud jobs accessing other files work fine on
>>> the same nodes these jobs fail on. Routing between all the workers and the
>>> pool nodes seems fine, and the network is uncongested at the moment.
>>>
>>> And I've now run out of ideas. Have I fallen victim to a red herring, or
>>> have I left something out that I can't see for staring too hard?
>>>
>>> Much obliged,
>>> Matt
>>>
>>
>>
>