Thanks Alastair and Alessandra,

Trying the test python script Alessandra pointed me at, things seem to 
work, but there are clashes between the "native" HTTP_PROXY variable 
(for the Lancaster webcache) and what we need to export to point at the 
frontier. We have both HTTP_PROXY and http_proxy set, and after playing 
it seems the capitalised variable takes precedence. This behaviour could 
be mucking up worker node communication to the frontier squid (which is 
otherwise healthy).
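
(A minimal sketch of the kind of check I mean - it just prints whatever 
proxy-related variables the job environment actually carries, so any 
clash between the two spellings is visible in the job output:)

import os

# Show every proxy-related variable the job environment carries, so a
# clash between HTTP_PROXY (the site webcache) and http_proxy (the
# frontier squid) shows up in the job logs.
for name in sorted(os.environ):
    if name.lower() in ("http_proxy", "https_proxy", "no_proxy"):
        print("%s=%s" % (name, os.environ[name]))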

Do UCL also have a web proxy to deal with? If we are both seeing similar 
problems and both have web proxies, that could explain a lot.

Or I could have found another red herring.

Thanks,
Matt


Alastair Dewhurst wrote:
> Hi
> 
> I have been looking at this issue as well, as it seems to be a similar 
> problem to the one at UCL.
> 
> We have a monitoring page at RAL for Frontier access:
> http://ganglia.gridpp.rl.ac.uk/cgi-bin/ganglia-squid/squid-page.pl?r=day&s=normal&.submit=Submit+Query 
> 
> Lancaster's squid appears to be working properly.  Oxford's and 
> Liverpool's squids aren't, and I need to send them a ticket about it 
> (although it's not top priority).
> There are also some useful instructions for testing whether your 
> squid/frontier access is working, which can be found here:
> https://twiki.cern.ch/twiki/bin/view/Atlas/T2SquidDeployment
> If you have problems accessing this page I apologise.  I have asked for 
> it to be made publicly available several times, but every now and then 
> somebody decides to restrict access to ATLAS only....
> 
> 
> However I don't think it's a Frontier/Squid issue (although I am not 
> confident of this).  The error you are getting is:
>>>> PixelDetectorManager                                FATAL Unable to 
>>>> apply Inner Detector alignments
> As far as I am aware detector alignment is always done through flat 
> conditions files in the database release and is not fetched from Frontier 
> (I may be wrong and this may have changed).  Also Lancaster is flipping 
> between online and offline, so it looks like some kind of load issue.
> 
> These tests are new and I don't fully understand them, so I have emailed 
> the people who designed them to see what they think might be the problem.
> 
> Alastair
> 
> 
> 
> 
> 
> On 31 May 2011, at 15:02, Matt Doidge wrote:
> 
>> Thanks Wahid, turning down a blind alley is still better than smacking 
>> your head against a brick wall!
>>
>> Our squid appears to be working, but I don't see worker node IPs or the 
>> worker nodes' NAT box IP in the access.log - it looks like there could be 
>> a communication problem.
>>
>> Does anyone know if there is a script/tool/simple command that one can 
>> use to mimic a query to the frontier server?
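>>
>> (Something like this minimal urllib2 sketch is roughly what I had in 
>> mind - the squid and frontier hostnames below are placeholders, not our 
>> real ones; it just forces an HTTP request through the squid, so it 
>> should at least turn up in access.log:)
>>
>> import urllib2
>>
>> # Placeholder endpoints - substitute the site squid and the frontier
>> # server actually configured for the site.
>> proxy = urllib2.ProxyHandler({"http": "http://squid.example.ac.uk:3128"})
>> opener = urllib2.build_opener(proxy)
>> response = opener.open("http://frontier.example.cern.ch:8000/")
>> print(response.read()[:200])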
>>
>> Thanks again,
>> Matt
>>
>> Wahid Bhimji wrote:
>>> Matt, I don't have the answer for you. But dq2 agrees that is the 
>>> correct checksum for the file:
>>> [ ]    cond10_data.000019.gen.COND._0003.pool.root    
>>> 540D53BB-0DCE-DF11-B2D0-000423D2B9E8    ad:fa7e888b    570085
>>> and I can copy it and it looks sound.
>>> So I conclude there is nothing wrong with the file. But perhaps it 
>>> shouldn't be looking for it at all. It looks to me a bit like all the 
>>> test jobs running on "real" data fail at lancs (i.e. data* rather than 
>>> mc* datasets). So perhaps (and this is a total guess) it should be 
>>> getting this from the squid. Actually I no longer know how to check 
>>> squid health as it seems to be removed from the sam tests ...
>>> This link _doesn't_ show the squid test
>>> https://lcg-sam.cern.ch:8443/sam/sam.py?CE_atlas_disp_tests=CE-ATLAS-sft-Frontier-Squid&order=SiteName&funct=ShowSensorTests&disp_status=na&disp_status=ok&disp_status=info&disp_status=note&disp_status=warn&disp_status=error&disp_status=crit&disp_status=maint 
>>>
>>> Is there actually a nagios for the atlas sam tests now (a side question)?
>>> Sorry, that's not actually helpful... and may send you down an even 
>>> blinder alley.
>>> Wahid
>>> On 31 May 2011, at 13:12, Matt Doidge wrote:
>>>> Hello everybody,
>>>>
>>>> I'm scratching my head trying to figure out the cause of some 
>>>> hammercloud failures at our site and I thought I'd turn to my 
>>>> comrades across the nation for help.
>>>>
>>>> We're regularly failing hammercloud tests with the same type of 
>>>> failure - athena fails to open a conditions file then crashes.
>>>> Take job:
>>>> http://panda.cern.ch/server/pandamon/query?job=*&jobsetID=6138851&user=Johannes%20Elmsheuser&days=0 
>>>>
>>>>
>>>> The oft-misleading log extracts have these interesting lines:
>>>> SysError in : file 
>>>> rfio:////dpm/lancs.ac.uk/home/atlas/atlashotdisk/cond10_data/000019/gen/cond10_data.000019.gen.COND/cond10_data.000019.gen.COND._0003.pool.root 
>>>> can not be opened for reading (System error)
>>>> Error in : Cannot build a TreeIndex with a Tree having no entries
>>>>
>>>> Delving deeper into the athena_stdout.txt I see what could be 
>>>> interesting:
>>>> *snip*
>>>> Domain[ROOT_All] Info >   Access   DbDomain     READ      [ROOT_All]
>>>> Domain[ROOT_All] Info ->  Access   DbDatabase   READ      [ROOT_All] 
>>>> 540D53BB-0DCE-DF11-B2D0-000423D2B9E8
>>>> Domain[ROOT_All] Info 
>>>> rfio:/dpm/lancs.ac.uk/home/atlas/atlashotdisk/cond10_data/000019/gen/cond10_data.000019.gen.COND/cond10_data.000019.gen.COND._0003.pool.root 
>>>>
>>>> RootDBase.open Error You cannot open the ROOT file 
>>>> [rfio:/dpm/lancs.ac.uk/home/atlas/atlashotdisk/cond10_data/000019/gen/cond10_data.000019.gen.COND/cond10_data.000019.gen.COND._0003.pool.root] 
>>>> in mode READ if it does not exists.
>>>> StorageSvc Error Cannot connect to Database: 
>>>> FID=540D53BB-0DCE-DF11-B2D0-000423D2B9E8 
>>>> PFN=rfio:/dpm/lancs.ac.uk/home/atlas/atlashotdisk/cond10_data/000019/gen/cond10_data.000019.gen.COND/cond10_data.000019.gen.COND._0003.pool.root 
>>>>
>>>> Domain[ROOT_All] Info >   Deaccess DbDomain     READ      [ROOT_All]
>>>> AthenaPoolConverter                                 ERROR 
>>>> poolToObject: caught error: Could not connect to the file ( POOL : 
>>>> "PersistencySvc::UserDatabase::connectForRead" from "PersistencySvc" )
>>>> PixelDetectorManager                                ERROR Cannot 
>>>> find AlignableTransformContainer for key /Indet/Align - no misalignment
>>>> PixelDetectorManager                                FATAL Unable to 
>>>> apply Inner Detector alignments
>>>> *snip*
>>>>
>>>> It looked like a bad file to me, or maybe a bad pool node. Except I've 
>>>> checked every replica of that file, and they all exist. The pool nodes 
>>>> that they are on seem healthy. The checksums on all the replicas 
>>>> match (the adler32 is fa7e888b), and the rfio logs on all the pools show 
>>>> no errors. There are no persistent offenders - hammercloud jobs 
>>>> accessing other files work fine on the same nodes these jobs fail 
>>>> on. Routing between all the workers and the pool nodes seems fine, 
>>>> and the network is uncongested at the moment.
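>>>>
>>>> (For the record, this is roughly how I compared the replica checksums 
>>>> against the catalogue value - just a sketch, run over a local copy of 
>>>> the file:)
>>>>
>>>> import zlib
>>>>
>>>> # adler32 of the file, masked to an unsigned value and printed as hex
>>>> # so it can be compared with the catalogue's fa7e888b.
>>>> def adler32_of(path, bufsize=1024 * 1024):
>>>>     value = 1  # standard adler32 seed
>>>>     f = open(path, "rb")
>>>>     chunk = f.read(bufsize)
>>>>     while chunk:
>>>>         value = zlib.adler32(chunk, value)
>>>>         chunk = f.read(bufsize)
>>>>     f.close()
>>>>     return "%08x" % (value & 0xffffffff)
>>>>
>>>> print(adler32_of("cond10_data.000019.gen.COND._0003.pool.root"))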
>>>>
>>>> And I've now run out of ideas - have I fallen victim to a red herring, 
>>>> or have I left something out that I can't see for staring too hard?
>>>>
>>>> Much obliged,
>>>> Matt
>>>>