Hi Alastair,

Yep, the test works from Glasgow's worker nodes. I'd also agree with
Graeme's comments; we should be switched from FZK to RAL for our
backup.

Cheers
Mike

On 17 June 2010 11:49, Alastair Dewhurst <[log in to unmask]> wrote:
> Hi All
>
> After a discussion in today's Thursday phone meeting we have decided the
> following:
>
> 1) If you have been passing the SAM tests and are happy with your current
> setup then no changes will be made that affect your site.
> 2) If you have been failing (getting a warning) on the SAM tests I will
> switch you over to having the RAL Tier 1 as your primary backup.
> 3) If you would prefer to have RAL as your primary backup which will allow
> things to be more easily monitored from the Tier 1 then I will switch you
> over too.
>
> I would appreciate it if all sites, even those that don't want anything
> changed, ran the test, as it proves that direct access works (in case of
> emergency).
>
> Site       : Test (Who ran it)          : SAM  : Preference
> RAL PP     : Passed (Alastair Dewhurst) : ok   : Use Tier 1
> Liverpool  : Passed (Stephen Jones)     : ok   : unknown
> QMUL       : Passed (Chris Walker)      : ok   : unknown
> Cambridge  : Passed (Santanu Das)       : warn : Will be changed
> Sheffield  : Passed (Elena Korolkova)   : ok   : unknown
> RHUL       : Passed (Simon George)      : ok   : unknown
> UCL        : Passed (Ben Waugh)         : ok   : unknown
> Manchester : Not run                    : ok   : Stay the same
> Lancaster  : Not run                    : ok   : Stay the same
> Oxford     : Not run                    : warn : Will be changed
> Birmingham : Not run                    : warn : Will be changed
> Glasgow    : Not run                    : ok   : unknown, although still uses FZK which Graeme Stewart said should be changed.
>
> I am still trying to sort out some new monitoring for the Tier 1 and I will
> send out a confirmation before submitting any request to change
> Tiersofatlas.  If anyone has additional suggestions regarding monitoring
> and chasing up failures, they are very welcome.  As was said at the meeting,
> this is a setup that seems to work very well most of the time; the real
> question is how best to chase up the few problems when they occur without
> creating lots of work for ourselves.
>
> Thanks
>
> Alastair
>
>
> On 17 Jun 2010, at 10:22, Ben Waugh wrote:
>
>> This works for UCL (both HEP and Legion clusters).
>>
>> Cheers,
>> Ben
>>
>> On 16/06/10 12:14, Alastair Dewhurst wrote:
>>>
>>> Hi Santanu
>>> Thank you for spotting that, it should indeed be a capital F.  I thought
>>> I had copied and pasted the commands directly but maybe my mail client
>>> decided to do some formatting.  That should fix most of the problems, as the
>>> Frontier server/squid should be accessible to all.
>>> If we were to make this change, it would not make RAL a single point of
>>> failure.  In order for there to be a failure, both your own squid and RAL
>>> would have to fail.  If RAL fails, your own squid should be set up to access
>>> PIC.  The current situation means that if you and your backup squid fail,
>>> things will break.  (If both RAL and PIC are down then you will also fail
>>> under both systems, but multiple T1 failures should hopefully be rare!)
>>> Alastair
>>> So the new instructions are:
>>> Log into a WN
>>>  > wget http://frontier.cern.ch/dist/fnget.py
>>>  > export http_proxy=http://lcgft-atlas.gridpp.rl.ac.uk:3128
>>>  > python fnget.py
>>> --url=http://lcgft-atlas.gridpp.rl.ac.uk:3128/frontierATLAS/Frontier
>>> --sql="SELECT TABLE_NAME FROM ALL_TABLES"
>>> This should provide a big list of table names and not a python error!
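>>> For anyone who wants to script the check, here is a rough Python sketch
>>> (only an illustration, not part of the official instructions) that wraps
>>> fnget.py and reports pass/fail; it assumes fnget.py has already been
>>> downloaded into the current directory on the WN:
>>>
>>> import os
>>> import subprocess
>>>
>>> PROXY = "http://lcgft-atlas.gridpp.rl.ac.uk:3128"
>>> URL = PROXY + "/frontierATLAS/Frontier"
>>> SQL = "SELECT TABLE_NAME FROM ALL_TABLES"
>>>
>>> # Point http_proxy at the RAL squid, as in the manual test above.
>>> env = dict(os.environ)
>>> env["http_proxy"] = PROXY
>>>
>>> # Run fnget.py exactly as in the instructions and capture its output.
>>> proc = subprocess.Popen(
>>>     ["python", "fnget.py", "--url=" + URL, "--sql=" + SQL],
>>>     env=env, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
>>> output = proc.communicate()[0].decode("utf-8", "replace")
>>>
>>> # Success means a clean exit and table names rather than a Python traceback.
>>> if proc.returncode == 0 and "Traceback" not in output:
>>>     print("Frontier test via RAL squid: PASSED")
>>> else:
>>>     print("Frontier test via RAL squid: FAILED")
>>>     print(output)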
>>> On 16 Jun 2010, at 11:51, Santanu Das wrote:
>>>>
>>>> Hi Alastair and all,
>>>>
>>>> I think there is a typo in the URL: it should be
>>>> "http://lcgft-atlas.gridpp.rl.ac.uk:3128/frontierATLAS/Frontier" *not*
>>>> "frontier" with a small f.  Now it works both with and without an
>>>> http_proxy setting.
>>>>
>>>> [root@farm002 tmp]# unset http_proxy
>>>> [root@farm002 tmp]# python fnget.py --url=http://lcgft-atlas.gridpp.rl.ac.uk:3128/frontierATLAS/Frontier --sql="SELECT count(*) FROM ALL_TABLES"
>>>> Using Frontier URL:  http://lcgft-atlas.gridpp.rl.ac.uk:3128/frontierATLAS/Frontier
>>>> Query: SELECT count(*) FROM ALL_TABLES
>>>> Decode results: True
>>>> Refresh cache: False
>>>> Frontier Request:
>>>> http://lcgft-atlas.gridpp.rl.ac.uk:3128/frontierATLAS/Frontier?type=frontier_request:1:DEFAULT&encoding=BLOBzip&p1=eNoLdvVxdQ5RSM4vzSvR0NJUcAvy91Vw9PGJD3F08nENBgCQ9wjs
>>>> Query started: 06/16/10 11:44:15 BST
>>>> Query ended: 06/16/10 11:44:16 BST
>>>> Query time: 1.34605288506 [seconds]
>>>> Query result:
>>>> <?xml version="1.0" encoding="US-ASCII"?>
>>>> <!DOCTYPE frontier SYSTEM "http://frontier.fnal.gov/frontier.dtd">
>>>> <frontier version="3.22" xmlversion="1.0">
>>>> <transaction payloads="1">
>>>> <payload type="frontier_request" version="1" encoding="BLOBzip">
>>>> <data>eJxjY2Bg4HD2D/UL0dDSZANy2PxCfZ1cg9hBbBYLC2NjdgBW1ATW</data>
>>>> <quality error="0" md5="3c31cc5665b2636e8feb209fafa558f6" records="1" full_size="35"/>
>>>> </payload>
>>>> </transaction>
>>>> </frontier>
>>>> Fields: COUNT(*)  NUMBER
>>>> Records: 8833
>>>>
>>>> Cheers,
>>>> Santanu
>>>>
>>>>
>>>>> Hi
>>>>>
>>>>> The current state of the ATLAS frontier service is not ideal.  The SAM
>>>>> tests:
>>>>>
>>>>> https://lcg-sam.cern.ch:8443/sam/sam.py?CE_atlas_disp_tests=CE-ATLAS-sft-Frontier-Squid&order=SiteName&funct=ShowSensorTests&disp_status=na&disp_status=ok&disp_status=info&disp_status=note&disp_status=warn&disp_status=error&disp_status=crit&disp_status=maint
>>>>> show several production sites getting a warning.  This warning is
>>>>> normally caused by the backup squid not being configured correctly.
>>>>>
>>>>> To remind people: WNs should connect to the local squid (normally at
>>>>> the site), which connects to the Frontier server at RAL.  If the local squid
>>>>> is down then the WN will try to connect to a backup squid, which is meant to
>>>>> be at a nearby site and which will then try to connect to the Frontier server.
>>>>> There is a similar backup process for the Frontier server itself: should the
>>>>> Frontier server at RAL fail, all the squids will try to connect to the
>>>>> Frontier server at PIC.
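>>>>>
>>>>> To illustrate that failover order (this is only a sketch of the logic, not
>>>>> how the Frontier client is actually configured), a small Python snippet can
>>>>> check which squid in the chain is reachable; the first hostname below is a
>>>>> placeholder for a site's own squid:
>>>>>
>>>>> import socket
>>>>>
>>>>> # Squids in the order a WN would fall back through them.  The first entry
>>>>> # is a placeholder for the site's own squid; the second is the RAL squid.
>>>>> PROXIES = [
>>>>>     ("local-squid.example.ac.uk", 3128),
>>>>>     ("lcgft-atlas.gridpp.rl.ac.uk", 3128),
>>>>> ]
>>>>>
>>>>> def reachable(host, port, timeout=5):
>>>>>     """Return True if a TCP connection to host:port succeeds."""
>>>>>     try:
>>>>>         sock = socket.create_connection((host, port), timeout)
>>>>>         sock.close()
>>>>>         return True
>>>>>     except socket.error:
>>>>>         return False
>>>>>
>>>>> for host, port in PROXIES:
>>>>>     if reachable(host, port):
>>>>>         print("Using squid %s:%d" % (host, port))
>>>>>         break
>>>>>     print("Squid %s:%d unreachable, falling back" % (host, port))
>>>>> else:
>>>>>     print("No squid reachable - Frontier access will fail")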
>>>>>
>>>>> To ease this problem it has been suggested that the default backup for
>>>>> Tier 2 sites be the squid at RAL (the Tier 1, not the Tier 2!).  The squid at
>>>>> the Tier 1 is the same installation as the Frontier server, so if the
>>>>> Frontier service goes down so will the backup squid.  This does reduce the
>>>>> resilience of the setup slightly, but I think this is worth it given it
>>>>> should make things significantly simpler to maintain.  It also means I
>>>>> will have to get the SAM test modified slightly.  If, however, there are sites
>>>>> that are happy with the current setup and with managing firewall access to their
>>>>> squid from other sites' worker nodes, then please feel free to respond.
>>>>>
>>>>> Before committing any change to Tiersofatlas I would like sites to run
>>>>> a test to make sure they can indeed successfully access the RAL squid.
>>>>>
>>>>> To do this:
>>>>> Log into a WN
>>>>> > wget http://frontier.cern.ch/dist/fnget.py
>>>>> > export http_proxy=http://lcgft-atlas.gridpp.rl.ac.uk:3128
>>>>> > python fnget.py
>>>>> > --url=http://lcgft-atlas.gridpp.rl.ac.uk:3128/frontierATLAS/frontier
>>>>> > --sql="SELECT TABLE_NAME FROM ALL_TABLES"
>>>>> This should provide a big list of table names and not a python error!
>>>>>
>>>>> Could sites please reply with the results of the test; any comments
>>>>> are also welcome.
>>>>>
>>>>> Thanks
>>>>>
>>>>> Alastair
>>>>
>>
>> --
>> Dr Ben Waugh                                   Tel. +44 (0)20 7679 7223
>> Dept of Physics and Astronomy                  Internal: 37223
>> University College London
>> London WC1E 6BT
>