Hi
The Python error observed means that it can't connect, which in general indicates that I haven't set something up quite right at RAL. I should stress that I won't change anything officially until I am sure it will work. I will have a play around and check that everything is set up correctly.
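For anyone who wants a quick way of telling the two failure modes apart, here is a rough sketch (my own illustration, not part of fnget.py) that separates "the squid could not be reached at all" from "the squid answered but returned an HTTP error such as the 404 in Chris's output below". The first points at local network or firewall problems, the second more likely at the server-side setup:

import urllib2

PROXY = "http://lcgft-atlas.gridpp.rl.ac.uk:3128"
URL = PROXY + "/frontierATLAS/frontier"

# Route the request through the squid explicitly, which is what fnget.py
# does via the http_proxy environment variable.
opener = urllib2.build_opener(urllib2.ProxyHandler({"http": PROXY}))
try:
    opener.open(URL)
    print "Squid and Frontier servlet both reachable"
except urllib2.HTTPError, e:
    # The squid answered with an HTTP error status, so connectivity is
    # fine and the problem is more likely the configuration at RAL.
    print "Squid answered but returned HTTP error", e.code
except urllib2.URLError, e:
    # No HTTP response at all: the squid itself could not be contacted.
    print "Could not connect to the squid:", e.reason

This is only a reachability probe, not a real Frontier query; the fnget.py test quoted below remains the proper check.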
To go back to the question of why we should make this change at all, I will post an extract from a discussion I have been having with the experts. These are the arguments for having everyone fail over to RAL:
> ...
>>> The solution of having all UK Tier 2s failing over to RAL Tier 1
>>> seems like a sensible solution as it simplifies things for sites.
>>> All the problems we have seen recently have been with configuration
>>> issues with the many layers of failovers rather than with sites
>>> actually failing. I was wondering what the opinions of the experts
>>> were regarding this?
>>
>> I think we're all (or at least mostly) of the mindset that T2 sites
>> should fail-over to their local T1 site, and that T1s should
>> therefore
>> permit source traffic from anywhere. As you say, this simplifies the
>> ACL configuration at T1 and reduces the chance that a T2 worker node
>> will be rejected. Also, site access control can be done at the
>> destination level, so that T1 resources are not open to attack or
>> exploit. Oracle resources are already open to the world in a similar
>> manner.
>
> I'm of the strong opinion that T2 squid proxies shouldn't fail over to
> other T2s, because of the difficulty in administration of permissions
> and because of the need to then over-engineer every T2 site to handle
> the full load of other sites. CMS has only one server site, at CERN,
> and all T1s, T2s, and T3s fail over to it. We make that one site
> have lots of extra capacity and watch it carefully to ensure that
> failures do not persist for long.
Catalin and I at the Tier 1 are trying to become familiar with at least most aspects of the Frontier service; we attend the meetings and have contact with the developers. We can monitor the service more effectively at the Tier 1 than at several separate Tier 2 sites. So far the ATLAS Frontier service in the UK has not really been managed very well, and it is only thanks to the Tier 2s doing a very good job of catching what ATLAS is saying (and possibly a bit of luck) that we have had relatively few problems. In future I will also try to make sure that any requests for patches, upgrades etc. get sent to this mailing list.
Alastair
On 16 Jun 2010, at 09:42, Ben Waugh wrote:
> So does the 404 error mean the configuration error is with the RAL
> Squid rather than the local site? I don't mind which backup site we
> use as long as it works!
>
> Cheers,
> Ben
>
> On 16/06/10 09:18, Christopher J.Walker wrote:
>> Alastair Dewhurst wrote:
>>> Hi
>>>
>>> The current state of the ATLAS Frontier service is not ideal. The SAM tests:
>>> https://lcg-sam.cern.ch:8443/sam/sam.py?CE_atlas_disp_tests=CE-ATLAS-sft-Frontier-Squid&order=SiteName&funct=ShowSensorTests&disp_status=na&disp_status=ok&disp_status=info&disp_status=note&disp_status=warn&disp_status=error&disp_status=crit&disp_status=maint
>>>
>>> show several production sites getting a warning. This warning is
>>> normally caused by the backup squid not being configured correctly.
>>>
>>
>> QMUL (and RHUL) are warning because RHUL hasn't got around to
>> configuring a squid after their shutdown.
>>
>> It's true that QMUL's squid wasn't working for them after I upgraded it - at Atlas's request. It's just a simple upgrade and will just work, apparently - though I do wish I'd been told it moved the config files...
>>
>> In fact, that reminds me: which mailing list should I have been on to be told about that request? A concern I have about installing this sort of one-off software is that it doesn't get the routine security updates that SL does.
>>
>>> To remind people: WNs should connect to the local squid (normally at the site), which connects to the Frontier server at RAL. If the local squid is down, the WN will try to connect to a backup squid, which is meant to be at a nearby site and which will then try to connect to the Frontier server. There is a similar backup process on the server side: should the Frontier server at RAL fail, all the squids will try to connect to the Frontier server at PIC.
>>>
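>>> Purely as an illustration of the squid side of that ordering (the site squid names below are made-up placeholders, and this is not the code the WNs actually run), the intended behaviour is roughly:
>>>
>>> import urllib2
>>>
>>> FRONTIER_URL = "http://lcgft-atlas.gridpp.rl.ac.uk:3128/frontierATLAS/frontier"
>>> PROXIES = [
>>>     "http://squid.my-site.ac.uk:3128",       # local site squid (placeholder name)
>>>     "http://squid.nearby-site.ac.uk:3128",   # backup squid at a nearby site (placeholder)
>>> ]
>>>
>>> def working_proxy():
>>>     # Return the first squid that answers at all; any HTTP reply counts,
>>>     # since even an error response shows the squid itself is up.
>>>     for proxy in PROXIES:
>>>         opener = urllib2.build_opener(urllib2.ProxyHandler({"http": proxy}))
>>>         try:
>>>             opener.open(FRONTIER_URL)
>>>             return proxy
>>>         except urllib2.HTTPError:
>>>             return proxy
>>>         except urllib2.URLError:
>>>             pass          # this squid is unreachable, try the next one
>>>     return None
>>>
>>> print working_proxy()
>>>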
>>> To ease this problem it has been suggested that the default backup for Tier 2 sites should be the squid at RAL (the Tier 1, not the Tier 2!). The squid at the Tier 1 is part of the same installation as the Frontier server, so if the Frontier service goes down, so will the backup squid. This does reduce the resilience of the setup slightly, but I think it is worth it given that it should make things significantly simpler to maintain. It also means I will have to get the SAM test modified slightly. If, however, there are sites that are happy with the current setup and with managing firewall access to their squid from other sites' worker nodes, then please feel free to respond.
>>>
>>> Before committing any change to Tiersofatlas I would like sites to run a test to make sure they can indeed successfully access the RAL squid.
>>>
>>> To do this, log into a WN and run:
>>> > wget http://frontier.cern.ch/dist/fnget.py
>>> > export http_proxy=http://lcgft-atlas.gridpp.rl.ac.uk:3128
>>> > python fnget.py --url=http://lcgft-atlas.gridpp.rl.ac.uk:3128/frontierATLAS/frontier --sql="SELECT TABLE_NAME FROM ALL_TABLES"
>>> This should provide a big list of table names and not a Python error!
>>>
>>
>> You mean not like this...
>>
>> [walker@cn456 tmp]$ wget http://frontier.cern.ch/dist/fnget.py
>> --2010-06-16 09:07:11-- http://frontier.cern.ch/dist/fnget.py
>> Resolving frontier.cern.ch... 128.142.202.212
>> Connecting to frontier.cern.ch|128.142.202.212|:80... connected.
>> HTTP request sent, awaiting response... 200 OK
>> Length: 8406 (8.2K) [text/plain]
>> Saving to: `fnget.py'
>>
>> 100%[======================================>] 8,406 --.-K/s in 0.02s
>>
>> 2010-06-16 09:07:11 (434 KB/s) - `fnget.py' saved [8406/8406]
>>
>> [walker@cn456 tmp]$ export http_proxy=http://lcgft-atlas.gridpp.rl.ac.uk:3128
>> [walker@cn456 tmp]$ python fnget.py --url=http://lcgft-atlas.gridpp.rl.ac.uk:3128/frontierATLAS/frontier --sql="SELECT TABLE_NAME FROM ALL_TABLES"
>> Using Frontier URL:
>> http://lcgft-atlas.gridpp.rl.ac.uk:3128/frontierATLAS/frontier
>> Query: SELECT TABLE_NAME FROM ALL_TABLES
>> Decode results: True
>> Refresh cache: False
>>
>> Frontier Request:
>> http://lcgft-atlas.gridpp.rl.ac.uk:3128/frontierATLAS/frontier?type=frontier_request:1:DEFAULT&encoding=BLOBzip&p1=eNoLdvVxdQ5RCHF08nGN93P0dVVwC-L3VXD08YkHiwUDAJs3CTA_
>>
>>
>> Query started: 06/16/10 09:07:25 BST
>> Traceback (most recent call last):
>>   File "fnget.py", line 231, in ?
>>     result = urllib2.urlopen(request).read()
>>   File "/usr/lib64/python2.4/urllib2.py", line 130, in urlopen
>>     return _opener.open(url, data)
>>   File "/usr/lib64/python2.4/urllib2.py", line 364, in open
>>     response = meth(req, response)
>>   File "/usr/lib64/python2.4/urllib2.py", line 471, in http_response
>>     response = self.parent.error(
>>   File "/usr/lib64/python2.4/urllib2.py", line 402, in error
>>     return self._call_chain(*args)
>>   File "/usr/lib64/python2.4/urllib2.py", line 337, in _call_chain
>>     result = func(*args)
>>   File "/usr/lib64/python2.4/urllib2.py", line 480, in http_error_default
>>     raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
>> urllib2.HTTPError: HTTP Error 404: Not Found
>>
>>> Could sites please reply with the results of the test; any comments are also welcome.
>>>
>>>
>
> --
> Dr Ben Waugh                        Tel. +44 (0)20 7679 7223
> Dept of Physics and Astronomy Internal: 37223
> University College London
> London WC1E 6BT