Ewan MacMahon wrote:
>> -----Original Message-----
>> From: Testbed Support for GridPP member institutes [mailto:TB-
>>
>> Hi
>>
>> The current state of the ATLAS frontier service is not ideal. The
>> SAM tests:
>> show several production sites getting a warning. This warning is
>> normally caused by the backup squid not being configured correctly.
>
>> If however there are sites that are happy
>> with the current setup and managing firewall access to their squid
>> from other sites' worker nodes, then please feel free to respond.
>>
> I'm happy in principle with the status quo. In practice however,
> we just unbusted Oxford's configuration as RALPPD's backup the
> other day. Prior to that it had never been right. The only reason
> for that is that we'd never noticed that we were supposed to be
> RALPPD's backup in the first place. If you can tell from testing
> which sites have a problem with their backups, then presumably
> ATLAS has known the entire time that it was broken, but has
> neglected to mention it to us.
>
> Rather than looking at the problems, giving up and trying
> something else, could you not first ask people to fix what we
> already have?

I strongly agree that there is a tendency in the grid to add extra
layers of complexity in order to bodge around problems, rather than
actually doing anything to fix them. This is a bad thing. Even worse,
it tends to hide the original problem: nobody notices until the
failover mechanism starts failing as well, and that makes the whole
thing more difficult to debug.

I also think that we should move towards a situation where RAL isn't a
single point of failure for UK Tier-2s.

However, in this case, I'm (weakly) inclined towards RAL being the
failover. That means I don't have to worry about squid configs allowing
traffic into QMUL. It also means that the failover site is likely to get
exercised on a reasonably regular basis - 20 times as often as an
individual Tier-2 acting as a failover would be.

Chris