Ewan MacMahon wrote:
>> -----Original Message-----
>> From: Testbed Support for GridPP member institutes [mailto:TB-
>>
>> Hi
>>
>> The current state of the ATLAS frontier service is not ideal. The
>> SAM tests:
>> show several production sites getting a warning. This warning is
>> normally caused by the backup squid not being configured correctly.
>
>> If however there are sites that are happy
>> with the current setup and managing firewall access to their squid
>> from other sites worker nodes then please feel free to respond.
>>
> I'm happy in principle with the status quo. In practice however,
> we just unbusted Oxford's configuration as RALPPD's backup the
> other day. Prior to that it's never been right. The only reason
> for that is that we'd never noticed that we were supposed to be
> RALPPD's backup in the first place. If you can tell from testing
> which sites have a problem with their backups, then presumably
> ATLAS has known the entire time that it was broken, but have
> neglected to mention it to us.
>
> Rather than looking at the problems, giving up and trying
> something else, could you not first ask people to fix what we
> already have?

I strongly agree that there is a tendency in the grid to add extra
layers of complexity in order to bodge around problems - rather than
actually doing anything to fix the problems. This is a bad thing.

Even worse, this also tends to hide the original problem - and nobody
notices until the failover mechanism starts failing as well - and that
makes the whole thing more difficult to debug.

I also think that we should move towards the situation where RAL isn't
a single point of failure for UK Tier-2s.

However, in this case, I'm (weakly) inclined towards RAL being the
failover. That means I don't have to worry about squid configs allowing
traffic into QMUL. It also means that the failover site is likely to
get exercised on a reasonably regular basis - 20 times as often as an
individual Tier-2 being failed over to.

Chris