JISCMail - LCG-ROLLOUT Archives

Hi Stephen,

this is related to our previous emails.

Burke, S (Stephen) via RT wrote:
> For real production use by experiments the RB usually looks at a
> VO-specific BDII which has a filtered list of sites, e.g. to take out
> the ones which are failing the monitoring tests. However, we also need
> an all-site BDII (known as the test zone for historical reasons) to
> allow the tests to be run in the first place. Different VOs may well
> have different criteria for considering a site to be good. "Production"
> in the CE status just means that the queue is accepting jobs, it doesn't
> imply that they will run successfully!

The way problem tracking is currently done is nearly humorous... read this:

Our site HG-01-GRNET suddenly went without ATLAS jobs for a few days...
why?

Because there were new CA rpms...
and we did install them promptly...
which the older Site Functional Tests didn't like...
then some magic hand took us out of the ATLAS BDII,
not because *we* had a problem,
but only because the *SFTs* were interpreted the wrong way.

I find SFTs excellent as reference material for debugging, but nothing more.

If what I just described is only the experience of a single site, I apologize
for taking your time. But if not, some procedures have to be reconsidered
if we are to call this a production grid...

--
echo "sysadmin know better bash than english" | sed s/min/mins/ \
        | sed 's/better bash/bash better/' # Yelling in a CERN forum