Dear All,
An older CE (lcgce02.phy.bris.ac.uk, with SL4 HPC WNs) started failing OPS
SAM tests (due to an SE problem) at 9am on 7 July; then at noon on 7 July it
turned grey in SAM tests and orange in Nagios tests.
What does grey in OPS SAM tests mean?
The main test intermittently failing was CE.org.sam.WN-Rep. The problem
has been found (we hope!) on the Storage Element (in fact it affected all
our CEs, since OPS SAM tests include writing to the SE) & is now fixed.
During the SE problems, the other CEs' SAM tests went red (then green as the
SE problems vanished); this older CE went solid grey & stayed grey.
Since the SE fix, the other 2 CEs have turned green (SAM & Nagios) again.
This older CE is still grey in SAM and orange in Nagios.
The errors on this older CE changed from (at 10:53:48 on 07/07/2010):
CRITICAL: METRIC FAILED [org.sam.WN-RepCr-/ops/Role=lcgadmin]:
CRITICAL: File was NOT copied to SE lcgse02.phy.bris.ac.uk and registered in
LFC prod-lfc-shared-central.cern.ch. [ErrDB:[('lcg_util_wn', 'server',
'CRITICAL')]] CLI
(which the other CEs were logging too, as all CEs use the site SE)
to (at 12:23:47 on 07/07/2010):
UNKNOWN: METRIC FAILED [org.sam.WN-RepCr-/ops/Role=lcgadmin]: UNKNOWN:
failed on LFC prod-lfc-shared-central.cern.ch [ErrDB:[('default', 'client',
'UNKNOWN')]]
None of the other CEs do anything like that. This older CE has stayed at
that UNKNOWN ever since. What does this mean, & any advice on how to debug it?
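To make the difference between the two messages easier to see, here is a minimal Python sketch that pulls out the ErrDB field from each. It is not part of SAM or Nagios; the function name parse_errdb is made up for illustration, and it assumes the ErrDB field is always a Python-style list of tuples, as it is in the two messages above. What each tuple element means is not documented here, so the sketch just returns the tuples as they are.

```python
import ast
import re


def parse_errdb(message):
    """Return the list of tuples found in an '[ErrDB:[...]]' field, or [] if absent.

    Hypothetical helper: assumes the field is a Python literal list of tuples,
    as seen in the two messages quoted in this mail.
    """
    # Collapse line wraps so a field split across lines still matches.
    flat = " ".join(message.split())
    match = re.search(r"\[ErrDB:(\[.*?\])\]", flat)
    if match is None:
        return []
    # literal_eval only accepts Python literals, so it does not execute code.
    return ast.literal_eval(match.group(1))


before = """CRITICAL: File was NOT copied to SE lcgse02.phy.bris.ac.uk and registered in
LFC prod-lfc-shared-central.cern.ch. [ErrDB:[('lcg_util_wn', 'server',
'CRITICAL')]] CLI"""

after = """UNKNOWN: failed on LFC prod-lfc-shared-central.cern.ch [ErrDB:[('default', 'client',
'UNKNOWN')]]"""

print(parse_errdb(before))
print(parse_errdb(after))
```

On the two messages above, the first yields a tuple containing 'server' and the second a tuple containing 'client', which is the change seen only on this older CE.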
This error with CE.org.sam.WN-Rep also causes the Nagios test
org.sam.WN-RepCr to be orange, with the error
UNKNOWN: failed on LFC prod-lfc-shared-central.cern.ch
[ErrDB:[('default', 'client', 'UNKNOWN')]]
and the Nagios tests org.sam.WN-RepDel, org.sam.WN-RepGet & org.sam.WN-RepRep
all to show a yellow warning of
WARNING: Masked by org.sam.WN-RepCr-/ops/Role=lcgadmin - "UNKNOWN: failed on
LFC prod-lfc-shared-central.cern.ch [ErrDB:[('default', 'client',
'UNKNOWN')]]"
Grateful for any clues/advice/hints!