On 02/01/13 15:46, John Hill wrote:
> Happy New Year,
> Earlier this afternoon we clocked up some NAGIOS errors in
> org.sam.WN-Rep (among others). I saw similar errors at the 5% level for
> about a week before Christmas (roughly 10-18 December) but they then
> stopped and all was OK over the holiday period until this lunchtime.
QMUL sees a similar pattern.
> I
> can see no problems on the site: the SE looks happy, the issue happens
> on random WNs, and I've made no changes since the start of December
> (other than to perform a rolling upgrade on the WNs). Also when I look
> at NAGIOS I see what appear to be superficially similar issues at other
> sites - but they don't get penalised in the NAGIOS Availability or
> Reliability statistics.
> Given that we've been just as busy over the holiday period as before or
> after it (so the problem is unlikely to be load-related) I'm not
> convinced that the problem is at Cambridge - in which case why are we
> the only ones to get punished?
QMUL is seeing similar-sounding failures in ATLAS tests. I believe the
problem is that the BDII is not returning correct information. See:
https://ggus.eu/ws/ticket_info.php?ticket=89733
The example given is for a WMS, but I see failures that could be caused
by a similar problem with our SE too.
> On the other hand, if someone has a good
> idea where to look to identify the problem, I'd be grateful.
I don't believe you are alone in seeing this. Why it should have been
better over the holiday, I really don't know.
Chris