Hi Kashif, Chris,
Thanks for the feedback. For now I will assume the problem is not
at Cambridge - though I don't understand why we see NAGIOS failures on
http://pprc.qmul.ac.uk/~lloyd/gridpp/nagios.html and other sites don't.
Cheers,
John
On 02/01/2013 16:09, Kashif Mohammad wrote:
> Hi John
>
> I am seeing this error on our CEs as well, and the probable reason is that the RAL topbdii was in an unstable state for half an hour this afternoon.
>
> https://gridppnagios.physics.ox.ac.uk/nagios/cgi-bin/trends.cgi?host=lcgbdii.gridpp.rl.ac.uk&service=org.bdii.Freshness
>
> WN-rep uses lcg-cp to transfer a test file from the WN to storage, and it requests the SE endpoint from the topbdii.
>
> Cheers
> Kashif
>
> -----Original Message-----
> From: Testbed Support for GridPP member institutes [mailto:[log in to unmask]] On Behalf Of John Hill
> Sent: 02 January 2013 15:46
> To: [log in to unmask]
> Subject: Intermittent NAGIOS failures
>
> Happy New Year,
> Earlier this afternoon we clocked up some NAGIOS errors in
> org.sam.WN-Rep (among others). I saw similar errors at the 5% level for
> about a week before Christmas (roughly 10-18 December) but they then
> stopped and all was OK over the holiday period until this lunchtime. I
> can see no problems on the site: the SE looks happy, the issue happens
> on random WNs, and I've made no changes since the start of December
> (other than to perform a rolling upgrade on the WNs). Also when I look
> at NAGIOS I see what appear to be superficially similar issues at other
> sites - but they don't get penalised in the NAGIOS Availability or
> Reliability statistics.
> Given that we've been just as busy over the holiday period as before
> or after it (so the problem is unlikely to be load-related) I'm not
> convinced that the problem is at Cambridge - in which case why are we
> the only ones to get punished? On the other hand, if someone has a good
> idea where to look to identify the problem, I'd be grateful.
>
> John
>