Hi,
We did notice a glitch with the top-level BDIIs earlier today. I am not
especially familiar with the BDIIs, so I did not investigate. The problem
seemed to fix itself after about 10 minutes.
I will ask Catalin about this when he is available.
regards,
John Kelly
AoD RAL tier1
On 02/01/13 16:24, John Hill wrote:
> Hi Kashif, Chris,
> Thanks for the feedback. For now I will assume the problem is not
> at Cambridge - though I don't understand why we see NAGIOS failures on
> http://pprc.qmul.ac.uk/~lloyd/gridpp/nagios.html and other sites don't.
> Cheers,
> John
>
> On 02/01/2013 16:09, Kashif Mohammad wrote:
>> Hi John
>>
>> I am seeing this error on our CEs as well, and the probable reason is that
>> the RAL top BDII was in an unstable state for half an hour this afternoon.
>>
>> https://gridppnagios.physics.ox.ac.uk/nagios/cgi-bin/trends.cgi?host=lcgbdii.gridpp.rl.ac.uk&service=org.bdii.Freshness
>>
>>
>> WN-Rep uses lcg-cp to transfer a test file from the WN to storage, and it
>> requests the SE endpoint from the top BDII.
>>
>> Cheers
>> Kashif
>>
>> -----Original Message-----
>> From: Testbed Support for GridPP member institutes
>> [mailto:[log in to unmask]] On Behalf Of John Hill
>> Sent: 02 January 2013 15:46
>> To: [log in to unmask]
>> Subject: Intermittent NAGIOS failures
>>
>> Happy New Year,
>> Earlier this afternoon we clocked up some NAGIOS errors in
>> org.sam.WN-Rep (among others). I saw similar errors at the 5% level for
>> about a week before Christmas (roughly 10-18 December) but they then
>> stopped and all was OK over the holiday period until this lunchtime. I
>> can see no problems on the site: the SE looks happy, the issue happens
>> on random WNs, and I've made no changes since the start of December
>> (other than to perform a rolling upgrade on the WNs). Also when I look
>> at NAGIOS I see what appear to be superficially similar issues at other
>> sites - but they don't get penalised in the NAGIOS Availability or
>> Reliability statistics.
>> Given that we've been just as busy over the holiday period as before
>> or after it (so the problem is unlikely to be load-related) I'm not
>> convinced that the problem is at Cambridge - in which case why are we
>> the only ones to get punished? On the other hand, if someone has a good
>> idea where to look to identify the problem, I'd be grateful.
>>
>> John
>>