On 02/01/13 16:42, John Kelly wrote:
> Hi,
> We did notice a glitch with the top level bdiis earlier today. I am not
> especially familiar with the bdiis so I did not investigate. The problem
> seemed to have fixed itself after about 10 minutes.
>
Thanks.
Jonathan Perkin from T2K.org reports the same:
"lcg-cr --vo t2k.org -d se03.esc.qmul.ac.uk -P lcgCrTestfile.26031 -l
lfn:/grid/t2k.org/test/lcgCrTestfile.26031 file:lcgCrTestfile.26031
[GFAL][get_storage_path][] [BDII] [g1_sd_get_storage_path_local]: No
GlueSA information found about VO and SE.
lcg_cr: Invalid argument"
He adds: "I'd guess I saw the problem around 13:00±2h", and says that it
hasn't cleared up.
Steve Lloyd's tests see similar errors (though they report being unable
to contact the BDII, rather than not finding information).
E.g.:
SE Test for UKI-NORTHGRID-LIV-HEP_hepgrid11 at 03 Jan 2013 19:45:01
Delete replica from se hepgrid11.ph.liv.ac.uk:
lcg-del -v --vo atlas -s hepgrid11.ph.liv.ac.uk
lfn:/grid/atlas/users/lloyd/setest_UKI-NORTHGRID-LIV-HEP_hepgrid11_03_Jan_2013_19_45_01.dat
VO name: atlas
[BDII][ldap_simple_bind_s][] lcg-bdii.gridpp.ac.uk:2170 > Can't contact LDAP server
[GFAL][bdii_query_send][EINVAL] No accessible BDII
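
To keep the two failure modes apart when going through test logs, something
like the following rough sketch might help. It only matches the message
strings quoted in this mail; the function name and the labels are made up
for illustration, and other GFAL messages will simply come back as "other".

```python
import re

# Message fragments taken from the errors quoted above. The labels are
# hypothetical names chosen for this sketch.
_PATTERNS = [
    ("unreachable", re.compile(r"Can't contact LDAP server|No accessible BDII")),
    ("missing_info", re.compile(r"No GlueSA information found")),
]


def classify_bdii_error(output: str) -> str:
    """Return 'unreachable', 'missing_info' or 'other' for a block of output."""
    # Mail clients and terminals wrap long lines, so collapse all
    # whitespace before matching.
    text = " ".join(output.split())
    for label, pattern in _PATTERNS:
        if pattern.search(text):
            return label
    return "other"
```

Jonathan's output would come back as "missing_info" and the SE test output
as "unreachable".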
> I will ask Catalin about this when he is available.
Thanks. Whilst I don't think this needs precipitate action, it is
causing a significant number of failures, so there is a certain amount
of urgency in fixing it. I don't know how early in January the move to
the SL6 BDII is planned, but if it is late January, I should perhaps
point at one of the other top-level BDIIs by default. We currently fail
over to Imperial, but if we get incorrect information rather than a
failure, that failover won't happen.
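
To illustrate why the failover doesn't help here, a minimal sketch of the
logic as I understand it (this is not GFAL's actual code; the function, the
exception and the treat_empty_as_failure flag are all hypothetical, and the
query is passed in as a callable so that no real LDAP call is assumed):

```python
from typing import Callable, Iterable, List, Tuple


class BdiiUnreachable(Exception):
    """Hypothetical error raised when a BDII cannot be contacted."""


def query_with_failover(
    endpoints: Iterable[str],
    query: Callable[[str], List[str]],
    treat_empty_as_failure: bool = False,
) -> Tuple[str, List[str]]:
    """Try each BDII in order and return (endpoint, records).

    By default only a connection failure moves on to the next endpoint,
    so a BDII that answers with nothing is accepted as authoritative.
    """
    endpoint_list = list(endpoints)
    for endpoint in endpoint_list:
        try:
            records = query(endpoint)
        except BdiiUnreachable:
            continue  # connection failure: try the next BDII
        if records or not treat_empty_as_failure:
            return endpoint, records
        # Empty answer and the caller asked to distrust it: keep going.
    raise BdiiUnreachable(f"no usable BDII among {endpoint_list}")
```

With the default behaviour, a primary that is up but has lost the GlueSA
entries returns an empty answer and the second BDII is never asked.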
Chris
> regards,
>
> John Kelly
> AoD RAL tier1
>
> On 02/01/13 16:24, John Hill wrote:
>> Hi Kashif, Chris,
>> Thanks for the feedback. For now I will assume the problem is not at
>> Cambridge - though I don't understand why we see NAGIOS failures on
>> http://pprc.qmul.ac.uk/~lloyd/gridpp/nagios.html and other sites don't.
>> Cheers,
>> John
>>
>> On 02/01/2013 16:09, Kashif Mohammad wrote:
>>> Hi John
>>>
>>> I am seeing this error on our CEs as well, and the probable reason is
>>> that the RAL top BDII was in an unstable state for half an hour this
>>> afternoon.
>>>
>>> https://gridppnagios.physics.ox.ac.uk/nagios/cgi-bin/trends.cgi?host=lcgbdii.gridpp.rl.ac.uk&service=org.bdii.Freshness
>>>
>>>
>>> WN-rep uses lcg-cp to transfer a test file from the WN to storage, and
>>> it requests the SE endpoint from the top BDII.
>>>
>>> Cheers
>>> Kashif
>>>
>>> -----Original Message-----
>>> From: Testbed Support for GridPP member institutes
>>> [mailto:[log in to unmask]] On Behalf Of John Hill
>>> Sent: 02 January 2013 15:46
>>> To: [log in to unmask]
>>> Subject: Intermittent NAGIOS failures
>>>
>>> Happy New Year,
>>> Earlier this afternoon we clocked up some NAGIOS errors in
>>> org.sam.WN-Rep (among others). I saw similar errors at the 5% level for
>>> about a week before Christmas (roughly 10-18 December) but they then
>>> stopped and all was OK over the holiday period until this lunchtime. I
>>> can see no problems on the site: the SE looks happy, the issue happens
>>> on random WNs, and I've made no changes since the start of December
>>> (other than to perform a rolling upgrade on the WNs). Also when I look
>>> at NAGIOS I see what appear to be superficially similar issues at other
>>> sites - but they don't get penalised in the NAGIOS Availability or
>>> Reliability statistics.
>>> Given that we've been just as busy over the holiday period as before
>>> or after it (so the problem is unlikely to be load-related) I'm not
>>> convinced that the problem is at Cambridge - in which case why are we
>>> the only ones to get punished? On the other hand, if someone has a good
>>> idea where to look to identify the problem, I'd be grateful.
>>>
>>> John
>>>
>