hi all
we've been seeing random SAM failures since yesterday.
for some reason some of our site BDII parameters don't make
it into the central BDII anymore.
we had 3 failures and 7 OKs during the night, and the config
definitely did not change during that time.
https://lcg-sam.cern.ch:8443/sam/sam.py?funct=ShowHistory&sensors=SE&vo=ops&nodename=storage01.lcg.cscs.ch
the SAM test scripts complain that they cannot find
a Glue schema attribute, even though it is published correctly on the site
BDII. for instance, an excerpt from the failing "cr" test at 5am::
+ lcg-cr -v --vo ops file:/home/samops/.same/SE/testFile.txt -l lfn:SE-lcg-cr-storage01.lcg.cscs.ch-1207720006 -d storage01.lcg.cscs.ch
Using grid catalog type: lfc
Using grid catalog : prod-lfc-shared-central.cern.ch
Using LFN : /grid/ops/SAM/SE-lcg-cr-storage01.lcg.cscs.ch-1207720006
sam-bdii.cern.ch:2170: No GlueSEName found for storage01.lcg.cscs.ch
yet the attribute is there, both on the site BDII (ce01) and on the SE's own resource BDII::
[root@ce01 ~]# ldapsearch -x -H ldap://ce01.lcg.cscs.ch:2170/ -b mds-vo-name=CSCS-LCG2,o=grid | grep GlueSEName
GlueSEName: [log in to unmask]:SRM
[root@ce01 ~]# ldapsearch -x -H ldap://storage01.lcg.cscs.ch:2170/ -b mds-vo-name=resource,o=grid | grep GlueSEName
GlueSEName: [log in to unmask]:SRM
any clues? could it be that just one of the many top-level sam-bdii machines at CERN is misbehaving?
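one way to test the "single bad machine" theory would be to query every host behind the alias separately instead of going through the alias. a rough sketch, not run against the production BDIIs. assumptions that are mine and not verified: the alias resolves to several addresses, each one answers LDAP on port 2170, the top-level base is mds-vo-name=local,o=grid, and the SE is published with GlueSEUniqueID equal to its hostname. the function names (resolve_all, count_sename, check_all) are made up for this mail:

```shell
# Assumed defaults; override via the environment if they don't match.
ALIAS="${ALIAS:-sam-bdii.cern.ch}"
SE="${SE:-storage01.lcg.cscs.ch}"
BASE="${BASE:-mds-vo-name=local,o=grid}"   # assumed top-level base DN

# Print every IPv4 address behind a DNS name, one per line.
resolve_all() {
    getent ahostsv4 "$1" | awk '{print $1}' | sort -u
}

# Count the GlueSEName lines one BDII instance returns for one SE.
# "|| true" keeps the exit status at 0 when grep counts zero matches.
count_sename() {
    ldapsearch -x -LLL -H "ldap://$1:2170/" -b "$BASE" \
        "(GlueSEUniqueID=$2)" GlueSEName 2>/dev/null \
        | grep -c '^GlueSEName:' || true
}

# Report OK or MISSING for each address behind the alias.
check_all() {
    for ip in $(resolve_all "$ALIAS"); do
        n=$(count_sename "$ip" "$SE")
        if [ "$n" -gt 0 ]; then
            echo "$ip OK ($n)"
        else
            echo "$ip MISSING"
        fi
    done
}
```

after sourcing this, running check_all a few times should show whether it is always the same address that reports MISSING. note that an instance that is unreachable also shows up as MISSING, so a MISSING line only says "look at this one more closely".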
btw, CMS observed the same behavior yesterday: some of their tags
did not make it into the top-level BDII although they were published correctly in our site BDII,
and as a result jobs failed because no matching resources were found.
peter
--
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
| Dr. Peter Kunszt
| Head of Distributed High Throughput Computing Unit
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
| /\ \ Swiss National Supercomputing Centre CSCS
| \/ / Via Cantonale - Galleria 2
| /\ \ /\ \ 6928 Manno
| \/ / \/ / Switzerland
| /\ \
| \/ / Tel. +41 91 610 8222 Fax. +41 91 610 8282
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++