Hi,
I'd like to announce an activity within the COD group to stop SAM
problems caused by Core Services from being assigned to sites. For
example, if a BDII timeout occurs during a Replica Management test, the
site will not be presented with that failure in the weekly report, and
no SAM alarms or GGUS tickets will be raised.
The approach for identifying such cases is to search for a specific
string in the test's error message, e.g. "ldap_bind: Can't contact LDAP
server", and to report a SAM test code other than 'ERROR'.
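As a rough illustration of this string-matching idea, here is a minimal Python sketch. It is not the actual SAM test code: the names CORE_SERVICE_PATTERNS and classify_result are hypothetical, the only pattern listed is the one quoted above, and the "info" status is the one the Replica Management tests are said to report in this message.

```python
# Hypothetical sketch, not part of SAM: all names here are illustrative.

# Substrings that indicate a Core Service failure rather than a site failure.
# Only the pattern quoted in this message is listed.
CORE_SERVICE_PATTERNS = [
    "ldap_bind: Can't contact LDAP server",  # BDII could not be reached
]


def classify_result(error_message, original_status="error"):
    """Return the status a failed test should report.

    A test that would report "error" is downgraded to "info" when its
    error message contains a known Core Service failure pattern.
    Any other status is returned unchanged.
    """
    if original_status != "error":
        return original_status
    for pattern in CORE_SERVICE_PATTERNS:
        if pattern in error_message:
            return "info"
    return original_status
```

New cases identified later would only need another entry in the pattern list.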
We've already identified a few cases which can be counted as a Core
Service failure and several which cannot. They are all listed on the
wiki page at:
http://goc.grid.sinica.edu.tw/gocwiki/Tools_Improvements_for_COD/FailuresDueToCoreServices
Anybody willing to help identify further Core Service failure cases is
welcome to send an e-mail to me and [log in to unmask] with the details
of the error and, if possible, a link to the SAM results web page
showing it.
So far we have improved the Replica Management tests; the results can be seen at:
https://sam.cyfronet.pl/sam-egee/sam.py?CE_dteam_disp_tests=CE-devel-lcg-rm-del&CE_dteam_disp_tests=CE-devel-lcg-rm-rep&CE_dteam_disp_tests=CE-devel-lcg-rm-cp&CE_dteam_disp_tests=CE-devel-lcg-rm-cr&CE_dteam_disp_tests=CE-devel-lcg-rm-gfal&order=RegionName&funct=ShowSensorTests&disp_status=na&disp_status=ok&disp_status=info&disp_status=note&disp_status=warn&disp_status=error&disp_status=crit&disp_status=maint
Currently they report "info" in case of a Core Service failure instead
of "error".
Any comments from those who want to improve the current state are greatly appreciated.
Regards,
Marcin
CE ROC COD Team