Dear experts,
I am currently looking at the Nagios tests for two of our CEs (both
ARC) that serve the same set of worker nodes. The configuration is
identical on both CEs.
GridPP Nagios tests
(https://gridppnagios.physics.ox.ac.uk/nagios/cgi-bin/status.cgi?navbarsearch=1&host=*bris.ac.uk)
are fine for lcgce01, but (some) fail for lcgce02 with a timeout [1].
SAM Nagios tests
(https://sam-cms-prod.cern.ch/nagios/cgi-bin/status.cgi?navbarsearch=1&host=*bris.ac.uk)
are fine for lcgce02 but fail for lcgce01.
How can a CE fail on one of the Nagios instances but not on the other?
Surely the tests are identical except for the grid user (ops vs cms).
I had Andrea's certificate (SAM Nagios) temporarily mapped to an ops
admin account (instead of cms), but that did not make a difference.
Any ideas how to debug this?
Since lcgce01 disappeared from the BDII for a day, the SAM Nagios tests
have been reset. The BDII issue is now fixed. According to [2], the
outage might be the reason for the SAM failures (the WMS has a stale
image of the information system). What do you think?
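
One way to check the stale-information idea would be to ask the top-level BDII directly what it currently publishes for each CE. The following is only a minimal sketch: it builds (but does not run) an ldapsearch command, and it assumes that the CEs publish GLUE 1.3 entries and that the BDII listens on the usual port 2170 with base o=grid. The host name top-bdii.example.org and the helper bdii_query are hypothetical placeholders; the pattern *lcgce01* stands in for the full CE host name.

```python
import shlex


def bdii_query(bdii_host, ce_pattern, port=2170, base="o=grid"):
    """Build an ldapsearch argv list for GLUE 1.3 CE entries matching ce_pattern.

    Assumption: the CE is published with GLUE 1.3 attributes; adjust the
    filter and attribute names if only GLUE 2 is published.
    """
    return [
        "ldapsearch", "-x", "-LLL",
        "-H", f"ldap://{bdii_host}:{port}",
        "-b", base,
        f"(GlueCEUniqueID={ce_pattern})",
        # Attributes of interest: queue ID, state, and which VOs may submit.
        "GlueCEUniqueID", "GlueCEStateStatus", "GlueCEAccessControlBaseRule",
    ]


# Hypothetical BDII host; substitute the top-level BDII that the WMS uses.
cmd = bdii_query("top-bdii.example.org", "*lcgce01*")
print(shlex.join(cmd))
```

Running the printed command for both lcgce01 and lcgce02 and comparing the access control rules for ops and cms should show whether the two CEs look different to the broker.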
Cheers,
Luke
[1] Test: emi.cream.glexec.CREAMCE-JobSubmit-/ops/Role=pilot
CRITICAL: [Waiting->Cancelled [timeout/dropped]] 'BrokerHelper: no compatible resources'.
https://lcglb02.gridpp.rl.ac.uk:9000/H81u-19VOKlA8bMkas2-FQ
Testing from: gridppnagios.physics.ox.ac.uk
DN: /C=UK/O=eScience/OU=Oxford/L=OeSC/CN=kashif mohammad/CN=Robot:GridClient/CN=proxy
VOMS FQANs: /ops/Role=pilot/Capability=NULL,
/ops/NGI/Role=NULL/Capability=NULL,
/ops/NGI/UK/Role=NULL/Capability=NULL, /ops/Role=NULL/Capability=NULL
glite-wms-job-status https://lcglb02.gridpp.rl.ac.uk:9000/H81u-19VOKlA8bMkas2-FQ
======================= glite-wms-job-status Success =====================
BOOKKEEPING INFORMATION:
Status info for the Job : https://lcglb02.gridpp.rl.ac.uk:9000/H81u-19VOKlA8bMkas2-FQ
Current Status: Waiting
Status Reason: BrokerHelper: no compatible resources
Submitted: Thu May 29 12:22:06 2014 BST
=========================================================================
45 min timeout for status [Waiting] exceeded. Cancelling the job.
glite-wms-job-cancel --noint https://lcglb02.gridpp.rl.ac.uk:9000/H81u-19VOKlA8bMkas2-FQ
Connecting to the service https://lcgwms04.gridpp.rl.ac.uk:7443/glite_wms_wmproxy_server
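
To compare several test jobs quickly, the status block above can be reduced to its key fields. This is a rough sketch only: parse_job_status is a hypothetical helper, and it assumes the "Current Status", "Status Reason" and "Submitted" lines always look like the ones in the output above.

```python
import re


def parse_job_status(output):
    """Extract selected fields from glite-wms-job-status text output."""
    fields = {}
    pattern = re.compile(r"\s*(Current Status|Status Reason|Submitted):\s*(.*\S)\s*$")
    for line in output.splitlines():
        match = pattern.match(line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


# Sample lines copied from the job status output above.
sample = """\
Current Status: Waiting
Status Reason: BrokerHelper: no compatible resources
Submitted: Thu May 29 12:22:06 2014 BST
"""

info = parse_job_status(sample)
# A job stuck in matchmaking: still Waiting because the broker found no CE.
stuck = (info.get("Current Status") == "Waiting"
         and "no compatible resources" in info.get("Status Reason", ""))
print(info, stuck)
```

A job flagged this way never reached the CE at all, which points at the information the WMS holds rather than at the CE configuration.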
[2] https://wiki.egi.eu/wiki/Tools/Manuals/TS53
--
*********************************************************
Dr Lukasz Kreczko +44 (0)117 928 8724
CMS Group
School of Physics
University of Bristol
*********************************************************