Hi Andrew,
Thanks for the clarification. Is there an official way to request a
refresh of the WMS information (/etc/init.d/glite-wms-wm restart)?
Cheers,
Luke
On 30 May 2014 13:50, Andrew Lahiff wrote:
> Hi Luke,
>
> One major difference is that the CMS SUM tests use a different WMS from the Ops tests. With a CMS proxy, the RAL WMSs seem to match only lcgce01, whereas the WMS used by the CMS SUM tests matches only lcgce02 (*). Maybe both WMSs hold stale (but different) information.
>
> Thanks,
> Andrew.
>
> (*)
> -bash-4.1$ glite-wms-job-list-match -c glite-wms-1.conf -a sleeper1.jdl | grep bris
> - lcgce03.phy.bris.ac.uk:8443/cream-pbs-express
> - lcgce03.phy.bris.ac.uk:8443/cream-pbs-express
> - lcgce03.phy.bris.ac.uk:8443/cream-pbs-long
> - lcgce03.phy.bris.ac.uk:8443/cream-pbs-long
> - lcgce03.phy.bris.ac.uk:8443/cream-pbs-short
> - lcgce03.phy.bris.ac.uk:8443/cream-pbs-short
> - lcgce04.phy.bris.ac.uk:8443/cream-pbs-express
> - lcgce04.phy.bris.ac.uk:8443/cream-pbs-express
> - lcgce04.phy.bris.ac.uk:8443/cream-pbs-long
> - lcgce04.phy.bris.ac.uk:8443/cream-pbs-long
> - lcgce04.phy.bris.ac.uk:8443/cream-pbs-short
> - lcgce04.phy.bris.ac.uk:8443/cream-pbs-short
> - lcgce02.phy.bris.ac.uk:2811/nordugrid-Condor-gridAMD
> - lcgce02.phy.bris.ac.uk:2811/nordugrid-Condor-gridIntel
> - lcgce03.phy.bris.ac.uk:8443/cream-pbs-medium
> - lcgce03.phy.bris.ac.uk:8443/cream-pbs-medium
> - lcgce04.phy.bris.ac.uk:8443/cream-pbs-medium
> - lcgce04.phy.bris.ac.uk:8443/cream-pbs-medium
> -bash-4.1$ glite-wms-job-list-match -a sleeper1.jdl | grep bris
> - lcgce03.phy.bris.ac.uk:8443/cream-pbs-express
> - lcgce03.phy.bris.ac.uk:8443/cream-pbs-express
> - lcgce03.phy.bris.ac.uk:8443/cream-pbs-long
> - lcgce03.phy.bris.ac.uk:8443/cream-pbs-long
> - lcgce03.phy.bris.ac.uk:8443/cream-pbs-short
> - lcgce03.phy.bris.ac.uk:8443/cream-pbs-short
> - lcgce04.phy.bris.ac.uk:8443/cream-pbs-express
> - lcgce04.phy.bris.ac.uk:8443/cream-pbs-express
> - lcgce04.phy.bris.ac.uk:8443/cream-pbs-long
> - lcgce04.phy.bris.ac.uk:8443/cream-pbs-long
> - lcgce04.phy.bris.ac.uk:8443/cream-pbs-short
> - lcgce04.phy.bris.ac.uk:8443/cream-pbs-short
> - lcgce01.phy.bris.ac.uk:2811/nordugrid-Condor-gridIntel
> - lcgce01.phy.bris.ac.uk:2811/nordugrid-Condor-gridAMD
> - lcgce03.phy.bris.ac.uk:8443/cream-pbs-medium
> - lcgce03.phy.bris.ac.uk:8443/cream-pbs-medium
> - lcgce04.phy.bris.ac.uk:8443/cream-pbs-medium
> - lcgce04.phy.bris.ac.uk:8443/cream-pbs-medium
>
>
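A minimal sketch of how the two listings above could be compared programmatically, assuming the output of each glite-wms-job-list-match call has been saved as text. The function names (matched_hosts, compare_listings) and the shortened sample strings are hypothetical and only for illustration; the sketch does not call any gLite tool.

```python
from __future__ import annotations


def matched_hosts(listing: str, site: str = "bris") -> set[str]:
    """Return the CE hostnames in a list-match listing that contain `site`."""
    hosts = set()
    for line in listing.splitlines():
        line = line.strip()
        # Resource lines are assumed to look like "- host:port/queue".
        if line.startswith("- ") and site in line:
            ce_id = line[2:].strip()
            hosts.add(ce_id.split(":", 1)[0])
    return hosts


def compare_listings(listing_a: str, listing_b: str, site: str = "bris") -> dict[str, set[str]]:
    """Split the matched hosts into those seen by only one WMS and by both."""
    a = matched_hosts(listing_a, site)
    b = matched_hosts(listing_b, site)
    return {"only_a": a - b, "only_b": b - a, "both": a & b}


# Shortened samples taken from the two listings in this thread.
WMS_A = """\
 - lcgce03.phy.bris.ac.uk:8443/cream-pbs-express
 - lcgce02.phy.bris.ac.uk:2811/nordugrid-Condor-gridAMD
"""
WMS_B = """\
 - lcgce03.phy.bris.ac.uk:8443/cream-pbs-express
 - lcgce01.phy.bris.ac.uk:2811/nordugrid-Condor-gridIntel
"""

if __name__ == "__main__":
    for key, hosts in compare_listings(WMS_A, WMS_B).items():
        print(key, sorted(hosts))
```

Hosts that appear under only_a or only_b are the ones on which the two WMSs disagree, which is where a stale view of the information system would show up.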
> ________________________________________
> From: L Kreczko
> Sent: Friday, May 30, 2014 1:35 PM
> To: [log in to unmask]
> Subject: Debugging 'BrokerHelper: no compatible resources'.
>
> Dear experts,
>
> I am currently looking at the Nagios tests for two of our CEs (both
> ARC) that serve the same set of worker nodes. The configuration is
> identical on both CEs.
>
> GridPP Nagios tests
> (https://gridppnagios.physics.ox.ac.uk/nagios/cgi-bin/status.cgi?navbarsearch=1&host=*bris.ac.uk)
> are fine for lcgce01, but (some) fail for lcgce02 with a timeout [1].
> SAM Nagios tests
> (https://sam-cms-prod.cern.ch/nagios/cgi-bin/status.cgi?navbarsearch=1&host=*bris.ac.uk)
> are fine for lcgce02 but fail for lcgce01.
>
> How can a CE fail on one of the Nagios instances but not on the
> other? Surely the tests are identical except for the grid user (ops vs
> cms).
>
> I had Andrea's certificate (SAM Nagios) temporarily mapped to an ops
> admin account (instead of cms), but that did not make a difference.
> Any ideas on how to debug this?
>
> Since lcgce01 disappeared from the BDII for a day, the SAM Nagios tests
> have been reset. The BDII issue is now fixed. According to [2], this might
> be the reason for the SAM failures (the WMS has a stale image of the
> information system). What do you think?
>
> Cheers,
> Luke
>
> [1] Test: emi.cream.glexec.CREAMCE-JobSubmit-/ops/Role=pilot
>
> CRITICAL: [Waiting->Cancelled [timeout/dropped]] 'BrokerHelper: no
> compatible resources'.
> https://lcglb02.gridpp.rl.ac.uk:9000/H81u-19VOKlA8bMkas2-FQ
> CRITICAL: [Waiting->Cancelled [timeout/dropped]] 'BrokerHelper: no
> compatible resources'.
> https://lcglb02.gridpp.rl.ac.uk:9000/H81u-19VOKlA8bMkas2-FQ
>
> Testing from: gridppnagios.physics.ox.ac.uk
> DN: /C=UK/O=eScience/OU=Oxford/L=OeSC/CN=kashif
> mohammad/CN=Robot:GridClient/CN=proxy
> VOMS FQANs: /ops/Role=pilot/Capability=NULL,
> /ops/NGI/Role=NULL/Capability=NULL,
> /ops/NGI/UK/Role=NULL/Capability=NULL, /ops/Role=NULL/Capability=NULL
> glite-wms-job-status https://lcglb02.gridpp.rl.ac.uk:9000/H81u-19VOKlA8bMkas2-FQ
>
>
> ======================= glite-wms-job-status Success =====================
> BOOKKEEPING INFORMATION:
>
> Status info for the Job :
> https://lcglb02.gridpp.rl.ac.uk:9000/H81u-19VOKlA8bMkas2-FQ
> Current Status: Waiting
> Status Reason: BrokerHelper: no compatible resources
> Submitted: Thu May 29 12:22:06 2014 BST
> ==========================================================================
> 45 min timeout for status [Waiting] exceeded. Cancelling the job.
> glite-wms-job-cancel --noint
> https://lcglb02.gridpp.rl.ac.uk:9000/H81u-19VOKlA8bMkas2-FQ
>
> Connecting to the service
> https://lcgwms04.gridpp.rl.ac.uk:7443/glite_wms_wmproxy_server
>
>
> [2]
> https://wiki.egi.eu/wiki/Tools/Manuals/TS53
> --
> *********************************************************
> Dr Lukasz Kreczko +44 (0)117 928 8724
> CMS Group
> School of Physics
> University of Bristol
> *********************************************************