Hi Maarten,
heplnx206 was certainly in a bad state again. The query also failed locally,
but after restarting the gLite services it works both locally and from CERN.
I don't know whether this is the same problem or a different one, since this
time even local connections seemed to be failing.
In some previous incidents we have certainly been blacklisted by one WMS
for long periods while happily accepting jobs from another.
Compare the CMS SAM test results for Sept 05-07[1] with the Ops ones[2].
(Ignore the problem on the morning of 5th Sept; that was the RAL site
network failing, which also took the T1 offline.)
It doesn't help that something doesn't seem to have noticed that we've
reinstalled heplnx207.pp.rl.ac.uk as a CreamCE as well, so the CERN Expt
SAM tests keep trying and failing to run the CE profile against it. As a
result it seems not to get any work, and we have two CreamCEs doing the
work of three.
Yours,
Chris.
[1]
http://dashb-nagios-cms-dev.cern.ch/dashboard/request.py/testhistory?siteSelect3=T2&sites=T2_UK_SGrid_RALPP&serviceTypeSelect3=all&services=CREAMCE&tests=CREAMCE-org.cms.WN-analysis&tests=CREAMCE-org.cms.WN-basic&tests=CREAMCE-org.cms.WN-frontier&tests=CREAMCE-org.cms.WN-mc&tests=CREAMCE-org.cms.WN-squid&tests=CREAMCE-org.cms.WN-swinst&tests=CREAMCE-org.cms.glexec.WN-gLExec&tests=CREAMCE-org.sam.CREAMCE-JobSubmit&tests=CREAMCE-org.sam.glexec.CREAMCE-JobSubmit&servicename=heplnx206.pp.rl.ac.uk&timeRange=individual&start=2011-09-05&end=2011-09-07
http://bit.ly/qJf1BM
[2]
https://gridppnagios.physics.ox.ac.uk/myegi/history/?facelist_values_regions=&facelist_values_sites=&facelist_values_services=6253%2C&profile=5-1&monitored=2&status=1&status=2&status=3&status=4&status=5&startdate=05-09-2011&enddate=07-09-2011
http://bit.ly/nQovQ9
> -----Original Message-----
> From: [log in to unmask] [mailto:[log in to unmask]]
> Sent: 08 September 2011 13:52
> To: Brew, Chris (STFC,RAL,PPD)
> Cc: [log in to unmask]; [log in to unmask]
> Subject: Re: [LCG-ROLLOUT] CreamCEs keep getting blacklisted by WMS
>
> Hi Chris,
>
> > I'm not sure but certainly the NGI_UK nagios server was submitting
> > jobs against it which were succeeding and from when I was looking in
> > the logs for any evidence of a problem other jobs were coming in and
> > ending up on the queue.
> >
> > [...]
> > >
> > >So is the CE blacklisted only for the submissions done by a specific
> > >WMS (while everything works properly for jobs submitted by other WMSes)?
> >
> > As far as I can tell, yes. That's what makes it so hard to detect and
> > debug - if it wasn't the CERN WMSes used for the Experiment SAM tests
> > and the CMS JobRobot I would not have spotted it until someone ticketed
> > me.
>
> There definitely is a problem with your CEs as seen from CERN:
>
> ---------------------------------------------------------------------------
> $ ./chk-svc-cert heplnx206.pp.rl.ac.uk 8443
> socket: Connection timed out
> connect:errno=29
> unable to load certificate
> 23598:error:0906D06C:PEM routines:PEM_read_bio:no start line:pem_lib.c:647:
> Expecting: TRUSTED CERTIFICATE
> ---------------------------------------------------------------------------
> $ ./chk-svc-cert heplnx142.pp.rl.ac.uk 8443
> getaddrinfo: Name or service not known
> connect:errno=2
> unable to load certificate
> 23669:error:0906D06C:PEM routines:PEM_read_bio:no start line:pem_lib.c:647:
> Expecting: TRUSTED CERTIFICATE
> ---------------------------------------------------------------------------
>
> And a bit later:
>
> ---------------------------------------------------------------------------
> $ ./chk-svc-cert heplnx206.pp.rl.ac.uk 8443
> write:errno=104
> unable to load certificate
> 23749:error:0906D06C:PEM routines:PEM_read_bio:no start line:pem_lib.c:647:
> Expecting: TRUSTED CERTIFICATE
> ---------------------------------------------------------------------------
>
> Error 104 == ECONNRESET.
>
> Compare with another CE in your area (lines wrapped for clarity):
>
> ---------------------------------------------------------------------------
> $ chk-svc-cert t2ce06.physics.ox.ac.uk 8443
> depth=2 /C=UK/O=eScienceRoot/OU=Authority/CN=UK e-Science Root
> verify return:1
> depth=1 /C=UK/O=eScienceCA/OU=Authority/CN=UK e-Science CA
> verify return:1
> depth=0 /C=UK/O=eScience/OU=Oxford/L=OeSC/CN=t2ce06.physics.ox.ac.uk/
> [log in to unmask]
> verify return:1
> DONE
> notBefore=Sep 7 10:22:43 2010 GMT
> notAfter=Oct 7 10:22:43 2011 GMT
> ---------------------------------------------------------------------------
>
> The script is attached.
>
> This looks like a network problem close to your site: traffic to/from port
> 8443 has a very bad quality of service or may even be dropped altogether
> once the connection has been established...
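
For anyone who wants to repeat this kind of check without the attached
script (which is not reproduced in this message), here is a minimal Python
sketch. It is not chk-svc-cert and makes no claim to match what that script
does internally; it only tries to tell apart the failure modes seen above
(name lookup failure, connect timeout, reset during the handshake) from a
handshake that completes. The function name probe_tls_port and the strings
it returns are invented for illustration, and certificate verification is
deliberately switched off because the aim is to test reachability of the
port, not trust in the certificate.

```python
import socket
import ssl


def probe_tls_port(host, port=8443, timeout=10.0):
    """Try a TLS handshake with host:port and describe the outcome.

    Hypothetical helper: returns a short status string and does not
    raise for ordinary network failures.
    """
    try:
        # Name lookup plus TCP connect. A bad host name fails here.
        sock = socket.create_connection((host, port), timeout=timeout)
    except socket.gaierror as exc:
        return "dns-failure: %s" % exc
    except socket.timeout:
        return "connect-timeout"
    except ConnectionRefusedError:
        return "connection-refused"
    except OSError as exc:
        return "connect-error: %s" % exc

    # Assumption: we only care whether the handshake gets through,
    # so the certificate chain is not verified here.
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    try:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            der = tls.getpeercert(binary_form=True)
            size = len(der) if der else 0
            return "handshake-ok: %d-byte certificate" % size
    except ConnectionResetError:
        # Corresponds to ECONNRESET: the TCP connection was established
        # but torn down while the handshake was in progress.
        return "reset-during-handshake"
    except socket.timeout:
        return "handshake-timeout"
    except (ssl.SSLError, OSError) as exc:
        return "handshake-error: %s" % exc
```

Calling probe_tls_port("heplnx206.pp.rl.ac.uk") from both a local machine
and a remote one, and comparing the two strings, would give a rough
equivalent of the local-versus-CERN comparison discussed in this thread.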