Also a good point. I didn't think the cache would cache for remote users, only for
those inside (like Chris), but other sites may well have their own web caches. I
remember having had that problem before.
Cheers
--jens
-----Original Message-----
From: Testbed Support for GridPP member institutes on behalf of Kelsey, DP (David)
Sent: Wed 21/05/2008 09:51
To: [log in to unmask]
Subject: Re: New LCG CA release 1.21: breaks site
Dear all,
Is it possible that many of the unexplained "it suddenly started
working" events are due to web caches holding the UK root CA CRL? In RAL
PPD, Chris was having big problems with fetch-crl for the new UK root CA
CRL: it constantly complained "failed to verify", which left the old
CRL in place. The conclusion in the end was that the RAL web proxy had
cached an old CRL; once the cache was flushed it all worked OK. If this
is hitting other sites, then at some point their cached copy expires and
all of a sudden everything works.
This is just one of the potential problems which gets muddled up with
sites which failed to update the CA, sites which updated the CA but
failed to pull new CRLs, and user and host certs issued by the old UK CA
which were not affected by the update.
Dave
------------------------------------------------
Dr David Kelsey
Particle Physics Department
Rutherford Appleton Laboratory
Chilton, DIDCOT, OX11 0QX, UK
e-mail: [log in to unmask]
Tel: [+44](0)1235 445746 (direct)
Fax: [+44](0)1235 446733
------------------------------------------------
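
One way to test for the stale-proxy case described above is to refetch the CRL while asking intermediate caches to revalidate with the origin server. The following is a minimal Python sketch, assuming the site proxy honours the standard HTTP Cache-Control and Pragma request headers. The helper names build_uncached_request and fetch_crl are hypothetical (they are not part of fetch-crl), and the CRL URL is left to the caller.

```python
import urllib.request


def build_uncached_request(url):
    """Build a GET request that asks HTTP caches to revalidate with the origin.

    Hypothetical helper for illustration; not part of fetch-crl.
    """
    return urllib.request.Request(
        url,
        headers={
            "Cache-Control": "no-cache",  # HTTP/1.1 caches should revalidate
            "Pragma": "no-cache",         # equivalent hint for HTTP/1.0 proxies
        },
    )


def fetch_crl(url, timeout=30):
    """Download the CRL bytes, requesting that caches do not serve a stored copy."""
    request = build_uncached_request(url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

If the bytes returned this way differ from what the site's normal fetch retrieves, a cache in between is probably serving an old copy.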
> -----Original Message-----
> From: Testbed Support for GridPP member institutes
> [mailto:[log in to unmask]] On Behalf Of Simon George
> Sent: 20 May 2008 21:57
> To: [log in to unmask]
> Subject: Re: New LCG CA release 1.21: breaks site
>
> Hi,
>
> as far as I am concerned, js magically started passing mid-morning.
> Neither Duncan nor I are aware of anything we did. We are ok
> now. More details follow.
>
> At mid-morning, the js test began to pass and I saw that the
> ca test was giving a warning. I realised that although I
> updated the lcg-CA certs (by rsync of the certs files from
> my worker node image to each node) the RPM database on each
> node was not updated. Because the test checks this first, it
> was fooled into thinking I had not updated. So I rsynced the
> RPM db from the worker node image to all the worker nodes too
> and the ca test started passing.
>
> This led me to realise that when the js was failing
> overnight, the situation on the worker nodes was:
> crls updated
> lcg-CA certs updated
> rpm DB not updated
>
> Should this result in js failing? Or was there some other
> problem that was fixed mid-morning?
>
> Cheers,
> Simon
>
> Kelsey, DP (David) wrote:
> > Graeme, Simon,
> >
> > Is it now understood why Durham and RHUL were experiencing problems
> > and how it was fixed?
> >
> > Dave
> >
> >> -----Original Message-----
> >> From: Testbed Support for GridPP member institutes
> >> [mailto:[log in to unmask]] On Behalf Of Graeme Stewart
> >> Sent: 19 May 2008 23:29
> >> To: [log in to unmask]
> >> Subject: Re: New LCG CA release 1.21: breaks site
> >>
> >> Hi John
> >>
> >> That was a RAID failure on their SE - not related.
> >>
> >> Having forced a CRL update across the Durham cluster they are still
> >> failing SAM tests, so we don't seem to be out of the woods yet...
> >>
> >> g
> >>
> >> On Mon, May 19, 2008 at 11:11 PM, Gordon, JC (John)
> >> <[log in to unmask]> wrote:
> >>> Graeme, do we know that this was CA related? Durham were failing
> >>> overnight Sunday too.
> >>>
> >>> John
> >>>
> >>>> -----Original Message-----
> >>>> From: Testbed Support for GridPP member institutes
> >>>> [mailto:[log in to unmask]] On Behalf Of Graeme Stewart
> >>>> Sent: 19 May 2008 22:00
> >>>> To: [log in to unmask]
> >>>> Subject: Re: New LCG CA release 1.21: breaks site
> >>>>
> >>>> On Mon, May 19, 2008 at 8:54 PM, Jensen, J (Jens) <[log in to unmask]>
> >>>> wrote:
> >>>>> Hi Graeme,
> >>>>>
> >>>>> I know for the Moz NSS bug, it is because as part of the SSL
> >>>>> negotiation, the server (or client, doesn't matter) sends its trusted
> >>>>> certificates to the peer saying "look this is my cert" and the peer
> >>>>> says "wot? I thought it looked like this?"
> >>>>>
> >>>>> But OpenSSL and stuff derived from OpenSSL does not work like this;
> >>>>> they may or may not send intermediate certificates in the negotiation
> >>>>> but all that matters is that the trust chain can be built, which of
> >>>>> course they can be either way.
> >>>>>
> >>>>> Maybe it's something more obvious. Like CRLs that haven't been
> >>>>> refreshed when you install the 1.21 release. You folk in Glasgow have
> >>>>> probably been Good Eggs(tm) as usual and refreshed your CRLs.
> >>>>
> >>>> I upgraded one UI first (not our main one) and checked that fetch-crl
> >>>> worked - so that there was nothing basically wrong with the CA
> >>>> release. Then, after I had upgraded the CE I refreshed the CRLs by
> >>>> hand. Because of the way our site infrastructure works all the other
> >>>> machines then copy their CRLs from the CE (via a simple mirror - no
> >>>> complicated SSL thingamybobs...).
> >>>>
> >>>> I can actually tell when Durham broke from the ATLAS pilot submission
> >>>> logs:
> >>>>
> >>>> http://svr017.gla.scotgrid.ac.uk/factory/logs/2008-05-19/ce01.dur.scotgrid.ac.uk_2119_jobmanager-lcgpbs-q3d/SubmissionLog
> >>>>
> >>>> I should say they broke for my submission before I had touched
> >>>> anything at Glasgow re. the update.
> >>>>
> >>>> I now see a very weird effect. I can globus-job-run from one Glasgow
> >>>> UI to Durham ok, but not from the other...
> >>>>
> >>>> g
> >>>>
>