One of the last things still broken is that the host SAM uses for
SE/SRM tests hasn't had its certificates updated:
https://gus.fzk.de/ws/ticket_info.php?ticket=36626
g
On Wed, May 21, 2008 at 9:51 AM, Kelsey, DP (David) <[log in to unmask]> wrote:
> Dear all,
>
> Is it possible that many of the unexplained "it suddenly started
> working" events are due to web caches holding the UK root CA CRL? In
> RAL PPD, Chris was having big problems with fetch-crl for the new UK
> root CA CRL, where it constantly complained "failed to verify". This
> then left the old CRL in place. The conclusion in the end was that the
> RAL web proxy had an old CRL - once flushed it all worked OK. If this
> is hitting other sites, then at some point the cached copy expires and
> all of a sudden everything works.
>
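
The web-cache theory above is cheap to test: fetch the CRL once normally and once while asking intermediate caches to revalidate, then compare the two. A minimal sketch using only the Python standard library. The helper names are made up for illustration, and whether a particular proxy honours these request headers is an assumption to check on site.

```python
import urllib.request


def build_uncached_request(url):
    """Build a GET request that asks intermediate caches to revalidate."""
    return urllib.request.Request(
        url,
        headers={
            "Cache-Control": "no-cache",  # HTTP/1.1 request directive
            "Pragma": "no-cache",         # HTTP/1.0 equivalent
        },
    )


def fetch_uncached(url, timeout=30):
    """Fetch url, asking caches on the path not to serve a stored copy."""
    request = build_uncached_request(url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def proxy_serves_stale_copy(url, timeout=30):
    """Return True if a plain fetch and a revalidated fetch differ.

    urllib picks up the http_proxy environment variable if it is set,
    so both fetches go through the site proxy when one is configured.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        plain = response.read()
    return plain != fetch_uncached(url, timeout=timeout)
```

Calling `proxy_serves_stale_copy()` with the CRL URL that fetch-crl uses for the UK root CA (not reproduced here) and getting `True` would point at the proxy rather than at the node.
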
> This is just one of the potential problems which gets muddled up with
> sites which failed to update the CA, sites which updated the CA but
> failed to pull new CRLs, and user and host certs issued by the old UK CA
> which were not affected by the update.
>
> Dave
>
>
> ------------------------------------------------
> Dr David Kelsey
> Particle Physics Department
> Rutherford Appleton Laboratory
> Chilton, DIDCOT, OX11 0QX, UK
>
> e-mail: [log in to unmask]
> Tel: +44 (0)1235 445746 (direct)
> Fax: +44 (0)1235 446733
> ------------------------------------------------
>
>
>
>
>> -----Original Message-----
>> From: Testbed Support for GridPP member institutes
>> [mailto:[log in to unmask]] On Behalf Of Simon George
>> Sent: 20 May 2008 21:57
>> To: [log in to unmask]
>> Subject: Re: New LCG CA release 1.21: breaks site
>>
>> Hi,
>>
>> as far as I am concerned, js magically started passing mid-morning.
>> Neither Duncan nor I are aware of anything we did. We are ok
>> now. More details follow.
>>
>> At mid-morning, the js test began to pass and I saw that the
>> ca test was giving a warning. I realised that although I
>> updated the lcg-CA certs (by rsync of the certs files from
>> my worker node image to each node) the RPM database on each
>> node was not updated. Because the test checks this first, it
>> was fooled into thinking I had not updated. So I rsynced the
>> RPM db from the worker node image to all the worker nodes too
>> and the ca test started passing.
>>
>> This led me to realise that when the js was failing
>> overnight, the situation on the worker nodes was:
>> crls updated
>> lcg-CA certs updated
>> rpm DB not updated
>>
>> Should this result in js failing? Or was there some other
>> problem that was fixed mid-morning?
>>
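
One way to take the RPM database out of the picture is to compare the certificate directory on a node directly against the one in the worker node image. A minimal sketch, assuming both trees are reachable as local paths. The function names are hypothetical and this is not what the SAM ca test itself does.

```python
import hashlib
from pathlib import Path


def tree_digests(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        # is_file() follows symlinks, so hash-named links to certificates
        # are compared by the content they point at.
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def compare_trees(reference, node):
    """Report files that are missing, extra or different on the node."""
    ref = tree_digests(reference)
    cur = tree_digests(node)
    common = set(ref) & set(cur)
    return {
        "missing": sorted(set(ref) - set(cur)),
        "extra": sorted(set(cur) - set(ref)),
        "changed": sorted(name for name in common if ref[name] != cur[name]),
    }
```

For example, `compare_trees("/path/to/image/etc/grid-security/certificates", "/etc/grid-security/certificates")`, where the image path is a placeholder. Three empty lists mean the files match the image whatever the RPM database says.
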
>> Cheers,
>> Simon
>>
>> Kelsey, DP (David) wrote:
>> > Graeme, Simon,
>> >
>> > Is it now understood why Durham and RHUL were experiencing problems
>> > and how it was fixed?
>> >
>> > Dave
>> >
>> >> -----Original Message-----
>> >> From: Testbed Support for GridPP member institutes
>> >> [mailto:[log in to unmask]] On Behalf Of Graeme Stewart
>> >> Sent: 19 May 2008 23:29
>> >> To: [log in to unmask]
>> >> Subject: Re: New LCG CA release 1.21: breaks site
>> >>
>> >> Hi John
>> >>
>> >> That was a RAID failure on their SE - not related.
>> >>
>> >> Having forced a CRL update across the Durham cluster they are
>> >> still failing SAM tests, so we don't seem to be out of the woods
>> >> yet...
>> >>
>> >> g
>> >>
>> >> On Mon, May 19, 2008 at 11:11 PM, Gordon, JC (John)
>> >> <[log in to unmask]> wrote:
>> >>> Graeme, do we know that this was CA related? Durham were failing
>> >>> overnight Sunday too.
>> >>>
>> >>> John
>> >>>
>> >>>> -----Original Message-----
>> >>>> From: Testbed Support for GridPP member institutes
>> >>>> [mailto:[log in to unmask]] On Behalf Of Graeme Stewart
>> >>>> Sent: 19 May 2008 22:00
>> >>>> To: [log in to unmask]
>> >>>> Subject: Re: New LCG CA release 1.21: breaks site
>> >>>>
>> >>>> On Mon, May 19, 2008 at 8:54 PM, Jensen, J (Jens)
>> >>>> <[log in to unmask]> wrote:
>> >>>>> Hi Graeme,
>> >>>>>
>> >>>>> I know for the Moz NSS bug, it is because as part of the SSL
>> >>>>> negotiation, the server (or client, doesn't matter) sends its
>> >>>>> trusted certificates to the peer saying "look this is my cert"
>> >>>>> and the peer says "wot? I thought it looked like this?"
>> >>>>> But OpenSSL and stuff derived from OpenSSL do not work like
>> >>>>> this; they may or may not send intermediate certificates in
>> >>>>> the negotiation but all that matters is that the trust chain
>> >>>>> can be built, which of course it can be either way.
>> >>>>> Maybe it's something more obvious. Like CRLs that haven't been
>> >>>>> refreshed when you install the 1.21 release. You folk in
>> >>>>> Glasgow have probably been Good Eggs(tm) as usual and
>> >>>>> refreshed your CRLs.
>> >>>> I upgraded one UI first (not our main one) and checked that
>> >>>> fetch-crl worked - so that there was nothing basically wrong
>> >>>> with the CA release. Then, after I had upgraded the CE, I
>> >>>> refreshed the CRLs by hand. Because of the way our site
>> >>>> infrastructure works, all the other machines then copy their
>> >>>> CRLs from the CE (via a simple mirror - no complicated SSL
>> >>>> thingamybobs...).
>> >>>>
>> >>>> I can actually tell when Durham broke from the ATLAS pilot
>> >>>> submission logs:
>> >>>>
>> >>>> http://svr017.gla.scotgrid.ac.uk/factory/logs/2008-05-19/ce01.dur.scotgrid.ac.uk_2119_jobmanager-lcgpbs-q3d/SubmissionLog
>> >>>>
>> >>>> I should say they broke for my submission before I had touched
>> >>>> anything at Glasgow re. the update.
>> >>>>
>> >>>> I now see a very weird effect. I can globus-job-run from one
>> >>>> Glasgow UI to Durham OK, but not from the other...
>> >>>>
>> >>>> g
>> >>>>
>>
>