Hi,
As far as I can tell, the js test magically started passing mid-morning.
Neither Duncan nor I is aware of anything we did to fix it. We are OK
now. More details follow.
Mid-morning, the js test began to pass and I saw that the ca test was
giving a warning. I realised that although I had updated the lcg-CA
certs (by rsyncing the cert files from my worker node image to each
node), the RPM database on each node had not been updated. Because the
test checks the RPM database first, it was fooled into thinking I had
not updated. So I rsynced the RPM db from the worker node image to all
the worker nodes too, and the ca test started passing.
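To make the mismatch concrete, here is a toy sketch in Python. It is not
the real ca test: the function name ca_check, the order of the checks and
the release strings are assumptions for illustration only, and "1.20"
merely stands in for whatever release was installed before 1.21.

```python
def ca_check(rpm_db_release, cert_files_release, expected_release):
    """Toy model of a check that trusts the RPM database before the files.

    Hypothetical: the real test's logic is not known here.
    """
    # The RPM database is consulted first. If it reports an old release,
    # the check warns without ever looking at the cert files on disk.
    if rpm_db_release != expected_release:
        return "warn: RPM database reports " + rpm_db_release
    # Only reached when the RPM database already looks up to date.
    if cert_files_release != expected_release:
        return "warn: cert files are from " + cert_files_release
    return "ok"


# Worker nodes overnight: cert files rsynced, RPM db left behind.
print(ca_check("1.20", "1.21", "1.21"))
# After rsyncing the RPM db from the worker node image as well.
print(ca_check("1.21", "1.21", "1.21"))
```

The first call warns even though the files on disk are current, which
matches the behaviour described above; the second passes once both views
agree.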
This led me to realise that when the js test was failing overnight, the
situation on the worker nodes was:
CRLs updated
lcg-CA certs updated
RPM DB not updated
Should this result in js failing? Or was there some other problem that
was fixed mid-morning?
Cheers,
Simon
Kelsey, DP (David) wrote:
> Graeme, Simon,
>
> Is it now understood why Durham and RHUL were experiencing problems and
> how it was fixed?
>
> Dave
>
>
> ------------------------------------------------
> Dr David Kelsey
> Particle Physics Department
> Rutherford Appleton Laboratory
> Chilton, DIDCOT, OX11 0QX, UK
>
> e-mail: [log in to unmask]
> Tel: [+44](0)1235 445746 (direct)
> Fax: [+44](0)1235 446733
> ------------------------------------------------
>
>
>
>
>> -----Original Message-----
>> From: Testbed Support for GridPP member institutes
>> [mailto:[log in to unmask]] On Behalf Of Graeme Stewart
>> Sent: 19 May 2008 23:29
>> To: [log in to unmask]
>> Subject: Re: New LCG CA release 1.21: breaks site
>>
>> Hi John
>>
>> That was a RAID failure on their SE - not related.
>>
>> Having forced a CRL update across the Durham cluster they are
>> still failing SAM tests, so we don't seem to be out of the
>> woods yet...
>>
>> g
>>
>> On Mon, May 19, 2008 at 11:11 PM, Gordon, JC (John)
>> <[log in to unmask]> wrote:
>>> Graeme, do we know that this was CA related? Durham were failing
>>> overnight Sunday too.
>>>
>>> John
>>>
>>>> -----Original Message-----
>>>> From: Testbed Support for GridPP member institutes
>>>> [mailto:[log in to unmask]] On Behalf Of Graeme Stewart
>>>> Sent: 19 May 2008 22:00
>>>> To: [log in to unmask]
>>>> Subject: Re: New LCG CA release 1.21: breaks site
>>>>
>>>> On Mon, May 19, 2008 at 8:54 PM, Jensen, J (Jens)
>>>> <[log in to unmask]> wrote:
>>>>> Hi Graeme,
>>>>>
>>>>> I know for the Moz NSS bug, it is because as part of the SSL
>>>>> negotiation, the server (or client, doesn't matter) sends its
>>>>> trusted certificates to the peer saying "look this is my cert"
>>>>> and the peer says "wot? I thought it looked like this?"
>>>>> But OpenSSL and stuff derived from OpenSSL does not work like
>>>>> this; they may or may not send intermediate certificates in the
>>>>> negotiation but all that matters is that the trust chain can be
>>>>> built, which of course they can be either way.
>>>>> Maybe it's something more obvious. Like CRLs that haven't been
>>>>> refreshed when you install the 1.21 release. You folk in Glasgow
>>>>> have probably been Good Eggs(tm) as usual and refreshed your CRLs.
>>>>
>>>> I upgraded one UI first (not our main one) and checked that
>>>> fetch-crl worked - so that there was nothing basically wrong with
>>>> the CA release. Then, after I had upgraded the CE I refreshed the
>>>> CRLs by hand. Because of the way our site infrastructure works all
>>>> the other machines then copy their CRLs from the CE (via a simple
>>>> mirror - no complicated SSL thingamybobs...).
>>>>
>>>> I can actually tell when Durham broke from the ATLAS pilot
>>>> submission logs:
>>>>
>>>> http://svr017.gla.scotgrid.ac.uk/factory/logs/2008-05-19/ce01.dur.scotgrid.ac.uk_2119_jobmanager-lcgpbs-q3d/SubmissionLog
>>>>
>>>> I should say they broke for my submission before I had touched
>>>> anything at Glasgow re. the update.
>>>>
>>>> I now see a very weird effect. I can globus job run from one
>>>> Glasgow UI to Durham ok, but not from the other...
>>>>
>>>> g
>>>>