Hi,

I'm unsure exactly how/why Durham was fixed... it just seemed to 
happen.  This is what we tried...

With the site upgraded to lcg-CA 1.21 the site was broken.  About 
1:30-2pm we downgraded back to lcg-CA 1.20, and then upgraded again to 
1.21.  We also tried to clear out all of /etc/grid-security/certificates 
and then reinstall (yum remove lcg-CA ca_* ; rm -rf 
/etc/grid-security/certificates/ ; yum install lcg-CA) and then run 
fetch-crl.  We only did this on the CE.
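
For anyone wanting to confirm that the CRLs really were refreshed after a
reinstall like this, here is a rough Python sketch.  It assumes the CRLs
live as *.r0 files in /etc/grid-security/certificates and that a file's
modification time reflects its last refresh.  The function name stale_crls
and the 24-hour threshold are made up for illustration; they are not part
of lcg-CA or fetch-crl.

```python
import os
import time


def stale_crls(cert_dir="/etc/grid-security/certificates",
               max_age_hours=24.0, now=None):
    """Return the sorted names of *.r0 files older than max_age_hours.

    Hypothetical helper: it only looks at file modification times, so it
    assumes the mtime of a CRL file reflects when it was last refreshed.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    stale = []
    for name in sorted(os.listdir(cert_dir)):
        # Only CRL files are of interest; skip CA certs, signing policies, etc.
        if not name.endswith(".r0"):
            continue
        if os.path.getmtime(os.path.join(cert_dir, name)) < cutoff:
            stale.append(name)
    return stale
```

Calling stale_crls() on the CE straight after running fetch-crl should
return an empty list if every CRL was rewritten; any names it does return
are files that were not touched within the threshold.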

It's possible that a combination of the above fixed the site.  However, we 
had tried the same steps during the morning and they didn't fix anything, 
so I'm afraid it's inconclusive.  Maybe even an external upgrade to a UI 
or site fixed it?

Thanks,
Phil



Kelsey, DP (David) wrote:
> Graeme, Simon,
>
> Is it now understood why Durham and RHUL were experiencing problems and
> how it was fixed?
>
> Dave
>
>
> ------------------------------------------------
> Dr David Kelsey
> Particle Physics Department
> Rutherford Appleton Laboratory
> Chilton, DIDCOT, OX11 0QX, UK
>
> e-mail: [log in to unmask]
> Tel: [+44](0)1235 445746 (direct)
> Fax: [+44](0)1235 446733
> ------------------------------------------------
>
>
>  
>
>   
>> -----Original Message-----
>> From: Testbed Support for GridPP member institutes 
>> [mailto:[log in to unmask]] On Behalf Of Graeme Stewart
>> Sent: 19 May 2008 23:29
>> To: [log in to unmask]
>> Subject: Re: New LCG CA release 1.21: breaks site
>>
>> Hi John
>>
>> That was a RAID failure on their SE - not related.
>>
>> Having forced a CRL update across the Durham cluster they are 
>> still failing SAM tests, so we don't seem to be out of the 
>> woods yet...
>>
>> g
>>
>> On Mon, May 19, 2008 at 11:11 PM, Gordon, JC (John) 
>> <[log in to unmask]> wrote:
>>     
>>> Graeme, do we know that this was CA related? Durham were failing 
>>> overnight Sunday too.
>>>
>>> John
>>>
>>>       
>>>> -----Original Message-----
>>>> From: Testbed Support for GridPP member institutes 
>>>> [mailto:[log in to unmask]] On Behalf Of Graeme Stewart
>>>> Sent: 19 May 2008 22:00
>>>> To: [log in to unmask]
>>>> Subject: Re: New LCG CA release 1.21: breaks site
>>>>
>>>> On Mon, May 19, 2008 at 8:54 PM, Jensen, J (Jens)
>>>> <[log in to unmask]> wrote:
>>>>
>>>>> Hi Graeme,
>>>>>
>>>>> I know for the Moz NSS bug, it is because as part of the SSL
>>>>> negotiation, the server (or client, doesn't matter) sends its trusted
>>>>> certificates to the peer saying "look this is my cert" and the peer
>>>>> says "wot? I thought it looked like this?"
>>>>> But OpenSSL and stuff derived from OpenSSL does not work like this;
>>>>> they may or may not send intermediate certificates in the negotiation
>>>>> but all that matters is that the trust chain can be built, which of
>>>>> course they can be either way.
>>>>> Maybe it's something more obvious.  Like CRLs that haven't been
>>>>> refreshed when you install the 1.21 release.  You folk in Glasgow have
>>>>> probably been Good Eggs(tm) as usual and refreshed your CRLs.
>>>>
>>>> I upgraded one UI first (not our main one) and checked that fetch-crl
>>>> worked - so that there was nothing basically wrong with the CA
>>>> release. Then, after I had upgraded the CE I refreshed the CRLs by
>>>> hand. Because of the way our site infrastructure works all the other
>>>> machines then copy their CRLs from the CE (via a simple mirror - no
>>>> complicated SSL thingamybobs...).
>>>>
>>>> I can actually tell when Durham broke from the ATLAS pilot submission
>>>> logs:
>>>>
>>>> http://svr017.gla.scotgrid.ac.uk/factory/logs/2008-05-19/ce01.dur.scotgrid.ac.uk_2119_jobmanager-lcgpbs-q3d/SubmissionLog
>>>>
>>>> I should say they broke for my submission before I had touched 
>>>> anything at Glasgow re. the update.
>>>>
>>>> I now see a very weird effect. I can globus-job-run from one Glasgow
>>>> UI to Durham ok, but not from the other...
>>>>
>>>> g
>>>>
>>>>