At Lancaster:
[root@fal-pygrid-30 srmv1]# grep "CGSI-gSOAP: Error reading token data header: Connection closed" log|cut -d\( -f2|cut -d\) -f1|sort -u
fal-pygrid-17.lancs.ac.uk
monb002.cern.ch
monb003.cern.ch
niels004.tier2.hep.manchester.ac.uk
sam111.cern.ch
[root@fal-pygrid-30 srmv1]# grep "CGSI-gSOAP: Error reading token data header: Connection closed" log|cut -d\( -f2|cut -d\) -f1|wc -l
209
So we're seeing a lot more of these errors than everyone else, roughly three
times as many as Glasgow. Perhaps that's why this is affecting our tests so
regularly?
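
A quick sanity check on the timing theory is to bin these errors by hour and
see whether they pile up around the cron runs. A rough sketch, assuming each
srmv1 log line starts with a "MM/DD HH:MM:SS" timestamp (check the format and
adjust the awk fields if yours differs):

grep "CGSI-gSOAP: Error reading token data header: Connection closed" log \
  | awk '{split($2, t, ":"); n[t[1]]++} END {for (h in n) print h, n[h]}' \
  | sort -n
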
cheers,
Matt
Sam Skipsey wrote:
> At Glasgow:
>
> On Fri, 9 Jan 2009 13:56:44 +0000, Greig A. Cowan <[log in to unmask]> wrote:
>
>
>> As a test, can people do the following in /var/log/srmv1:
>>
>> [root@srm srmv1]# grep "CGSI-gSOAP: Error reading token data header: Connection closed" log|cut -d\( -f2|cut -d\) -f1|sort -u
>>
>
> monb002.cern.ch
> monb003.cern.ch
> sam111.cern.ch
> svr031.gla.scotgrid.ac.uk
>
>
>> [root@srm srmv1]# grep "CGSI-gSOAP: Error reading token data header: Connection closed" log|cut -d\( -f2|cut -d\) -f1|wc -l
>>
>
> 68
>
>
>> This shows the number of times this error has cropped up at Edinburgh
>> today. We never really see SAM failures with this error message, though,
>> so for us this behaviour is somewhat OK and "normal".
>>
>> Graeme, I notice that svr031 pops up a lot. I think this is the Glasgow
>> UI, right?
>>
>>
>
> It's used for rather a lot of things, one of them being our UI, yes.
>
> I note that we've also tried doing what Alessandra's page for the Manchester
> DPM said:
> http://www.gridpp.ac.uk/wiki/Manchester_DPM
> (basically, have the pool nodes trust each other as well as the head and
> themselves)
> and that doesn't seem to have changed anything.
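>
> For reference, the change is roughly of this shape in /etc/shift.conf on the
> head node and each disk server (hostnames here are placeholders, not our
> real ones):
>
> # each pool node is added to the trust lists alongside the head node,
> # so the disk servers trust one another as well as the head:
> RFIOD TRUST  se-head.example.ac.uk pool01.example.ac.uk pool02.example.ac.uk
> RFIOD WTRUST se-head.example.ac.uk pool01.example.ac.uk pool02.example.ac.uk
> RFIOD RTRUST se-head.example.ac.uk pool01.example.ac.uk pool02.example.ac.uk
> DPM TRUST    se-head.example.ac.uk pool01.example.ac.uk pool02.example.ac.uk
> DPNS TRUST   se-head.example.ac.uk pool01.example.ac.uk pool02.example.ac.uk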
>
>
> Sam
>
>
>
>> Cheers,
>> Greig
>>
>> Graeme Stewart wrote, On 09/01/09 13:33:
>>
>>> Hi Matt
>>>
>>> We also suffer from problems here, and making sure that our CRLs were
>>> bang up to date did not cure it (although this is worth doing). When
>>> we checked with other DPM sites they also seemed to see the same issue
>>> (grep for the error message in the logs), but there seems to be some
>>> phasing or timing issue which means that it affects certain
>>> certificates more often than others. We can go for a week with no SAM
>>> failures, then get 2 a day for 3 days, then they disappear again.
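>>>
>>> One way to see that skew is to tally the errors per client host rather
>>> than just listing unique ones, e.g. (the same grep as elsewhere in this
>>> thread, but with a count):
>>>
>>> grep "CGSI-gSOAP: Error reading token data header: Connection closed" log \
>>>   | cut -d\( -f2 | cut -d\) -f1 | sort | uniq -c | sort -rn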
>>>
>>> When we asked the DPM people they said that it was very hard to
>>> identify what was causing the error - it's a very generic message.
>>>
>>> Death to X509...
>>>
>>> We have a PPS DPM for ATLAS and it does seem that the second SE does
>>> not suffer from this problem as much, which hints at a loading issue.
>>>
>>> Cheers
>>>
>>> Graeme
>>>
>>>
>>> On Fri, Jan 9, 2009 at 1:01 PM, Matt Doidge <[log in to unmask]> wrote:
>>>
>>>
>>>> Hello, thanks for the reply.
>>>> The fetch_crl cron runs at a similar interval (every 6 hours) but at 27
>>>> minutes past the hour, so after the failures. Would increasing its
>>>> frequency (say, to every 4 hours) be a plan to prevent stale CRLs?
>>>> Although I'd be surprised if things went bad that quickly every day.
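>>>>
>>>> If we do bump the frequency, it should just be a matter of editing the
>>>> cron entry, something like this (illustrative only; the script path and
>>>> options depend on the fetch-crl version installed):
>>>>
>>>> # /etc/cron.d/fetch-crl: 27 past every 4th hour instead of every 6th
>>>> 27 */4 * * *  root  /usr/sbin/fetch-crl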
>>>>
>>>> I'll shunt around the timing of the mysql backups and see if that makes a
>>>> difference; let's see what happens over the weekend.
>>>>
>>>> Have a good weekend all,
>>>> Matt
>>>>
>>>> Greig A. Cowan wrote:
>>>>
>>>>
>>>>> Hi Matt,
>>>>>
>>>>> When does fetch-crl run? gSOAP errors like that are often caused by
>>>>> out-of-date CRLs.
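>>>>>
>>>>> You can check for stale ones directly; a quick sketch, assuming the
>>>>> standard /etc/grid-security/certificates layout:
>>>>>
>>>>> # print the nextUpdate of every installed CRL so stale ones stand out
>>>>> for crl in /etc/grid-security/certificates/*.r0; do
>>>>>     printf '%s: ' "$crl"
>>>>>     openssl crl -in "$crl" -noout -nextupdate
>>>>> done
>>>>>
>>>>> Any CRL whose nextUpdate is in the past is stale.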
>>>>>
>>>>> Can you change the MySQL backup to a different time to see if it
>>>>> correlates with the SAM failures?
>>>>>
>>>>> Greig
>>>>>
>>>>> Matt Doidge wrote, On 09/01/09 11:59:
>>>>>
>>>>>
>>>>>> Heya guys, and Happy 2009 to all,
>>>>>>
>>>>>> We're regularly failing srm SAM tests at ~6.13 and ~18.13 every day with
>>>>>> the error message pasted below. Such regular failing sets off the obvious
>>>>>> alarm bells, and I immediately checked the cron jobs. Both the
>>>>>> edg-mkgridmap and our mysql backup happen at the time of these failures,
>>>>>> but as these are 6-hourly cronjobs I would also expect them to interfere
>>>>>> with the midnight and midday tests. Also the error message doesn't quite
>>>>>> fit with what I'd expect (last time we saw a similar error message it was
>>>>>> caused by network problems between the worker nodes/CE and the SE). I'd
>>>>>> appreciate any wisdom on this matter.
>>>>>>
>>>>>> cheers,
>>>>>> Matt
>>>>>>
>>>>>> + lcg-cr --version
>>>>>> lcg_util-1.6.15
>>>>>> GFAL-client-1.10.17
>>>>>> + set +x
>>>>>>
>>>>>> + lcg-cr -t 120 -v --vo ops file:/home/samops/.same/SE/testFile.txt -l
>>>>>> lfn:SE-lcg-cr-fal-pygrid-30.lancs.ac.uk-1231481507 -d
>>>>>> fal-pygrid-30.lancs.ac.uk
>>>>>> Using grid catalog type: lfc
>>>>>> Using grid catalog : prod-lfc-shared-central.cern.ch
>>>>>> Using LFN : /grid/ops/SAM/SE-lcg-cr-fal-pygrid-30.lancs.ac.uk-1231481507
>>>>>> [BDII] sam-bdii.cern.ch:2170: Warning, no GlueVOInfo information found
>>>>>> about tag '(null)' and SE 'fal-pygrid-30.lancs.ac.uk'
>>>>>> SE type: SRMv1
>>>>>> Using SURL :
>>>>>> srm://fal-pygrid-30.lancs.ac.uk/dpm/lancs.ac.uk/home/ops/generated/2009-01-09/file33df0c61-861c-4f81-9efa-3c6999a6d6d1
>>>>>> Alias registered in Catalog:
>>>>>> lfn:/grid/ops/SAM/SE-lcg-cr-fal-pygrid-30.lancs.ac.uk-1231481507
>>>>>> Alias registered in Catalog:
>>>>>> lfn:/grid/ops/SAM/SE-lcg-cr-fal-pygrid-30.lancs.ac.uk-1231481507
>>>>>> Alias registered in Catalog:
>>>>>> lfn:/grid/ops/SAM/SE-lcg-cr-fal-pygrid-30.lancs.ac.uk-1231481507
>>>>>> Alias registered in Catalog:
>>>>>> lfn:/grid/ops/SAM/SE-lcg-cr-fal-pygrid-30.lancs.ac.uk-1231481507
>>>>>> Alias registered in Catalog:
>>>>>> lfn:/grid/ops/SAM/SE-lcg-cr-fal-pygrid-30.lancs.ac.uk-1231481507
>>>>>> [SE][put] httpg://fal-pygrid-30.lancs.ac.uk:8443/srm/managerv1:
>>>>>> CGSI-gSOAP: Error reading token data header: Connection closed
>>>>>> lcg_cr: Operation now in progress
>>>>>> + result=1
>>>>>> + set +x
>>>>>>
>>>>>>
>>>>>>
>>>
>>>
>>>
>> --
>> The University of Edinburgh is a charitable body, registered in
>> Scotland, with registration number SC005336.
>>
>
>