For Durham:
[root@se01 srmv1]# grep "CGSI-gSOAP: Error reading token data header:
Connection closed" log|cut -d\( -f2|cut -d\) -f1|sort -u
monb002.cern.ch
monb003.cern.ch
sam111.cern.ch
svr031.gla.scotgrid.ac.uk
[root@se01 srmv1]# grep "CGSI-gSOAP: Error reading token data header:
Connection closed" log|cut -d\( -f2|cut -d\) -f1|wc -l
68
We have seen random SAM test failures with this problem, though I have not
been able to attribute it to excessive load; Ganglia looks OK, for instance.
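For what it's worth, the two greps can be combined into one pass that also ranks clients by error count. A sketch below, run against hypothetical sample data (the log lines are made up for illustration; in practice point LOG at /var/log/srmv1/log):

```shell
# Count "Connection closed" errors per client host in a DPM srmv1 log.
# LOG here is fabricated sample data; use /var/log/srmv1/log for real.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
01/09 06:13:01 put: request from (monb002.cern.ch): CGSI-gSOAP: Error reading token data header: Connection closed
01/09 06:14:22 put: request from (monb002.cern.ch): CGSI-gSOAP: Error reading token data header: Connection closed
01/09 18:13:05 put: request from (sam111.cern.ch): CGSI-gSOAP: Error reading token data header: Connection closed
01/09 19:00:00 get: request from (sam111.cern.ch): transfer OK
EOF

# Same extraction as the commands above (host between the first parentheses),
# but tallied per host and sorted by frequency, highest first.
counts=$(grep 'CGSI-gSOAP: Error reading token data header: Connection closed' "$LOG" \
  | cut -d'(' -f2 | cut -d')' -f1 \
  | sort | uniq -c | sort -rn)
echo "$counts"
rm -f "$LOG"
```

With the sample data this prints monb002.cern.ch first with a count of 2, which makes it easy to spot which UI or monitoring host is hammering the SRM.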
Phil
On Fri, 9 Jan 2009, John Bland wrote:
> For Liverpool:
>
> [root@hepgrid11 srmv1]# grep "CGSI-gSOAP: Error reading token data header:
> Connection closed" log|cut -d\( -f2|cut -d\) -f1|sort -u
> monb002.cern.ch
> monb003.cern.ch
> niels004.tier2.hep.manchester.ac.uk
> sam111.cern.ch
> [root@hepgrid11 srmv1]# grep "CGSI-gSOAP: Error reading token data header:
> Connection closed" log|cut -d\( -f2|cut -d\) -f1|wc -l
> 66
>
> Similarly, we see no SAM errors at this sort of level. Still, I'd rather
> they didn't happen.
>
> John
>
> Greig A. Cowan wrote:
>> As a test, can people do the following in /var/log/srmv1:
>>
>> [root@srm srmv1]# grep "CGSI-gSOAP: Error reading token data header:
>> Connection closed" log|cut -d\( -f2|cut -d\) -f1|sort -u
>> alifarm18.ct.infn.it
>> alifarm27.ct.infn.it
>> alifarm29.ct.infn.it
>> alifarm32.ct.infn.it
>> alifarm33.ct.infn.it
>> alifarm40.ct.infn.it
>> alifarm42.ct.infn.it
>> gridfw-ext.cs.tcd.ie
>> monb002.cern.ch
>> monb003.cern.ch
>> sam111.cern.ch
>> svr031.gla.scotgrid.ac.uk
>> unknown
>> w-wn0476.grid.sinica.edu.tw
>>
>> [root@srm srmv1]# grep "CGSI-gSOAP: Error reading token data header:
>> Connection closed" log|cut -d\( -f2|cut -d\) -f1|wc -l
>> 73
>>
>> This shows the number of times this error has cropped up at Edinburgh
>> today. We rarely see SAM failures alongside this error message, though, so
>> for us this behaviour is somewhat OK and "normal".
>>
>> Graeme, I notice that svr031 pops up a lot. I think this is the Glasgow UI,
>> right?
>>
>> Cheers,
>> Greig
>>
>> Graeme Stewart wrote, On 09/01/09 13:33:
>>> Hi Matt
>>>
>>> We also suffer from problems here and making sure that our CRLs were
>>> bang up to date did not cure it (although this is worth doing). When
>>> we checked with other DPM sites they also seemed to see the same issue
>>> (grep for the error message in the logs), but there seems to be some
>>> phasing or timing issue which means that it affects certain
>>> certificates more often than others. We can go for a week with no SAM
>>> failures, then get 2 a day for 3 days, then they disappear again.
>>>
>>> When we asked the DPM people they said that it was very hard to
>>> identify what was causing the error - it's a very generic message.
>>>
>>> Death to X509...
>>>
>>> We have a PPS DPM for ATLAS and it does seem that the second SE does
>>> not suffer from this problem as much, which hints at a loading issue.
>>>
>>> Cheers
>>>
>>> Graeme
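[A quick sanity check along Graeme's CRL lines: every installed CRL carries a nextUpdate field that openssl can print, and anything already past it is stale. The is_stale helper below is a hypothetical name, and /etc/grid-security/certificates is the usual gLite location; adjust for your setup.]

```shell
# is_stale: hypothetical helper - takes the date portion of openssl's
# "nextUpdate=..." output and reports whether that time has passed.
is_stale() {
    exp=$(date -d "$1" +%s)    # GNU date parses openssl's date strings
    now=$(date +%s)
    if [ "$exp" -lt "$now" ]; then echo stale; else echo ok; fi
}

# Walk the usual CRL directory on a gLite node (adjust path as needed).
for crl in /etc/grid-security/certificates/*.r0; do
    [ -e "$crl" ] || continue    # glob didn't match: no CRLs installed
    next=$(openssl crl -in "$crl" -noout -nextupdate | cut -d= -f2)
    printf '%-60s %s (%s)\n' "$crl" "$(is_stale "$next")" "$next"
done
```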
>>>
>>>
>>> On Fri, Jan 9, 2009 at 1:01 PM, Matt Doidge <[log in to unmask]> wrote:
>>>
>>>> Hello, thanks for the reply.
>>>> The fetch_crl cron runs at a similar interval (every 6 hours) but at 27
>>>> minutes past the hour, so after the failures. Would increasing its
>>>> frequency (say, to every 4 hours) be a way to prevent stale CRLs? Though
>>>> I'd be surprised if things went bad that quickly every day.
>>>>
>>>> I'll shift the timing of the MySQL backups around and see if that makes
>>>> a difference; let's see what happens over the weekend.
>>>>
>>>> Have a good weekend all,
>>>> Matt
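[If the timing does point at stale CRLs, one option along the lines Matt suggests is to tighten the fetch-crl schedule so a fresh run lands comfortably before the ~06:13/~18:13 tests. A hypothetical /etc/cron.d entry; the times, path, and invocation are illustrative, not from the original setup:]

```
# Refresh CRLs every 4 hours at :45, well ahead of the 06:13/18:13 SAM tests
45 */4 * * * root /usr/sbin/fetch-crl
```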
>>>>
>>>> Greig A. Cowan wrote:
>>>>
>>>>> Hi Matt,
>>>>>
>>>>> When does fetch-crl run? gSOAP errors like that are often caused by
>>>>> out-of-date CRLs.
>>>>>
>>>>> Can you move the MySQL backup to a different time, to see whether the
>>>>> SAM failures move with it?
>>>>>
>>>>> Greig
>>>>>
>>>>> Matt Doidge wrote, On 09/01/09 11:59:
>>>>>
>>>>>> Heya guys, and Happy 2009 to all,
>>>>>>
>>>>>> We're regularly failing SRM SAM tests at ~06:13 and ~18:13 every day
>>>>>> with the error message pasted below. Such regular failures set off the
>>>>>> obvious alarm bells, so I immediately checked the cron jobs. Both
>>>>>> edg-mkgridmap and our MySQL backup run at the time of these failures,
>>>>>> but as they are 6-hourly cron jobs I would also expect them to
>>>>>> interfere with the midnight and midday tests. Also, the error message
>>>>>> doesn't quite fit what I'd expect (last time we saw a similar error it
>>>>>> was caused by network problems between the worker nodes/CE and the
>>>>>> SE). I'd appreciate any wisdom on this matter.
>>>>>>
>>>>>> cheers,
>>>>>> Matt
>>>>>>
>>>>>> + lcg-cr --version
>>>>>> lcg_util-1.6.15
>>>>>> GFAL-client-1.10.17
>>>>>> + set +x
>>>>>>
>>>>>> + lcg-cr -t 120 -v --vo ops file:/home/samops/.same/SE/testFile.txt -l
>>>>>> lfn:SE-lcg-cr-fal-pygrid-30.lancs.ac.uk-1231481507 -d
>>>>>> fal-pygrid-30.lancs.ac.uk
>>>>>> Using grid catalog type: lfc
>>>>>> Using grid catalog : prod-lfc-shared-central.cern.ch
>>>>>> Using LFN :
>>>>>> /grid/ops/SAM/SE-lcg-cr-fal-pygrid-30.lancs.ac.uk-1231481507
>>>>>> [BDII] sam-bdii.cern.ch:2170: Warning, no GlueVOInfo information found
>>>>>> about tag '(null)' and SE 'fal-pygrid-30.lancs.ac.uk'
>>>>>> SE type: SRMv1
>>>>>> Using SURL :
>>>>>> srm://fal-pygrid-30.lancs.ac.uk/dpm/lancs.ac.uk/home/ops/generated/2009-01-09/file33df0c61-861c-4f81-9efa-3c6999a6d6d1
>>>>>> Alias registered in Catalog:
>>>>>> lfn:/grid/ops/SAM/SE-lcg-cr-fal-pygrid-30.lancs.ac.uk-1231481507
>>>>>> Alias registered in Catalog:
>>>>>> lfn:/grid/ops/SAM/SE-lcg-cr-fal-pygrid-30.lancs.ac.uk-1231481507
>>>>>> Alias registered in Catalog:
>>>>>> lfn:/grid/ops/SAM/SE-lcg-cr-fal-pygrid-30.lancs.ac.uk-1231481507
>>>>>> Alias registered in Catalog:
>>>>>> lfn:/grid/ops/SAM/SE-lcg-cr-fal-pygrid-30.lancs.ac.uk-1231481507
>>>>>> Alias registered in Catalog:
>>>>>> lfn:/grid/ops/SAM/SE-lcg-cr-fal-pygrid-30.lancs.ac.uk-1231481507
>>>>>> [SE][put] httpg://fal-pygrid-30.lancs.ac.uk:8443/srm/managerv1:
>>>>>> CGSI-gSOAP: Error reading token data header: Connection closed
>>>>>> lcg_cr: Operation now in progress
>>>>>> + result=1
>>>>>> + set +x
>>>>>>
>>>>>>
--
Phil Roffe - [log in to unmask]
IPPP, Department of Physics, Durham University,
Science Laboratories, South Road, Durham, DH1 3LE
Direct Dial: +44 (0)191 3343704
Office: +44 (0)191 334 3811