As a test, can people run the following in /var/log/srmv1:
[root@srm srmv1]# grep "CGSI-gSOAP: Error reading token data header: Connection closed" log|cut -d\( -f2|cut -d\) -f1|sort -u
alifarm18.ct.infn.it
alifarm27.ct.infn.it
alifarm29.ct.infn.it
alifarm32.ct.infn.it
alifarm33.ct.infn.it
alifarm40.ct.infn.it
alifarm42.ct.infn.it
gridfw-ext.cs.tcd.ie
monb002.cern.ch
monb003.cern.ch
sam111.cern.ch
svr031.gla.scotgrid.ac.uk
unknown
w-wn0476.grid.sinica.edu.tw
[root@srm srmv1]# grep "CGSI-gSOAP: Error reading token data header: Connection closed" log|cut -d\( -f2|cut -d\) -f1|wc -l
73
The first command lists the client hosts that triggered the error; the
second shows how many times the error has cropped up at Edinburgh today
(73). We almost never see SAM failures with this error message, though,
so for us this behaviour is more or less "normal".
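To see whether the errors cluster at particular times of day (Matt's SAM
failures, quoted below, occur at around 06:13 and 18:13), something like
the following rough sketch might help. It assumes that each log line
carries an HH:MM:SS timestamp, which may not match your log format
exactly, and the function name errors_per_hour is invented for this
example.

```shell
# Hypothetical helper: histogram of the CGSI-gSOAP error by hour of day.
# Assumption: the first HH:MM:SS pattern on each matching line is its timestamp.
errors_per_hour() {
    grep "CGSI-gSOAP: Error reading token data header: Connection closed" "$1" \
        | awk 'match($0, /[0-9][0-9]:[0-9][0-9]:[0-9][0-9]/) { print substr($0, RSTART, 2) }' \
        | sort | uniq -c
}
```

Run it as "errors_per_hour log" from /var/log/srmv1; each output line is
a count followed by the hour in which those errors were logged.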
Graeme, I notice that svr031 pops up a lot. I think this is the Glasgow
UI, right?
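To put a number on "a lot", a small variation on the pipeline above
could count the errors per host instead of just listing the hosts. This
is only a sketch: it keeps the same assumption as the original commands
(the client host sits in the first pair of parentheses on each matching
line), and the function name count_gsoap_errors is made up for
illustration.

```shell
# Hypothetical helper: count CGSI-gSOAP errors per client host, busiest first.
# Assumption: the host name is the text inside the first "(...)" on the line,
# exactly as in the cut-based pipeline used earlier in this mail.
count_gsoap_errors() {
    grep "CGSI-gSOAP: Error reading token data header: Connection closed" "$1" \
        | cut -d\( -f2 | cut -d\) -f1 \
        | sort | uniq -c | sort -rn
}
```

Run it as "count_gsoap_errors log" from /var/log/srmv1; the only change
from the first command is that "sort -u" becomes "sort | uniq -c | sort -rn".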
Cheers,
Greig
Graeme Stewart wrote, On 09/01/09 13:33:
> Hi Matt
>
> We also suffer from problems here and making sure that our CRLs were
> bang up to date did not cure it (although this is worth doing). When
> we checked with other DPM sites they also seemed to see the same issue
> (grep for the error message in the logs), but there seems to be some
> phasing or timing issue which means that it affects certain
> certificates more often than others. We can go for a week with no SAM
> failures, then get 2 a day for 3 days, then they disappear again.
>
> When we asked the DPM people they said that it was very hard to
> identify what was causing the error - it's a very generic message.
>
> Death to X509...
>
> We have a PPS DPM for ATLAS and it does seem that the second SE does
> not suffer from this problem as much, which hints at a loading issue.
>
> Cheers
>
> Graeme
>
>
> On Fri, Jan 9, 2009 at 1:01 PM, Matt Doidge wrote:
>
>> Hello, thanks for the reply.
>> The fetch_crl cron runs at a similar interval (every 6 hours) but at 27
>> minutes past the hour, so after the failures. Would increasing its
>> frequency (say to every 4 hours) be a plan to prevent stale CRLs? Although
>> I'd be surprised if things went bad that quickly every day.
>>
>> I'll shunt around the timing of the mysql backups and see if that makes a
>> difference; let's see what happens over the weekend.
>>
>> Have a good weekend all,
>> Matt
>>
>> Greig A. Cowan wrote:
>>
>>> Hi Matt,
>>>
>>> When does fetch-crl run? gSOAP errors like that are often caused by
>>> out-of-date CRLs.
>>>
>>> Can you change the MySQL backup to a different time to see if it
>>> correlates with the SAM failures?
>>>
>>> Greig
>>>
>>> Matt Doidge wrote, On 09/01/09 11:59:
>>>
>>>> Heya guys, and Happy 2009 to all,
>>>>
>>>> We're regularly failing SRM SAM tests at ~06:13 and ~18:13 every day with
>>>> the error message pasted below. Such regular failing sets off the obvious
>>>> alarm bells, and I immediately checked the cron jobs. Both the edg-mkgridmap
>>>> and our mysql backup happen at the time of these failures, but as these are
>>>> 6 hourly cronjobs I would also expect them to interfere with the midnight
>>>> and midday tests. Also the error message doesn't quite fit with what I'd
>>>> expect (last time we saw a similar error message it was caused by network
>>>> problems between the worker nodes/CE and the SE). I'd appreciate any wisdom
>>>> on this matter.
>>>>
>>>> cheers,
>>>> Matt
>>>>
>>>> + lcg-cr --version
>>>> lcg_util-1.6.15
>>>> GFAL-client-1.10.17
>>>> + set +x
>>>>
>>>> + lcg-cr -t 120 -v --vo ops file:/home/samops/.same/SE/testFile.txt -l
>>>> lfn:SE-lcg-cr-fal-pygrid-30.lancs.ac.uk-1231481507 -d
>>>> fal-pygrid-30.lancs.ac.uk
>>>> Using grid catalog type: lfc
>>>> Using grid catalog : prod-lfc-shared-central.cern.ch
>>>> Using LFN : /grid/ops/SAM/SE-lcg-cr-fal-pygrid-30.lancs.ac.uk-1231481507
>>>> [BDII] sam-bdii.cern.ch:2170: Warning, no GlueVOInfo information found
>>>> about tag '(null)' and SE 'fal-pygrid-30.lancs.ac.uk'
>>>> SE type: SRMv1
>>>> Using SURL :
>>>> srm://fal-pygrid-30.lancs.ac.uk/dpm/lancs.ac.uk/home/ops/generated/2009-01-09/file33df0c61-861c-4f81-9efa-3c6999a6d6d1
>>>> Alias registered in Catalog:
>>>> lfn:/grid/ops/SAM/SE-lcg-cr-fal-pygrid-30.lancs.ac.uk-1231481507
>>>> Alias registered in Catalog:
>>>> lfn:/grid/ops/SAM/SE-lcg-cr-fal-pygrid-30.lancs.ac.uk-1231481507
>>>> Alias registered in Catalog:
>>>> lfn:/grid/ops/SAM/SE-lcg-cr-fal-pygrid-30.lancs.ac.uk-1231481507
>>>> Alias registered in Catalog:
>>>> lfn:/grid/ops/SAM/SE-lcg-cr-fal-pygrid-30.lancs.ac.uk-1231481507
>>>> Alias registered in Catalog:
>>>> lfn:/grid/ops/SAM/SE-lcg-cr-fal-pygrid-30.lancs.ac.uk-1231481507
>>>> [SE][put] httpg://fal-pygrid-30.lancs.ac.uk:8443/srm/managerv1:
>>>> CGSI-gSOAP: Error reading token data header: Connection closed
>>>> lcg_cr: Operation now in progress
>>>> + result=1
>>>> + set +x
>>>>
>>>>
>
>
>
>
--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.