Just picking up the thread here: we're also seeing these errors (the ones John is grepping
for below) in the CASTOR SRMs (I don't know whether we're seeing them in _all_ SRMs, but I
could check). On a random ATLAS SRM I'm seeing 36 in the last 9 hours.
CASTOR is using pretty much the same CGSI-gSOAP plugin that DPM is using.
Secondly, as I read Matt's original mail, it also appeared that information had
dropped out of the information system?
--jens
-----Original Message-----
From: GRIDPP2: Deployment and support of SRM and local storage management on behalf of John Bland
Sent: Fri 09/01/2009 14:12
Subject: Re: Regular SRM SAM test failures
For Liverpool:
[root@hepgrid11 srmv1]# grep "CGSI-gSOAP: Error reading token data header: Connection closed" log|cut -d\( -f2|cut -d\) -f1|sort -u
monb002.cern.ch
monb003.cern.ch
niels004.tier2.hep.manchester.ac.uk
sam111.cern.ch
[root@hepgrid11 srmv1]# grep "CGSI-gSOAP: Error reading token data header: Connection closed" log|cut -d\( -f2|cut -d\) -f1|wc -l
66
Similarly, we see no SAM failures from errors at this sort of level. Still, I'd
rather they didn't happen.
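
In case it helps anyone repeating the check, here is a rough sketch of a per-host tally, which might show whether a few clients account for most of the errors. It reuses the grep and cut stages from the commands above and only adds a counting stage. The function name is made up for illustration, and it assumes the client host is the first parenthesised field on each matching line, which is what the cut stages above already rely on.

```shell
#!/bin/sh
# Tally "Connection closed" CGSI-gSOAP errors per client host, busiest first.
# Same grep and cut stages as the commands above; only the final
# sort | uniq -c | sort -rn stage is new.
# Assumption: the client host is the first parenthesised field on each
# matching line. The function name is hypothetical.
count_gsoap_errors_by_host() {
    logfile=${1:-log}   # defaults to ./log, i.e. run from the srmv1 log directory
    grep "CGSI-gSOAP: Error reading token data header: Connection closed" "$logfile" \
        | cut -d\( -f2 | cut -d\) -f1 \
        | sort | uniq -c | sort -rn
}
```

Called as `count_gsoap_errors_by_host log` from the log directory, it should print one line per host with the number of matching errors in front, highest count first.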
John
Greig A. Cowan wrote:
> As a test, can people do the following in /var/log/srmv1:
>
> [root@srm srmv1]# grep "CGSI-gSOAP: Error reading token data header: Connection closed" log|cut -d\( -f2|cut -d\) -f1|sort -u
> alifarm18.ct.infn.it
> alifarm27.ct.infn.it
> alifarm29.ct.infn.it
> alifarm32.ct.infn.it
> alifarm33.ct.infn.it
> alifarm40.ct.infn.it
> alifarm42.ct.infn.it
> gridfw-ext.cs.tcd.ie
> monb002.cern.ch
> monb003.cern.ch
> sam111.cern.ch
> svr031.gla.scotgrid.ac.uk
> unknown
> w-wn0476.grid.sinica.edu.tw
>
> [root@srm srmv1]# grep "CGSI-gSOAP: Error reading token data header: Connection closed" log|cut -d\( -f2|cut -d\) -f1|wc -l
> 73
>
> This shows the number of times this error has cropped up at Edinburgh
> today. We never really see SAM failures with this error message though
> so for us this behaviour is somewhat OK and "normal".
>
> Graeme, I notice that svr031 pops up a lot. I think this is the Glasgow
> UI, right?
>
> Cheers,
> Greig
>
> Graeme Stewart wrote, On 09/01/09 13:33:
>> Hi Matt
>>
>> We also suffer from problems here and making sure that our CRLs were
>> bang up to date did not cure it (although this is worth doing). When
>> we checked with other DPM sites they also seemed to see the same issue
>> (grep for the error message in the logs), but there seems to be some
>> phasing or timing issue which means that it affects certain
>> certificates more often than others. We can go for a week with no SAM
>> failures, then get 2 a day for 3 days, then they disappear again.
>>
>> When we asked the DPM people they said that it was very hard to
>> identify what was causing the error - it's a very generic message.
>>
>> Death to X509...
>>
>> We have a PPS DPM for ATLAS and it does seem that the second SE does
>> not suffer from this problem as much, which hints at a loading issue.
>>
>> Cheers
>>
>> Graeme
>>
>>
>> On Fri, Jan 9, 2009 at 1:01 PM, Matt Doidge wrote:
>>
>>> Hello, thanks for the reply.
>>> The fetch_crl cron runs at a similar interval (every 6 hours) but at 27
>>> minutes past the hour, so after the failures. Would increasing its
>>> frequency (say to every 4 hours) be a plan to prevent stale CRLs? Although
>>> I'd be surprised if things went bad that quickly every day.
>>>
>>> I'll shunt around the timing of the mysql backups and see if that makes a
>>> difference; let's see what happens over the weekend.
>>>
>>> Have a good weekend all,
>>> Matt
>>>
>>> Greig A. Cowan wrote:
>>>
>>>> Hi Matt,
>>>>
>>>> When does fetch-crl run? gSOAP errors like that are often caused by
>>>> out of date CRLs.
>>>>
>>>> Can you change the MySQL backup to a different time to see if it
>>>> correlates with the SAM failures?
>>>>
>>>> Greig
>>>>
>>>> Matt Doidge wrote, On 09/01/09 11:59:
>>>>
>>>>> Heya guys, and Happy 2009 to all,
>>>>>
>>>>> We're regularly failing SRM SAM tests at ~06:13 and ~18:13 every day with
>>>>> the error message pasted below. Such regular failures set off the obvious
>>>>> alarm bells, and I immediately checked the cron jobs. Both the edg-mkgridmap
>>>>> and our mysql backup happen at the time of these failures, but as these are
>>>>> 6-hourly cron jobs I would also expect them to interfere with the midnight
>>>>> and midday tests. Also, the error message doesn't quite fit with what I'd
>>>>> expect (the last time we saw a similar error message it was caused by network
>>>>> problems between the worker nodes/CE and the SE). I'd appreciate any wisdom
>>>>> on this matter.
>>>>>
>>>>> cheers,
>>>>> Matt
>>>>>
>>>>> + lcg-cr --version
>>>>> lcg_util-1.6.15
>>>>> GFAL-client-1.10.17
>>>>> + set +x
>>>>>
>>>>> + lcg-cr -t 120 -v --vo ops file:/home/samops/.same/SE/testFile.txt -l
>>>>> lfn:SE-lcg-cr-fal-pygrid-30.lancs.ac.uk-1231481507 -d
>>>>> fal-pygrid-30.lancs.ac.uk
>>>>> Using grid catalog type: lfc
>>>>> Using grid catalog : prod-lfc-shared-central.cern.ch
>>>>> Using LFN :
>>>>> /grid/ops/SAM/SE-lcg-cr-fal-pygrid-30.lancs.ac.uk-1231481507
>>>>> [BDII] sam-bdii.cern.ch:2170: Warning, no GlueVOInfo information found
>>>>> about tag '(null)' and SE 'fal-pygrid-30.lancs.ac.uk'
>>>>> SE type: SRMv1
>>>>> Using SURL :
>>>>> srm://fal-pygrid-30.lancs.ac.uk/dpm/lancs.ac.uk/home/ops/generated/2009-01-09/file33df0c61-861c-4f81-9efa-3c6999a6d6d1
>>>>>
>>>>> Alias registered in Catalog:
>>>>> lfn:/grid/ops/SAM/SE-lcg-cr-fal-pygrid-30.lancs.ac.uk-1231481507
>>>>> Alias registered in Catalog:
>>>>> lfn:/grid/ops/SAM/SE-lcg-cr-fal-pygrid-30.lancs.ac.uk-1231481507
>>>>> Alias registered in Catalog:
>>>>> lfn:/grid/ops/SAM/SE-lcg-cr-fal-pygrid-30.lancs.ac.uk-1231481507
>>>>> Alias registered in Catalog:
>>>>> lfn:/grid/ops/SAM/SE-lcg-cr-fal-pygrid-30.lancs.ac.uk-1231481507
>>>>> Alias registered in Catalog:
>>>>> lfn:/grid/ops/SAM/SE-lcg-cr-fal-pygrid-30.lancs.ac.uk-1231481507
>>>>> [SE][put] httpg://fal-pygrid-30.lancs.ac.uk:8443/srm/managerv1:
>>>>> CGSI-gSOAP: Error reading token data header: Connection closed
>>>>> lcg_cr: Operation now in progress
>>>>> + result=1
>>>>> + set +x
>>>>>
--
Dr John Bland, Systems Administrator
Room 220, Oliver Lodge
Particle Physics Group, University of Liverpool
Tel : 0151 794 2911
"I canna change the laws of physics, Captain!"