... on further inspection we did have a short period of high load this
morning (see
ganglia plot attached), just before the SAM failure at 7:48am.
Frustratingly we failed several tests immediately afterwards (until about
8:35, all SE related). Maybe it is load related - but then why would the
high load break our SE for roughly an hour and then fix itself?
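One way to test the load theory is to bucket the gSOAP errors by hour and see whether they actually cluster in that 7:48-8:35 window. A sketch, assuming each srmv1 log line starts with an "MM/DD HH:MM:SS" timestamp (the usual DPM daemon format - adjust the awk fields if yours differs):

```shell
# Hedged sketch: hourly histogram of the gSOAP "Connection closed" errors,
# run from /var/log/srmv1. Assumes log lines begin "MM/DD HH:MM:SS ...".
grep "CGSI-gSOAP: Error reading token data header: Connection closed" log \
  | awk '{ split($2, t, ":"); hour[$1 " " t[1] ":00"]++ }
         END { for (h in hour) print hour[h], h }' \
  | sort -k2
```

If the counts spike in the 07:00-08:00 bucket and fall away afterwards, that would back up the load explanation.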
One further thing to note: our new SE head node is a virtual machine. The
host machine itself has only a light load, so I don't believe there were
any outside factors, and the disk servers are real machines. In theory (at
least going by ganglia's indication of load over the last year), the SE
should work fine as a VM. In practice maybe this is not the case...
particularly given the way the SEs were hit during the recent ATLAS tests.
If necessary we can devirtualise the SE... but at the moment we seem to be
experiencing the same issues as other (non-virtualised) SEs.
Phil
On Fri, 9 Jan 2009, Philip Roffe wrote:
> For Durham:
>
> [root@se01 srmv1]# grep "CGSI-gSOAP: Error reading token data header:
> Connection closed" log|cut -d\( -f2|cut -d\) -f1|sort -u
> monb002.cern.ch
> monb003.cern.ch
> sam111.cern.ch
> svr031.gla.scotgrid.ac.uk
>
> [root@se01 srmv1]# grep "CGSI-gSOAP: Error reading token data header:
> Connection closed" log|cut -d\( -f2|cut -d\) -f1|wc -l
> 68
>
>
> We have seen random SAM test failures with this problem, though I have not
> been able to attribute them to excessive load... Ganglia looks OK, for instance.
>
> Phil
>
> On Fri, 9 Jan 2009, John Bland wrote:
>
>> For Liverpool:
>>
>> [root@hepgrid11 srmv1]# grep "CGSI-gSOAP: Error reading token data header:
>> Connection closed" log|cut -d\( -f2|cut -d\) -f1|sort -u
>> monb002.cern.ch
>> monb003.cern.ch
>> niels004.tier2.hep.manchester.ac.uk
>> sam111.cern.ch
>> [root@hepgrid11 srmv1]# grep "CGSI-gSOAP: Error reading token data header:
>> Connection closed" log|cut -d\( -f2|cut -d\) -f1|wc -l
>> 66
>>
>> Similarly, no SAM errors at this sort of level. Still, I'd rather they
>> didn't happen.
>>
>> John
>>
>> Greig A. Cowan wrote:
>>> As a test, can people do the following in /var/log/srmv1:
>>>
>>> [root@srm srmv1]# grep "CGSI-gSOAP: Error reading token data header:
>>> Connection closed" log|cut -d\( -f2|cut -d\) -f1|sort -u
>>> alifarm18.ct.infn.it
>>> alifarm27.ct.infn.it
>>> alifarm29.ct.infn.it
>>> alifarm32.ct.infn.it
>>> alifarm33.ct.infn.it
>>> alifarm40.ct.infn.it
>>> alifarm42.ct.infn.it
>>> gridfw-ext.cs.tcd.ie
>>> monb002.cern.ch
>>> monb003.cern.ch
>>> sam111.cern.ch
>>> svr031.gla.scotgrid.ac.uk
>>> unknown
>>> w-wn0476.grid.sinica.edu.tw
>>>
>>> [root@srm srmv1]# grep "CGSI-gSOAP: Error reading token data header:
>>> Connection closed" log|cut -d\( -f2|cut -d\) -f1|wc -l
>>> 73
>>>
>>> This shows the number of times this error has cropped up at Edinburgh
>>> today. We never really see SAM failures with this error message, though,
>>> so for us this behaviour is somewhat OK and "normal".
>>>
>>> Graeme, I notice that svr031 pops up a lot. I think this is the Glasgow
>>> UI, right?
>>>
>>> Cheers,
>>> Greig
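To quantify which clients dominate (e.g. whether svr031 really does pop up a lot), a small variant of the grep above swaps `sort -u` for a per-host count; same log file and error string, run from /var/log/srmv1:

```shell
# Same extraction as the grep above, but count occurrences per client host
# rather than just listing unique hosts; busiest sources sort to the top.
grep "CGSI-gSOAP: Error reading token data header: Connection closed" log \
  | cut -d\( -f2 | cut -d\) -f1 | sort | uniq -c | sort -rn
```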
>>>
>>> Graeme Stewart wrote, On 09/01/09 13:33:
>>>> Hi Matt
>>>>
>>>> We also suffer from problems here and making sure that our CRLs were
>>>> bang up to date did not cure it (although this is worth doing). When
>>>> we checked with other DPM sites they also seemed to see the same issue
>>>> (grep for the error message in the logs), but there seems to be some
>>>> phasing or timing issue which means that it affects certain
>>>> certificates more often than others. We can go for a week with no SAM
>>>> failures, then get 2 a day for 3 days, then they disappear again.
>>>>
>>>> When we asked the DPM people they said that it was very hard to
>>>> identify what was causing the error - it's a very generic message.
>>>>
>>>> Death to X509...
>>>>
>>>> We have a PPS DPM for ATLAS and it does seem that the second SE does
>>>> not suffer from this problem as much, which hints at a loading issue.
>>>>
>>>> Cheers
>>>>
>>>> Graeme
>>>>
>>>>
>>>> On Fri, Jan 9, 2009 at 1:01 PM, Matt Doidge <[log in to unmask]>
>>>> wrote:
>>>>
>>>>> Hello, thanks for the reply.
>>>>> The fetch_crl cron runs at a similar interval (every 6 hours) but at 27
>>>>> minutes past the hour - so after the failures. Would increasing its
>>>>> frequency (say to every 4 hours) be a plan to prevent stale CRLs?
>>>>> Although I'd be surprised if things went bad that quickly every day.
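On the stale-CRL theory, a quick check is the nextUpdate field of each installed CRL. A sketch, assuming the usual /etc/grid-security/certificates layout with `.r0` CRL files (both the path and the suffix are assumptions - adjust to your layout):

```shell
# Hedged sketch: print each CRL's nextUpdate time. A nextUpdate in the past
# means the CRL is stale, a classic cause of this gSOAP "Connection closed"
# error. Directory and .r0 suffix are the usual LCG layout - an assumption.
crl_next_updates() {
    dir=${1:-/etc/grid-security/certificates}
    for crl in "$dir"/*.r0; do
        [ -e "$crl" ] || continue
        printf '%s ' "$crl"
        openssl crl -in "$crl" -noout -nextupdate
    done
}
```

Run `crl_next_updates` just before the 06:13 test and any nextUpdate already in the past would point the finger at fetch_crl timing.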
>>>>>
>>>>> I'll shunt around the timing of the mysql backups and see if that makes
>>>>> a difference; let's see what happens over the weekend.
>>>>>
>>>>> Have a good weekend all,
>>>>> Matt
>>>>>
>>>>> Greig A. Cowan wrote:
>>>>>
>>>>>> Hi Matt,
>>>>>>
>>>>>> When does fetch-crl run? gSOAP errors like that are often caused by out
>>>>>> of
>>>>>> date CRLs.
>>>>>>
>>>>>> Can you change the MySQL backup to a different time to see if it
>>>>>> correlates with the SAM failures?
>>>>>>
>>>>>> Greig
>>>>>>
>>>>>> Matt Doidge wrote, On 09/01/09 11:59:
>>>>>>
>>>>>>> Heya guys, and Happy 2009 to all,
>>>>>>>
>>>>>>> We're regularly failing SRM SAM tests at ~06:13 and ~18:13 every day
>>>>>>> with the error message pasted below. Such regular failures set off
>>>>>>> the obvious alarm bells, and I immediately checked the cron jobs.
>>>>>>> Both edg-mkgridmap and our mysql backup run at the time of these
>>>>>>> failures, but as they are 6-hourly cron jobs I would also expect them
>>>>>>> to interfere with the midnight and midday tests. Also, the error
>>>>>>> message doesn't quite fit what I'd expect (last time we saw a similar
>>>>>>> error it was caused by network problems between the worker nodes/CE
>>>>>>> and the SE). I'd appreciate any wisdom on this matter.
>>>>>>>
>>>>>>> cheers,
>>>>>>> Matt
>>>>>>>
>>>>>>> + lcg-cr --version
>>>>>>> lcg_util-1.6.15
>>>>>>> GFAL-client-1.10.17
>>>>>>> + set +x
>>>>>>>
>>>>>>> + lcg-cr -t 120 -v --vo ops file:/home/samops/.same/SE/testFile.txt -l
>>>>>>> lfn:SE-lcg-cr-fal-pygrid-30.lancs.ac.uk-1231481507 -d
>>>>>>> fal-pygrid-30.lancs.ac.uk
>>>>>>> Using grid catalog type: lfc
>>>>>>> Using grid catalog : prod-lfc-shared-central.cern.ch
>>>>>>> Using LFN :
>>>>>>> /grid/ops/SAM/SE-lcg-cr-fal-pygrid-30.lancs.ac.uk-1231481507
>>>>>>> [BDII] sam-bdii.cern.ch:2170: Warning, no GlueVOInfo information found
>>>>>>> about tag '(null)' and SE 'fal-pygrid-30.lancs.ac.uk'
>>>>>>> SE type: SRMv1
>>>>>>> Using SURL :
>>>>>>> srm://fal-pygrid-30.lancs.ac.uk/dpm/lancs.ac.uk/home/ops/generated/2009-01-09/file33df0c61-861c-4f81-9efa-3c6999a6d6d1
>>>>>>> Alias registered in Catalog:
>>>>>>> lfn:/grid/ops/SAM/SE-lcg-cr-fal-pygrid-30.lancs.ac.uk-1231481507
>>>>>>> Alias registered in Catalog:
>>>>>>> lfn:/grid/ops/SAM/SE-lcg-cr-fal-pygrid-30.lancs.ac.uk-1231481507
>>>>>>> Alias registered in Catalog:
>>>>>>> lfn:/grid/ops/SAM/SE-lcg-cr-fal-pygrid-30.lancs.ac.uk-1231481507
>>>>>>> Alias registered in Catalog:
>>>>>>> lfn:/grid/ops/SAM/SE-lcg-cr-fal-pygrid-30.lancs.ac.uk-1231481507
>>>>>>> Alias registered in Catalog:
>>>>>>> lfn:/grid/ops/SAM/SE-lcg-cr-fal-pygrid-30.lancs.ac.uk-1231481507
>>>>>>> [SE][put] httpg://fal-pygrid-30.lancs.ac.uk:8443/srm/managerv1:
>>>>>>> CGSI-gSOAP: Error reading token data header: Connection closed
>>>>>>> lcg_cr: Operation now in progress
>>>>>>> + result=1
>>>>>>> + set +x
>>>>>>>
>>>>>>>
>>>>
>>>
>>
>
--
Phil Roffe - [log in to unmask]
IPPP, Department of Physics, Durham University,
Science Laboratories, South Road, Durham, DH1 3LE
Direct Dial: +44 (0)191 334 3704
Office: +44 (0)191 334 3811