On 30 Sep 2011, at 14:05, Chris Brew wrote:
> Hi Massimo,
>
> Our publishing is all worked out now but again last night/this morning we
> had a problem with one of our CreamCEs being blacklisted by a CERN WMS[1].
>
> Other WMSes were able to submit to it without problems[2] and I could submit
> through them without problems:
>
> ======================= glite-wms-job-status Success =====================
> BOOKKEEPING INFORMATION:
>
> Status info for the Job :
> https://lcglb01.gridpp.rl.ac.uk:9000/2FUq2HbIIihZKQVssCgVfg
> Current Status: Done (Success)
> Logged Reason(s):
> - job completed
> - Job Terminated Successfully
> Exit code: 0
> Status Reason: Job Terminated Successfully
> Destination: heplnx206.pp.rl.ac.uk:8443/cream-pbs-grid
> Submitted: Fri Sep 30 09:54:08 2011 BST
> ==========================================================================
>
> But not via the CERN WMS:
>
> ======================= glite-wms-job-status Success =====================
> BOOKKEEPING INFORMATION:
>
> Status info for the Job : https://wms208.cern.ch:9000/8wJMDjyD_od-XPQjWoEpIA
> Current Status: Aborted
> Logged Reason(s):
> - Transfer to CREAM failed due to exception: CREAM Register raised
> std::exception The endpoint is blacklisted
> - Transfer to CREAM failed due to exception: CREAM Register raised
> std::exception The endpoint is blacklisted
> - Transfer to CREAM failed due to exception: CREAM Register raised
> std::exception The endpoint is blacklisted
> - Transfer to CREAM failed due to exception: CREAM Register raised
> std::exception The endpoint is blacklisted
> - Transfer to CREAM failed due to exception: CREAM Register raised
> std::exception The endpoint is blacklisted
> - Transfer to CREAM failed due to exception: CREAM Register raised
> std::exception The endpoint is blacklisted
> - Transfer to CREAM failed due to exception: CREAM Register raised
> std::exception The endpoint is blacklisted
> - Transfer to CREAM failed due to exception: CREAM Register raised
> std::exception The endpoint is blacklisted
> - Transfer to CREAM failed due to exception: CREAM Register raised
> std::exception The endpoint is blacklisted
> - Transfer to CREAM failed due to exception: CREAM Register raised
> std::exception The endpoint is blacklisted
> - Transfer to CREAM failed due to exception: CREAM Register raised
> std::exception The endpoint is blacklisted
> Status Reason: hit job shallow retry count (10)
> Destination: heplnx206.pp.rl.ac.uk:8443/cream-pbs-grid
> Submitted: Fri Sep 30 10:19:51 2011 BST
> ==========================================================================
>
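To compare outputs like the two above without reading every retry line, a small parser can pull out the final status and count the blacklist rejections. This is only a sketch: `summarize_job_status` is a hypothetical helper, not part of the gLite tools, and it assumes the field labels appear exactly as in the pasted glite-wms-job-status output.

```python
def summarize_job_status(output):
    """Summarize pasted glite-wms-job-status output (hypothetical helper).

    Leading email quote markers ("> ") are stripped, so text copied
    straight from a mail thread can be passed in unchanged.
    """
    summary = {"status": None, "destination": None, "blacklisted_attempts": 0}
    for raw in output.splitlines():
        # Drop any leading '>' quote markers and surrounding whitespace.
        line = raw.lstrip("> ").strip()
        if line.startswith("Current Status:"):
            summary["status"] = line.split(":", 1)[1].strip()
        elif line.startswith("Destination:"):
            summary["destination"] = line.split(":", 1)[1].strip()
        elif "The endpoint is blacklisted" in line:
            # One such line is logged per failed transfer attempt.
            summary["blacklisted_attempts"] += 1
    return summary
```

Running it over the status output from each WMS gives a quick side-by-side of which ones are rejecting the CE and how many attempts were made before the job aborted.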
> Restarting tomcat did not seem to fix it, though restarting the node did
Do you mean the CE node?
> (although since the error is apparently the WMS refusing to submit to the
> CreamCE it is possible that blacklisting expired after my test job after
> restarting tomcat and before my test job after restarting the node).
> The glite-ce-cream.log shows connections from the WMS in question after I
> submit the job (only for delegation) with no apparent failures but no
> attempt to submit a job. (See attached fragments of the log file)
Please remember that a CE remains in the blacklist for 30 minutes (only EventQuery calls are allowed to that CE during this period).
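
The 30-minute window makes it possible to check whether a blacklisting could simply have expired between two test submissions. A minimal sketch, assuming the moment of blacklisting is known (the WMS job status output does not report it, so any timestamp passed in is an assumed input, and the function names are hypothetical):

```python
from datetime import datetime, timedelta

# 30 minutes is the blacklist duration stated in this thread;
# everything else in this sketch is illustrative.
BLACKLIST_DURATION = timedelta(minutes=30)


def blacklist_expiry(blacklisted_at):
    """Return the time at which the WMS should stop blacklisting the CE."""
    return blacklisted_at + BLACKLIST_DURATION


def still_blacklisted(blacklisted_at, now):
    """True if a submission at `now` would still hit the blacklist."""
    return now < blacklist_expiry(blacklisted_at)
```

If a test job after one restart fails and a test job after a later restart succeeds, comparing both submission times against the expiry tells you whether the second restart actually changed anything or the blacklist entry had merely timed out.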
>
> Load on the node itself and on the database seems to be very low throughout
> the period[3].
>
> This is happening to both of our UMD 1.1 Cream CEs submitting to a
> torque/maui batch system and using gLite 3.2 Argus as an authentication
> backend.
> It does not appear to be happening to our gLite 3.2 CreamCE not using Argus.
>
> At this point I'm lost for anything else to try; as far as I can tell,
> there is no evidence of a problem on the node.
>
> Any suggestions welcome - are there logs on the WMS side that can show which
> actual calls are timing out?
>
> Yours,
> Chris Brew
>
> (having done this much debugging I suppose I should put this in as a GGUS
> ticket - I'll do that after lunch).
>
> Addendum: Although I was successfully able to submit a job just after
> restarting the node the WMS seems to have blacklisted the CreamCE again now
> and we have later SAM failures [4] though with a more descriptive error:
>
> reason is EOF detected during communication. Probably service closed
> connection or SOCKET TIMEOUT
>
>
>
> [1] http://bit.ly/ppz5tW
> http://dashb-nagios-cms-dev.cern.ch/dashboard/request.py/testhistory?siteSelect3=All%20Sites&sites=T2_UK_SGrid_RALPP&serviceTypeSelect3=all&services=CREAMCE&tests=CREAMCE-org.cms.WN-analysis&tests=CREAMCE-org.cms.WN-basic&tests=CREAMCE-org.cms.WN-frontier&tests=CREAMCE-org.cms.WN-mc&tests=CREAMCE-org.cms.WN-squid&tests=CREAMCE-org.cms.WN-swinst&tests=CREAMCE-org.cms.glexec.WN-gLExec&tests=CREAMCE-org.sam.CREAMCE-JobSubmit&tests=CREAMCE-org.sam.glexec.CREAMCE-JobSubmit&servicename=heplnx206.pp.rl.ac.uk&timeRange=individual&start=2011-09-29&end=2011-10-01
>
> [2] http://bit.ly/mYLB2e
> https://gridppnagios.physics.ox.ac.uk/myegi/history/?facelist_values_regions=&facelist_values_sites=&facelist_values_services=6253%2C&profile=5&monitored=2&status=1&status=2&status=3&status=4&status=5&startdate=29-09-2011&enddate=
>
> [3] http://bit.ly/qOgtTI
> https://monitor.pp.rl.ac.uk/ganglia/?r=day&c=RAL+PP+LCG+Infrastructure&h=heplnx206.pp.rl.ac.uk
>
> [4] http://bit.ly/qRh5u1
> https://lcg-sam.cern.ch:8443/sam/sam.py?funct=TestResult&nodename=heplnx206.pp.rl.ac.uk&vo=CMS&testname=CREAMCE-org.sam.CREAMCE-JobSubmit&testtimestamp=1317381323
>
>> -----Original Message-----
>> From: Massimo Sgaravatto [mailto:[log in to unmask]]
>> Sent: 08 September 2011 06:25
>> To: LHC Computer Grid - Rollout
>> Cc: Brew, Chris (STFC,RAL,PPD)
>> Subject: Re: [LCG-ROLLOUT] CreamCEs keep getting blacklisted by WMS
>>
>> On 09/07/2011 04:27 PM, Chris Brew wrote:
>>> Hi,
>>>
>>> I have run:
>>>
>>> glite-ce-service-info -L 2 heplnx206.pp.rl.ac.uk
>>>
>>
>> Hi Chris
>>
>> Did you also try a submission ?
>>
>>
>>
>>> Against the CEs while they are blacklisted with no errors and other WMSes
>>> have continued to submit jobs with no error.
>>
>>
>> So is the CE blacklisted only for the submissions done by a specific WMS
>> (while everything works properly for jobs submitted by other WMSes)?
>>
>>
>> Cheers, Massimo
>>>
>>> I have occasionally at other times seen submission blocked because the
>>> number of FTP connections is above the threshold of 30 but that's always
>>> transitory while the WMS will block us until I reboot. (The load on the
>>> hardware is still low when that's over threshold so is it possible to
>>> increase it?).
>>>
>>> One suggestion I've had is to create extra indexes in the mysql DB but
>>> that's outside my area of competence. Indeed the mysql daemon is using a
>>> fair amount of CPU and has a good few connections open.
>>>
>>> Yours,
>>> Chris.
>>>
>>>> -----Original Message-----
>>>> From: LHC Computer Grid - Rollout [mailto:LCG-[log in to unmask]]
>>>> On Behalf Of Rodney Walker
>>>> Sent: 07 September 2011 14:55
>>>> To: [log in to unmask]
>>>> Subject: Re: [LCG-ROLLOUT] CreamCEs keep getting blacklisted by WMS
>>>>
>>>> Hi,
>>>> From a lay perspective - I do not know what blacklisting means for the
>>>> WMS - did you try, e.g.
>>>> $ glite-ce-allowed-submission lcg-lrz-ce2.grid.lrz.de
>>>> Job Submission to this CREAM CE is disabled
>>>>
>>>> for your CE and get "disabled"? The reason mine is now disabled is shown
>>>> by
>>>>
>>>> /opt/glite/bin/glite_cream_load_monitor --show
>>>> Threshold for Swap Usage: 95 => Detected value for Swap Usage: 100.00%
>>>>
>>>> And indeed a reboot will fix it, but also a restart. OTOH there are
>>>> other thresholds listed, which might affect you.
>>>>
>>>> Cheers,
>>>> Rod.
>>>>
>>>>
>>>> On 09/07/2011 03:43 PM, Massimo Sgaravatto - INFN Padova wrote:
>>>>> On Wed, 7 Sep 2011, Chris Brew wrote:
>>>>>
>>>>>> Hi Massimo,
>>>>>>
>>>>>> I'm getting it from the UMD repository so:
>>>>>>
>>>>>> [root@heplnx206 ~]# rpm -qa | grep cream
>>>>>> glite-ce-cream-1.13.2-1.sl5
>>>>>> glite-ce-cream-utils-1.1.0-3.sl5
>>>>>> glite-ce-yaim-cream-ce-4.2.0-3.sl5
>>>>>> emi-cream-ce-1.0.0-1.sl5
>>>>>
>>>>> Ok, I thought that you might be affected by this bug:
>>>>>
>>>>> https://savannah.cern.ch/bugs/?82567
>>>>>
>>>>> but this shouldn't be the case since your version of CREAM is recent
>>>>> enough
>>>>>
>>>>> When the CE is blacklisted can you try a glite-ce-job-submit towards
>>>>> that CE? Can you also check the glite-ce-cream.log*?
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> Though I have already "backported" the trustmanager fix from emi.
>>>>>
>>>>>
>>>>> Ok. I guess you know that updating the trustmanager rpm is not enough
>>>>> (also the relevant jar within ce-cream needs to be updated)
>>>>>
>>>>>
>>>>> Cheers, Massimo
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Chris.
>>>>>>
>>>>>> On 07/09/2011 14:33, "Massimo Sgaravatto - INFN Padova"
>>>>>> <[log in to unmask]> wrote:
>>>>>>
>>>>>>> Hi Chris
>>>>>>>
>>>>>>> Maybe I have an idea
>>>>>>> But could you please tell me first what is the version of the
>>>>>>> glite-ce-* rpms ?
>>>>>>>
>>>>>>> Cheers, Massimo
>>>>>>>
>>>>>>>
>>>>>>> On Wed, 7 Sep 2011, Chris Brew wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> After replacing our LCG-CEs with CreamCEs, we keep having problems where
>>>>>>>> our CreamCEs get blacklisted by the WMSs that the VO SAM tests use. It has
>>>>>>>> happened to all the CEs at some point or other - they seem to run fine for
>>>>>>>> a few days to a week then hit this.
>>>>>>>>
>>>>>>>> It is not transitory: the SAM tests start failing and continue to fail
>>>>>>>> until we intervene. Restarting the gLite services does not appear to fix
>>>>>>>> the bug but rebooting does.
>>>>>>>>
>>>>>>>> Other WMSs including the ones used by the NGI_UK ops SAM tests continue to
>>>>>>>> work fine with the CreamCEs.
>>>>>>>>
>>>>>>>> It does not appear to be load related as the boxes seem to have plenty of
>>>>>>>> free memory and do not appear to be under heavy load when it happens.
>>>>>>>>
>>>>>>>> We've increased the innodb_buffer_pool_size, and reduced the purge times
>>>>>>>> for both the Cream and Blah components which does not appear to have fixed
>>>>>>>> the issue.
>>>>>>>>
>>>>>>>> We're using the UMD release with Argus authentication.
>>>>>>>>
>>>>>>>> Any ideas what else I should be trying?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Chris.
>>>>>>>>
>>>>>>>
>>>>>>> \|||/
>>>>>>> -----------0oo----( o o )----oo0-------------------
>>>>>>> (_)
>>>>>>> INFN Sezione di Padova
>>>>>>> Via Marzolo, 8
>>>>>>> 35131 Padova - Italy E-mail: massimo.sgaravatto [at] pd.infn.it
>>>>>>> Tel: ++39 0499677360 Skype: massimo.sgaravatto
>>>>>>> Fax: ++39 0498275952
>>>>>>
>>>>>
>>>>> \|||/
>>>>> -----------0oo----( o o )----oo0-------------------
>>>>> (_)
>>>>> INFN Sezione di Padova
>>>>> Via Marzolo, 8
>>>>> 35131 Padova - Italy E-mail: massimo.sgaravatto [at] pd.infn.it
>>>>> Tel: ++39 0499677360 Skype: massimo.sgaravatto
>>>>> Fax: ++39 0498275952
>>>>
>>>>
>>>> --
>>>> Tel. +49 89 289 14152
>>
>
> <cream-log-extracts.txt>