On 02/22/2012 12:29 PM, Chris Brew wrote:
> Hi Jeff,
>
> I can't take the credit; we've discussed this quite a bit in GridPP.
>
> There's at least one ticket in the system about it:
>
> https://ggus.eu/tech/ticket_show.php?ticket=72506
>
> I've recently tried setting 'delegation_purge_rate="-1"'[1] on one of our
> CreamCEs to see if that helps. Initial indications are that it doesn't.
Do you mean that the proxy is cleaned even when delegation_purge_rate
is -1, or that the proxy is not cleaned but you still see the
Q -> W -> Q -> ... problem?
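
In case it helps to compare notes, this is how I'd check and change the
setting (a rough sketch; it assumes an EMI-style install where the
parameter lives in /etc/glite-ce-cream/cream-config.xml, and that CREAM
runs under tomcat5 as on our SL5 nodes):

  # show the current purger setting
  grep delegation_purge_rate /etc/glite-ce-cream/cream-config.xml

  # set it to -1 (disable purging) and restart CREAM to pick it up
  # (path and service name are assumptions -- adjust for your install)
  sed -i 's|\(name="delegation_purge_rate" value="\)[^"]*|\1-1|' \
      /etc/glite-ce-cream/cream-config.xml
  service tomcat5 restart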
>
> Yours,
> Chris.
>
> [1]
> http://grid.pd.infn.it/cream/field.php?n=Main.HowToConfigureTheProxyPurger
>
>> -----Original Message-----
>> From: LHC Computer Grid - Rollout [mailto:[log in to unmask]]
>> On Behalf Of Jeff Templon
>> Sent: 22 February 2012 11:18
>> To: [log in to unmask]
>> Subject: Re: [LCG-ROLLOUT] Torque/CREAM race condition at job
>> completion?
>>
>> Hi,
>>
>> On Feb 22, 2012, at 11:29, Chris Brew wrote:
>>
>>> Hi,
>>>
>>> I believe this is what happens when a job's proxy expires while the
>>> job is in the queue.
>>>
>>> AFAICT, the CREAM proxy cleanup deletes the expired proxy from the
>>> CREAM proxy cache but does not cancel the job. So the job happily
>>> sits in the queue until Torque tries to run it, at which point it
>>> tries to stage in the proxy and of course fails.
>>>
>>> I think Torque then puts the job into the waiting (W) state for a
>>> while before requeuing it, and it then just oscillates between Q
>>> and W until you delete it.
>>>
>>
>> We see exactly this: oscillations between state W and state Q,
>> continuing forever until the job is deleted. Thanks for the
>> explanation. Chris, do you want to submit the GGUS ticket on this
>> one on behalf of the rest of us?
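>>
>> For the record, how we spot them (a rough sketch; it assumes the
>> stock Torque qstat layout, with the job state in column 5):
>>
>>   # job ids currently sitting in state W
>>   qstat | awk '$5 == "W" {print $1}'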
>>
>> JT
>>
>>> It's annoying, but you're not losing jobs that would have run.
>>>
>>> Yours,
>>> Chris.
>>>
>>>> -----Original Message-----
>>>> From: LHC Computer Grid - Rollout [mailto:LCG-[log in to unmask]]
>>>> On Behalf Of Jeff Templon
>>>> Sent: 22 February 2012 09:50
>>>> To: [log in to unmask]
>>>> Subject: Re: [LCG-ROLLOUT] Torque/CREAM race condition at job
>>>> completion?
>>>>
>>>> Hi,
>>>>
>>>> We see these messages here too at Nikhef ("giving up after 4
>>>> attempts") and I cannot link them to jobs that have actually
>>>> executed, which is quite strange.
>>>>
>>>> This would indicate that somehow Torque does not know that the job
>>>> has landed on a worker node, because there are no "start" or "end"
>>>> records referencing the WN in question in the Torque server logs.
>>>> However, the job has somehow landed on a worker node, because the
>>>> error message 'unable to copy file' is coming from the mom on a
>>>> worker node.
>>>>
>>>> This may be related to something we see happen here from time to
>>>> time: a job is in state "Q", but somehow it has a worker node
>>>> assigned to it. We usually delete these jobs when we see them, as
>>>> we have found no way to 'unassign' the worker node.
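>>>>
>>>> (How we find them, as a rough sketch -- it assumes such jobs carry
>>>> an exec_host attribute in 'qstat -f' output, as they do here:
>>>>
>>>>   # queued jobs that nevertheless have a worker node assigned
>>>>   for j in $(qstat | awk '$5 == "Q" {print $1}'); do
>>>>       qstat -f "$j" | grep -q exec_host && echo "$j"
>>>>   done
>>>>
>>>> followed by a plain 'qdel <jobid>' on each hit.)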
>>>>
>>>>
>>>> JT
>>>>
>>>> On Feb 22, 2012, at 09:20, Gila Arrondo Miguel Angel wrote:
>>>>
>>>>> Hi Leslie,
>>>>>
>>>>> With a very similar setup (Moab/Torque on one host and CREAM-CE on
>>>>> another) we've seen an error rather close to what you describe
>>>>> here. In our /var/log/messages we find tons of entries like this:
>>>>>
>>>>> Feb 22 09:10:24 wn202 pbs_mom: LOG_ERROR::sys_copy, command
>>>>> '/usr/bin/scp -rpB err_cre02_780422084_StandardError
>>>>> [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardError'
>>>>> failed with status=1, giving up after 4 attempts
>>>>>
>>>>> Feb 22 09:10:24 wn202 pbs_mom: LOG_ERROR::req_cpyfile, Unable to
>>>>> copy file err_cre02_780422084_StandardError to
>>>>> [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardError
>>>>>
>>>>> Feb 22 09:10:29 wn202 pbs_mom: LOG_ERROR::sys_copy, command
>>>>> '/usr/bin/scp -rpB out_cre02_780422084_StandardOutput
>>>>> [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardOutput'
>>>>> failed with status=1, giving up after 4 attempts
>>>>>
>>>>> Have you checked whether these failed transfers are from cancelled
>>>>> jobs? In our experience, that was always the case.
>>>>>
>>>>> We've also looked at ways to mitigate this annoying verbosity, but
>>>>> no luck so far. The only option we can think of is to stop using
>>>>> scp for the copies and move the sandbox to a shared area on the
>>>>> WNs, so that a regular cp is used (via the $usecp directive) and
>>>>> these errors are hidden. But, of course, this approach has its own
>>>>> disadvantages as well.
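>>>>>
>>>>> (A sketch of the mom-side change, assuming the sandbox area were
>>>>> exported to the WNs at the same path; the CE hostname below is a
>>>>> placeholder:
>>>>>
>>>>>   # /var/spool/torque/mom_priv/config on each WN
>>>>>   $usecp cream-ce.example.org:/cream_localsandbox /cream_localsandbox
>>>>>
>>>>> pbs_mom then turns matching scp destinations into a local cp, and
>>>>> it needs a restart to pick the directive up.)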
>>>>>
>>>>> Does anyone else have a better idea?
>>>>>
>>>>> Cheers,
>>>>> Miguel
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Miguel Gila
>>>>> CSCS Swiss National Supercomputing Centre | HPC Solutions
>>>>> Via Cantonale, Galleria 2 | CH-6928 Manno | Switzerland
>>>>> Miguel.gila [at] cscs.ch
>>>>>
>>>>> From: Leslie Groer <[log in to unmask]>
>>>>> Reply-To: LHC Computer Grid - Rollout <[log in to unmask]>
>>>>> Date: Thu, 16 Feb 2012 14:43:56 -0500
>>>>> To: <[log in to unmask]>
>>>>> Subject: [LCG-ROLLOUT] Torque/CREAM race condition at job completion?
>>>>>
>>>>> scp: /opt/glite/var/cream_sandbox/atlasprd/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_production_Capability_NULL_prdatl15/61/CREAM619112035/StandardOutput: No such file or directory