Hi,

The cause of the state flapping (Q->W->Q->W->Q) is, I think, the following:

* the WN gets the job with a stagein definition (a set of files to be copied to the WN)
* the WN tries to copy the files, so the job turns into the W state
* the WN makes several copy attempts before it gives up and returns the job to the 
Q state (this may take a long time)
* maui/moab tries to start the queued job again, so the situation repeats 
until the job gets held/cancelled by the scheduler. Some maui configurations put 
a job into a hold state after several unsuccessful job start attempts. 
Such a hold can be released automatically by maui after a certain period 
defined by DEFERTIME, if I recall correctly (see the sketch after this list)
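
A minimal maui.cfg sketch of the behaviour described above; the parameter names are
as I remember them from the Maui admin guide and the values are only illustrative, so
check the documentation for your version before relying on them:

# maui.cfg fragment (illustrative values)
DEFERSTARTCOUNT  3         # start failures before the job is deferred
DEFERTIME        1:00:00   # how long a deferred job waits before being retried
DEFERCOUNT       24        # defers before the job is put on batch hold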

The fact is that the job blocks resources and it must be deleted by hand.
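
For what it's worth, a rough sketch of how such stuck jobs can be found and removed
by hand on the torque server; the awk column assumes the default qstat output format,
and <jobid> is just a placeholder:

# list job ids currently sitting in the W (waiting) state
qstat | awk '$5 == "W" {print $1}'
# once a job is confirmed to be one of the Q<->W oscillators, remove it
qdel <jobid>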


> Not sure, I only recently changed it on one of the CreamCEs and having just
> looked we have got jobs from that CreamCE in W state.
>
> I haven't checked if these were submitted before I made the change or
> whether they failed because the proxy was removed.
>
> I need to give the change a little longer, then if there are still jobs
> appearing from the CE in W state, do some more analysis.
>
> Yours,
> Chris.
>
>> -----Original Message-----
>> From: Massimo Sgaravatto [mailto:[log in to unmask]]
>> Sent: 22 February 2012 11:41
>> To: LHC Computer Grid - Rollout
>> Cc: Brew, Chris (STFC,RAL,PPD)
>> Subject: Re: [LCG-ROLLOUT] Torque/CREAM race condition at job
>> completion?
>>
>> On 02/22/2012 12:29 PM, Chris Brew wrote:
>>> Hi Jeff,
>>>
>>> I cannot take the credit, we've discussed this quite a bit in GridPP.
>>>
>>> There's at least one ticket in the system about it:
>>>
>>> https://ggus.eu/tech/ticket_show.php?ticket=72506
>>>
>>> I've recently tried setting 'delegation_purge_rate="-1"'[1] on one of
>>> our CreamCEs to see if that helps. Initial indications are that it
>>> doesn't.
>>
>>
>> You mean that the proxy is cleaned even when delegation_purge_rate is
>> -1, or you mean that the proxy is not cleaned but you still see the
>> Q->W->Q->  ... problem ?
>>
>>>
>>> Yours,
>>> Chris.
>>>
>>> [1] http://grid.pd.infn.it/cream/field.php?n=Main.HowToConfigureTheProxyPurger
>>>
>>>> -----Original Message-----
>>>> From: LHC Computer Grid - Rollout [mailto:LCG-[log in to unmask]]
>>>> On Behalf Of Jeff Templon
>>>> Sent: 22 February 2012 11:18
>>>> To: [log in to unmask]
>>>> Subject: Re: [LCG-ROLLOUT] Torque/CREAM race condition at job
>>>> completion?
>>>>
>>>> Hi,
>>>>
>>>> On Feb 22, 2012, at 11:29 , Chris Brew wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I believe this is what happens when a job's proxy expires while the
>>>>> job is in the queue.
>>>>>
>>>>> AFAICT, the Cream Proxy cleanup deletes the expired proxy from the
>>>>> cream proxy cache but does not cancel the job. So the job happily
>>>>> sits in the queue until torque tries to run it, at which point it
>>>>> tries to stage in the proxy and of course fails.
>>>>>
>>>>> I think then it puts the job into waiting (W) state for a while
>>>>> before requeuing it and it then just oscillates between Q and W
>>>>> until you delete it.
>>>>>
>>>>
>>>> We see exactly this: oscillations between state W and state Q,
>>>> continuing forever until the job is deleted.  Thanks for the
>>>> explanation.  Chris, you want to submit the GGUS ticket on this one
>>>> on behalf of the rest of us?
>>>>
>>>> 				JT
>>>>
>>>>> It's annoying but you're not losing jobs that would have run.
>>>>>
>>>>> Yours,
>>>>> Chris.
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: LHC Computer Grid - Rollout [mailto:LCG-[log in to unmask]]
>>>>>> On Behalf Of Jeff Templon
>>>>>> Sent: 22 February 2012 09:50
>>>>>> To: [log in to unmask]
>>>>>> Subject: Re: [LCG-ROLLOUT] Torque/CREAM race condition at job
>>>>>> completion?
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> we see these messages here too at Nikhef (giving up after 4
>>>>>> attempts) and I cannot link them to jobs that have actually
>>>>>> executed, which is quite strange.
>>>>>>
>>>>>> this would indicate that somehow torque does not know that the job
>>>>>> has landed on a worker node, because there are no "start" or "end"
>>>>>> records that reference the WN in question in the torque server
>>>>>> logs.  however, the job has somehow landed on a worker node,
>>>>>> because the error message 'unable to copy file' is coming from the
>>>>>> mom on a worker node.
>>>>>>
>>>>>> This may be related to something we see here happen from time to
>>>>>> time.  A job is in state "Q", but somehow it has a worker node
>>>>>> assigned to it.  We usually delete these jobs when we see them, as
>>>>>> there is no way we've found to 'unassign' the worker node.
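
One way to spot the jobs described above might be to check a queued job for an
exec_host attribute in qstat's full output, assuming that attribute is what "worker
node assigned" corresponds to; <jobid> is a placeholder:

# a job in state Q would normally show no exec_host line
qstat -f <jobid> | egrep 'job_state|exec_host'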
>>>>>>
>>>>>>
>>>>>> JT
>>>>>>
>>>>>> On Feb 22, 2012, at 09:20 , Gila Arrondo Miguel Angel wrote:
>>>>>>
>>>>>>> Hi Leslie,
>>>>>>>
>>>>>>> With a very similar setup (Moab/Torque on 1 host and CREAM-CE on
>>>>>>> another) we've seen an error somewhat close to what you describe
>>>>>>> here. In our /var/log/messages we find tons of entries like this:
>>>>>>>
>>>>>>> Feb 22 09:10:24 wn202 pbs_mom: LOG_ERROR::sys_copy, command
>>>>>>> '/usr/bin/scp -rpB err_cre02_780422084_StandardError
>>>>>>> [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardError'
>>>>>>> failed with status=1, giving up after 4 attempts
>>>>>>>
>>>>>>> Feb 22 09:10:24 wn202 pbs_mom: LOG_ERROR::req_cpyfile, Unable to
>>>>>>> copy file err_cre02_780422084_StandardError to
>>>>>>> [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardError
>>>>>>>
>>>>>>> Feb 22 09:10:29 wn202 pbs_mom: LOG_ERROR::sys_copy, command
>>>>>>> '/usr/bin/scp -rpB out_cre02_780422084_StandardOutput
>>>>>>> [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardOutput'
>>>>>>> failed with status=1, giving up after 4 attempts
>>>>>>>
>>>>>>> Have you checked whether these failed transfers are from
>>>>>>> cancelled jobs? In our experience, it was always the case.
>>>>>>>
>>>>>>> We've also looked at ways to mitigate this annoying verbosity, but
>>>>>>> no luck so far. The only option that we can think of is to stop
>>>>>>> using scp for copies and move the sandbox to a shared area with the
>>>>>>> WNs, so you use regular cp ($usecp directive) and these errors are
>>>>>>> hidden. But, of course, this approach has its own disadvantages as
>>>>>>> well.
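
For reference, the $usecp approach mentioned above amounts to one directive in each
worker node's mom_priv/config, mapping the CE's sandbox path onto a locally mounted
copy; the hostname and paths below are purely illustrative and assume the sandbox
area is shared at the same path on the WNs:

# TORQUE mom_priv/config on each WN (example values)
$usecp cream-ce.example.org:/cream_localsandbox /cream_localsandbox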
>>>>>>>
>>>>>>> Does anyone else have a better idea?
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Miguel
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Miguel Gila
>>>>>>> CSCS Swiss National Supercomputing Centre HPC Solutions
>>>>>>> Via Cantonale, Galleria 2 | CH-6928 Manno | Switzerland
>>>>>>> Miguel.gila [at] cscs.ch
>>>>>>>
>>>>>>> From: Leslie Groer
>>>>>>> <[log in to unmask]<mailto:[log in to unmask]>>
>>>>>>> Reply-To: LHC Computer Grid - Rollout
>>>>>>> <[log in to unmask]<mailto:[log in to unmask]>>
>>>>>>> Date: Thu, 16 Feb 2012 14:43:56 -0500
>>>>>>> To: <[log in to unmask]<mailto:LCG-[log in to unmask]>>
>>>>>>> Subject: [LCG-ROLLOUT] Torque/CREAM race condition at job completion?
>>>>>>>
>>>>>>> scp: /opt/glite/var/cream_sandbox/atlasprd/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_production_Capability_NULL_prdatl15/61/CREAM619112035/StandardOutput: No such file or directory
>>
>