Not sure; I only recently changed it on one of the CreamCEs, and having just
looked, we do have jobs from that CreamCE in W state.

I haven't checked if these were submitted before I made the change or
whether they failed because the proxy was removed.

I need to give the change a little longer; then, if there are still jobs
appearing from that CE in W state, I'll do some more analysis.
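
In case it's useful to anyone else, this is roughly how I'm checking (a
rough sketch, assuming plain Torque qstat output with the state in column 5
and the usual qstat -f field names):

  # list job ids currently in W state (state is column 5 of plain qstat output)
  for j in $(qstat | awk '$5 == "W" {print $1}'); do
      # Job_Owner shows user@<submit host>, i.e. which CreamCE it came from;
      # qtime shows when it was queued (i.e. before or after the change)
      qstat -f "$j" | egrep 'Job_Owner|qtime'
  done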

Yours,
Chris.

> -----Original Message-----
> From: Massimo Sgaravatto [mailto:[log in to unmask]]
> Sent: 22 February 2012 11:41
> To: LHC Computer Grid - Rollout
> Cc: Brew, Chris (STFC,RAL,PPD)
> Subject: Re: [LCG-ROLLOUT] Torque/CREAM race condition at job
> completion?
> 
> On 02/22/2012 12:29 PM, Chris Brew wrote:
> > Hi Jeff,
> >
> > I cannot take the credit; we've discussed this quite a bit in GridPP.
> >
> > There's at least one ticket in the system about it:
> >
> > https://ggus.eu/tech/ticket_show.php?ticket=72506
> >
> > I've recently tried setting 'delegation_purge_rate="-1"'[1] on one of our
> > CreamCEs to see if that helps. Initial indications are that it doesn't.
> 
> 
> You mean that the proxy is cleaned even when delegation_purge_rate is
> -1, or you mean that the proxy is not cleaned but you still see the
> Q->W->Q-> ... problem?
> 
> >
> > Yours,
> > Chris.
> >
> > [1] http://grid.pd.infn.it/cream/field.php?n=Main.HowToConfigureTheProxyPurger
> >
> >> -----Original Message-----
> >> From: LHC Computer Grid - Rollout [mailto:LCG-[log in to unmask]]
> >> On Behalf Of Jeff Templon
> >> Sent: 22 February 2012 11:18
> >> To: [log in to unmask]
> >> Subject: Re: [LCG-ROLLOUT] Torque/CREAM race condition at job
> >> completion?
> >>
> >> Hi,
> >>
> >> On Feb 22, 2012, at 11:29 , Chris Brew wrote:
> >>
> >>> Hi,
> >>>
> >>> I believe this is what happens when a job's proxy expires while the
> >>> job is in the queue.
> >>>
> >>> AFAICT, the Cream proxy cleanup deletes the expired proxy from the
> >>> cream proxy cache but does not cancel the job. So the job happily sits
> >>> in the queue until torque tries to run it, at which point it tries
> >>> to stage in the proxy and of course fails.
> >>>
> >>> I think it then puts the job into waiting (W) state for a while before
> >>> requeuing it, and it then just oscillates between Q and W until you
> >>> delete it.
> >>>
> >>
> >> We see exactly this: oscillations between state W and state Q,
> >> continuing forever until the job is deleted.  Thanks for the
> >> explanation.  Chris, do you want to submit the GGUS ticket on this one on
> >> behalf of the rest of us?
> >>
> >> 				JT
> >>
> >>> It's annoying but you're not losing jobs that would have run.
> >>>
> >>> Yours,
> >>> Chris.
> >>>
> >>>> -----Original Message-----
> >>>> From: LHC Computer Grid - Rollout [mailto:LCG-[log in to unmask]]
> >>>> On Behalf Of Jeff Templon
> >>>> Sent: 22 February 2012 09:50
> >>>> To: [log in to unmask]
> >>>> Subject: Re: [LCG-ROLLOUT] Torque/CREAM race condition at job
> >>>> completion?
> >>>>
> >>>> Hi,
> >>>>
> >>>> We see these messages here too at Nikhef (giving up after 4 attempts)
> >>>> and I cannot link them to jobs that have actually executed, which is
> >>>> quite strange.
> >>>>
> >>>> This would indicate that somehow torque does not know that the job
> >>>> has landed on a worker node, because there are no "start" or "end"
> >>>> records that reference the WN in question in the torque server logs.
> >>>> However, the job has somehow landed on a worker node, because the
> >>>> error message 'unable to copy file' is coming from the mom on a
> >>>> worker node.
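> >>>>
> >>>> (For reference, this is roughly what we grep for -- a sketch assuming the
> >>>> default Torque layout under /var/spool/torque; the node name is just the
> >>>> one from Miguel's example below:)
> >>>>
> >>>>   # start (S) / end (E) accounting records that mention the node
> >>>>   grep 'exec_host=wn202' /var/spool/torque/server_priv/accounting/201202* | egrep ';S;|;E;'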
> >>>>
> >>>> This may be related to something we see here happen from time to time.
> >>>> A job is in state "Q", but somehow it has a worker node assigned to it.
> >>>> We usually delete these jobs when we see them, as there is no way
> >>>> we've found to 'unassign' the worker node.
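> >>>>
> >>>> (A crude way to spot them, if anyone wants to check their own queue --
> >>>> again assuming plain qstat output with the state in column 5:)
> >>>>
> >>>>   # queued jobs that already have a node assigned
> >>>>   for j in $(qstat | awk '$5 == "Q" {print $1}'); do
> >>>>       qstat -f "$j" | grep -q exec_host && echo "$j is Q but has exec_host set"
> >>>>   done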
> >>>>
> >>>>
> >>>> JT
> >>>>
> >>>> On Feb 22, 2012, at 09:20 , Gila Arrondo Miguel Angel wrote:
> >>>>
> >>>>> Hi Leslie,
> >>>>>
> >>>>> With a very similar setup (Moab/Torque on one host and CREAM-CE on
> >>>>> another) we've seen an error quite similar to what you describe here.
> >>>>> In our /var/log/messages we find tons of entries like this:
> >>>>>
> >>>>> Feb 22 09:10:24 wn202 pbs_mom: LOG_ERROR::sys_copy, command
> >>>>> '/usr/bin/scp -rpB err_cre02_780422084_StandardError
> >>>>> [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardError'
> >>>>> failed with status=1, giving up after 4 attempts
> >>>>>
> >>>>> Feb 22 09:10:24 wn202 pbs_mom: LOG_ERROR::req_cpyfile, Unable to
> >>>>> copy file err_cre02_780422084_StandardError to
> >>>>> [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardError
> >>>>>
> >>>>> Feb 22 09:10:29 wn202 pbs_mom: LOG_ERROR::sys_copy, command
> >>>>> '/usr/bin/scp -rpB out_cre02_780422084_StandardOutput
> >>>>> [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardOutput'
> >>>>> failed with status=1, giving up after 4 attempts
> >>>>>
> >>>>> Have you checked whether these failed transfers are from cancelled
> >>>>> jobs? In our experience, it was always the case.
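> >>>>>
> >>>>> (One quick way to check, while the server logs are still around, is
> >>>>> tracejob on the local Torque id -- the id below is only a placeholder,
> >>>>> not the CREAM number embedded in the sandbox path:)
> >>>>>
> >>>>>   # on the torque server: queue/run/delete events for the job, last 7 days of logs
> >>>>>   tracejob -n 7 1234567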
> >>>>>
> >>>>> We've also looked at ways to mitigate this annoying verbosity, but no
> >>>>> luck so far. The only option we can think of is to stop using
> >>>>> scp for copies and move the sandbox to a shared area with the WNs, so
> >>>>> you can use a regular cp (the $usecp directive) and these errors are hidden.
> >>>>> But, of course, this approach has its own disadvantages as well.
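> >>>>>
> >>>>> (Roughly what the mom config entry would look like if we went that way --
> >>>>> the CE hostname and mount point here are invented for the example:)
> >>>>>
> >>>>>   # /var/spool/torque/mom_priv/config on each WN: map the CE's sandbox
> >>>>>   # path onto the shared mount so pbs_mom does a local cp instead of scp
> >>>>>   $usecp cream02.example.ch:/cream_localsandbox /shared/cream_localsandbox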
> >>>>>
> >>>>> Does anyone else have a better idea?
> >>>>>
> >>>>> Cheers,
> >>>>> Miguel
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Miguel Gila
> >>>>> CSCS Swiss National Supercomputing Centre HPC Solutions
> >>>>> Via Cantonale, Galleria 2 | CH-6928 Manno | Switzerland
> >>>>> Miguel.gila [at] cscs.ch
> >>>>>
> >>>>> From: Leslie Groer
> >>>>> <[log in to unmask]<mailto:[log in to unmask]>>
> >>>>> Reply-To: LHC Computer Grid - Rollout
> >>>>> <[log in to unmask]<mailto:[log in to unmask]>>
> >>>>> Date: Thu, 16 Feb 2012 14:43:56 -0500
> >>>>> To: <[log in to unmask]<mailto:LCG-[log in to unmask]>>
> >>>>> Subject: [LCG-ROLLOUT] Torque/CREAM race condition at job completion?
> >>>>>
> >>>>> scp: /opt/glite/var/cream_sandbox/atlasprd/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_production_Capability_NULL_prdatl15/61/CREAM619112035/StandardOutput: No such file or directory
>