Hi Jeff,
I cannot take the credit; we've discussed this quite a bit in GridPP.
There's at least one ticket in the system about it:
https://ggus.eu/tech/ticket_show.php?ticket=72506
I've recently tried setting 'delegation_purge_rate="-1"'[1] on one of our
CreamCEs to see if that helps. Initial indications are that it doesn't.
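
For reference, the knob lives in the CREAM configuration, /etc/glite-ce-cream/cream-config.xml on our CEs (file name and exact syntax from memory, so do check the page in [1] against your install):

    <!-- how often the delegation/proxy purger runs;
         -1 is supposed to disable automatic purging -->
    delegation_purge_rate="-1"

I believe tomcat needs a restart to pick the change up.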
Yours,
Chris.
[1]
http://grid.pd.infn.it/cream/field.php?n=Main.HowToConfigureTheProxyPurger
> -----Original Message-----
> From: LHC Computer Grid - Rollout [mailto:[log in to unmask]]
> On Behalf Of Jeff Templon
> Sent: 22 February 2012 11:18
> To: [log in to unmask]
> Subject: Re: [LCG-ROLLOUT] Torque/CREAM race condition at job completion?
>
> Hi,
>
> On Feb 22, 2012, at 11:29, Chris Brew wrote:
>
> > Hi,
> >
> > I believe this is what happens when a job's proxy expires while the
> > job is in the queue.
> >
> > AFAICT, the CREAM proxy cleanup deletes the expired proxy from the
> > CREAM proxy cache but does not cancel the job. So the job happily
> > sits in the queue until torque tries to run it, at which point it
> > tries to stage in the proxy and of course fails.
> >
> > I think it then puts the job into waiting (W) state for a while
> > before requeuing it, and it then just oscillates between Q and W
> > until you delete it.
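> >
> > If you want to spot them, something like this should list every job
> > currently sitting in W (an untested sketch; in plain qstat output
> > the state is the fifth column):
> >
> >     qstat | awk '$5 == "W" {print $1}'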
> >
>
> We see exactly this: oscillations between state W and state Q,
> continuing forever until the job is deleted. Thanks for the
> explanation. Chris, do you want to submit the GGUS ticket on this one
> on behalf of the rest of us?
>
> JT
>
> > It's annoying but you're not losing jobs that would have run.
> >
> > Yours,
> > Chris.
> >
> >> -----Original Message-----
> >> From: LHC Computer Grid - Rollout [mailto:LCG-[log in to unmask]]
> >> On Behalf Of Jeff Templon
> >> Sent: 22 February 2012 09:50
> >> To: [log in to unmask]
> >> Subject: Re: [LCG-ROLLOUT] Torque/CREAM race condition at job completion?
> >>
> >> Hi,
> >>
> >> We see these messages here too at Nikhef ("giving up after 4
> >> attempts") and I cannot link them to jobs that have actually
> >> executed, which is quite strange.
> >>
> >> This would indicate that somehow torque does not know that the job
> >> has landed on a worker node, because there are no "start" or "end"
> >> records that reference the WN in question in the torque server
> >> logs. However, the job has somehow landed on a worker node, because
> >> the error message 'unable to copy file' is coming from the mom on a
> >> worker node.
> >>
> >> This may be related to something we see happen here from time to
> >> time. A job is in state "Q", but somehow it has a worker node
> >> assigned to it. We usually delete these jobs when we see them, as
> >> there is no way we've found to 'unassign' the worker node.
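> >>
> >> A rough way to catch those (an untested sketch that relies on the
> >> usual qstat -f field names) is to look for queued jobs that already
> >> have an exec_host set:
> >>
> >>     qstat -f | awk '/^Job Id:/ {id=$3; st=""}
> >>                     /job_state = / {st=$NF}
> >>                     /exec_host = / && st=="Q" {print id}'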
> >>
> >>
> >> JT
> >>
> >> On Feb 22, 2012, at 09:20, Gila Arrondo Miguel Angel wrote:
> >>
> >>> Hi Leslie,
> >>>
> >>> With a very similar setup (Moab/Torque on one host and CREAM-CE on
> >>> another) we've seen an error somewhat close to what you describe
> >>> here. In our /var/log/messages we find tons of entries like this:
> >>>
> >>> Feb 22 09:10:24 wn202 pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB err_cre02_780422084_StandardError [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardError' failed with status=1, giving up after 4 attempts
> >>>
> >>> Feb 22 09:10:24 wn202 pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy file err_cre02_780422084_StandardError to [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardError
> >>>
> >>> Feb 22 09:10:29 wn202 pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB out_cre02_780422084_StandardOutput [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardOutput' failed with status=1, giving up after 4 attempts
> >>>
> >>> Have you checked whether these failed transfers are from cancelled
> >>> jobs? In our experience, it was always the case.
> >>>
> >>> We've also looked at ways to mitigate this annoying verbosity, but
> >>> no luck so far. The only option that we can think of is to stop
> >>> using scp for copies and move the sandbox to a shared area with
> >>> the WNs, so you can use a regular cp (the $usecp directive, see
> >>> the sketch below) and these errors are hidden. But, of course,
> >>> this approach has its own disadvantages as well.
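> >>>
> >>> For the record, $usecp lines go in the mom config on each WN,
> >>> usually /var/spool/torque/mom_priv/config (the hostname and paths
> >>> below are made up for illustration):
> >>>
> >>>     # map the CE's sandbox path to a shared mount so pbs_mom
> >>>     # does a local cp instead of scp
> >>>     $usecp cre02.example.ch:/cream_localsandbox /shared/cream_localsandbox
> >>>
> >>> pbs_mom needs a restart to pick it up.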
> >>>
> >>> Does anyone else have a better idea?
> >>>
> >>> Cheers,
> >>> Miguel
> >>>
> >>>
> >>>
> >>> --
> >>> Miguel Gila
> >>> CSCS Swiss National Supercomputing Centre
> >>> HPC Solutions
> >>> Via Cantonale, Galleria 2 | CH-6928 Manno | Switzerland
> >>> Miguel.gila [at] cscs.ch
> >>>
> >>> From: Leslie Groer <[log in to unmask]>
> >>> Reply-To: LHC Computer Grid - Rollout <[log in to unmask]>
> >>> Date: Thu, 16 Feb 2012 14:43:56 -0500
> >>> To: <[log in to unmask]>
> >>> Subject: [LCG-ROLLOUT] Torque/CREAM race condition at job completion?
> >>>
> >>> scp: /opt/glite/var/cream_sandbox/atlasprd/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_production_Capability_NULL_prdatl15/61/CREAM619112035/StandardOutput: No such file or directory