Not sure. I only recently changed it on one of the CreamCEs, and having
just looked we have got jobs from that CreamCE in W state. I haven't
checked whether these were submitted before I made the change or whether
they failed because the proxy was removed. I need to give the change a
little longer; then, if there are still jobs appearing from the CE in W
state, I'll do some more analysis.

Yours,
Chris.
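P.S. For anyone who wants to try the same change: as far as I read the
wiki page cited as [1] further down the thread, delegation_purge_rate is
set on the BLAH command executor in cream-config.xml on the CE. A sketch
only; the exact file path, and whether the setting is an attribute or a
parameter element, may vary with the CREAM version, so check your own
config against the wiki page:

    <!-- /etc/glite-ce-cream/cream-config.xml (illustrative excerpt) -->
    <!-- delegation_purge_rate is the proxy purger period (in minutes,
         if I read the wiki correctly); a negative value is supposed to
         disable purging altogether. -->
    <commandexecutor name="BLAH"
                     delegation_purge_rate="-1">
      <!-- other executor settings left unchanged -->
    </commandexecutor>

Tomcat needs a restart for CREAM to pick up the change, and bear in mind
that a YAIM reconfiguration may rewrite the file.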
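The W-state check itself is nothing sophisticated, in case anyone wants
to run the same thing. A minimal sketch, assuming the job state is the
tenth column of 'qstat -a' output (column positions vary between Torque
versions, so check against yours) and that tracejob is run on the Torque
server host:

    #!/bin/bash
    # List jobs currently in W (waiting) state and show their recent
    # history from the server logs, to spot ones bouncing Q -> W -> Q.
    for job in $(qstat -a | awk '$10 == "W" {print $1}'); do
        echo "== ${job} =="
        # Strip the server suffix (1234.server -> 1234); -n 3 means
        # look three days back in the logs.
        tracejob -n 3 "${job%%.*}" | tail -n 5
    done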
> -----Original Message-----
> From: Massimo Sgaravatto [mailto:[log in to unmask]]
> Sent: 22 February 2012 11:41
> To: LHC Computer Grid - Rollout
> Cc: Brew, Chris (STFC,RAL,PPD)
> Subject: Re: [LCG-ROLLOUT] Torque/CREAM race condition at job
> completion?
>
> On 02/22/2012 12:29 PM, Chris Brew wrote:
> > Hi Jeff,
> >
> > I cannot take the credit; we've discussed this quite a bit in GridPP.
> >
> > There's at least one ticket in the system about it:
> >
> > https://ggus.eu/tech/ticket_show.php?ticket=72506
> >
> > I've recently tried setting 'delegation_purge_rate="-1"' [1] on one
> > of our CreamCEs to see if that helps. Initial indications are that
> > it doesn't.
>
> Do you mean that the proxy is cleaned even when delegation_purge_rate
> is -1, or that the proxy is not cleaned but you still see the
> Q->W->Q->... problem?
>
> > Yours,
> > Chris.
> >
> > [1] http://grid.pd.infn.it/cream/field.php?n=Main.HowToConfigureTheProxyPurger
> >
> >> -----Original Message-----
> >> From: LHC Computer Grid - Rollout [mailto:LCG-[log in to unmask]]
> >> On Behalf Of Jeff Templon
> >> Sent: 22 February 2012 11:18
> >> To: [log in to unmask]
> >> Subject: Re: [LCG-ROLLOUT] Torque/CREAM race condition at job
> >> completion?
> >>
> >> Hi,
> >>
> >> On Feb 22, 2012, at 11:29 , Chris Brew wrote:
> >>
> >>> Hi,
> >>>
> >>> I believe this is what happens when a job's proxy expires while
> >>> the job is in the queue.
> >>>
> >>> AFAICT, the CREAM proxy cleanup deletes the expired proxy from the
> >>> CREAM proxy cache but does not cancel the job. So the job happily
> >>> sits in the queue until Torque tries to run it, at which point it
> >>> tries to stage in the proxy and of course fails.
> >>>
> >>> I think it then puts the job into waiting (W) state for a while
> >>> before requeuing it, and it then just oscillates between Q and W
> >>> until you delete it.
> >>>
> >> We see exactly this: oscillations between state W and state Q,
> >> continuing forever until the job is deleted. Thanks for the
> >> explanation. Chris, do you want to submit the GGUS ticket on this
> >> one on behalf of the rest of us?
> >>
> >> JT
> >>
> >>> It's annoying, but you're not losing jobs that would have run.
> >>>
> >>> Yours,
> >>> Chris.
> >>>
> >>>> -----Original Message-----
> >>>> From: LHC Computer Grid - Rollout [mailto:LCG-[log in to unmask]]
> >>>> On Behalf Of Jeff Templon
> >>>> Sent: 22 February 2012 09:50
> >>>> To: [log in to unmask]
> >>>> Subject: Re: [LCG-ROLLOUT] Torque/CREAM race condition at job
> >>>> completion?
> >>>>
> >>>> Hi,
> >>>>
> >>>> We see these messages here too at Nikhef (giving up after 4
> >>>> attempts), and I cannot link them to jobs that have actually
> >>>> executed, which is quite strange.
> >>>>
> >>>> This would indicate that somehow Torque does not know that the
> >>>> job has landed on a worker node, because there are no "start" or
> >>>> "end" records that reference the WN in question in the Torque
> >>>> server logs. However, the job has somehow landed on a worker
> >>>> node, because the error message 'unable to copy file' is coming
> >>>> from the mom on a worker node.
> >>>>
> >>>> This may be related to something we see here happen from time to
> >>>> time. A job is in state "Q", but somehow it has a worker node
> >>>> assigned to it. We usually delete these jobs when we see them, as
> >>>> there is no way we've found to 'unassign' the worker node.
> >>>>
> >>>> JT
> >>>>
> >>>> On Feb 22, 2012, at 09:20 , Gila Arrondo Miguel Angel wrote:
> >>>>
> >>>>> Hi Leslie,
> >>>>>
> >>>>> With a very similar setup (Moab/Torque on one host and CREAM-CE
> >>>>> on another) we've seen an error somewhat close to what you
> >>>>> describe here. In our /var/log/messages we find tons of entries
> >>>>> like this:
> >>>>>
> >>>>> Feb 22 09:10:24 wn202 pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB err_cre02_780422084_StandardError [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardError' failed with status=1, giving up after 4 attempts
> >>>>>
> >>>>> Feb 22 09:10:24 wn202 pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy file err_cre02_780422084_StandardError to [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardError
> >>>>>
> >>>>> Feb 22 09:10:29 wn202 pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB out_cre02_780422084_StandardOutput [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardOutput' failed with status=1, giving up after 4 attempts
> >>>>>
> >>>>> Have you checked whether these failed transfers are from
> >>>>> cancelled jobs? In our experience, it was always the case. (A
> >>>>> sketch of one way to check is at the bottom of this mail, after
> >>>>> the quoted error.)
> >>>>>
> >>>>> We've also looked at ways to mitigate this annoying verbosity,
> >>>>> but no luck so far. The only option we can think of is to stop
> >>>>> using scp for the copies and move the sandbox to an area shared
> >>>>> with the WNs, so that regular cp is used instead (the $usecp
> >>>>> directive) and these errors are hidden. But, of course, this
> >>>>> approach has its own disadvantages as well.
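> >>>>>
> >>>>> Roughly what we have in mind with $usecp, as a sketch only: the
> >>>>> hostname and paths are illustrative, and the mom config path
> >>>>> differs between packagings. Assuming the CE's sandbox area is
> >>>>> mounted on the WNs at the same path:
> >>>>>
> >>>>>   # /var/spool/torque/mom_priv/config (excerpt)
> >>>>>   # Tell pbs_mom to use cp(1) instead of scp whenever the
> >>>>>   # destination matches this host:path prefix, i.e. stage output
> >>>>>   # through the shared mount rather than over ssh.
> >>>>>   $usecp ce01.example.org:/cream_localsandbox /cream_localsandbox
> >>>>>
> >>>>> pbs_mom re-reads its config on a SIGHUP, though restarting the
> >>>>> mom works just as well.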
> >>>>>
> >>>>> Does anyone else have a better idea?
> >>>>>
> >>>>> Cheers,
> >>>>> Miguel
> >>>>>
> >>>>> --
> >>>>> Miguel Gila
> >>>>> CSCS Swiss National Supercomputing Centre | HPC Solutions
> >>>>> Via Cantonale, Galleria 2 | CH-6928 Manno | Switzerland
> >>>>> Miguel.gila [at] cscs.ch
> >>>>>
> >>>>> From: Leslie Groer <[log in to unmask]>
> >>>>> Reply-To: LHC Computer Grid - Rollout <[log in to unmask]>
> >>>>> Date: Thu, 16 Feb 2012 14:43:56 -0500
> >>>>> To: <[log in to unmask]>
> >>>>> Subject: [LCG-ROLLOUT] Torque/CREAM race condition at job
> >>>>> completion?
> >>>>>
> >>>>> scp: /opt/glite/var/cream_sandbox/atlasprd/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_production_Capability_NULL_prdatl15/61/CREAM619112035/StandardOutput: No such file or directory
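> >>>>>
> >>>>> For completeness, the check mentioned above: one way to tell
> >>>>> whether a failed stage-out like this one came from a cancelled
> >>>>> job is to look for a D (delete) or A (abort) record for it in
> >>>>> the Torque server accounting files. A rough sketch; the job id
> >>>>> is hypothetical and the accounting directory varies by
> >>>>> packaging:
> >>>>>
> >>>>>   #!/bin/bash
> >>>>>   # Usage: ./was_cancelled.sh <numeric torque job id> <YYYYMMDD>
> >>>>>   # Run on the Torque server. Accounting records are one line
> >>>>>   # per event, "timestamp;type;jobid;...", where type E = ended,
> >>>>>   # D = deleted (qdel) and A = aborted by the server.
> >>>>>   JOBID=$1
> >>>>>   DAY=$2
> >>>>>   ACCT=/var/spool/torque/server_priv/accounting/$DAY
> >>>>>   grep ";[EDA];${JOBID}\." "$ACCT"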