Hi,

The cause of the state flapping (Q->W->Q->W->Q) is, I think, the following:

* the WN gets the job with a stage-in definition (a set of files to be copied to the WN)
* the WN tries to copy the files, so the job moves to the W state
* the WN makes several copy attempts before it gives up and returns the job to the Q state (this may take a long time)
* Maui/Moab tries to start the queued job again, so the situation repeats until the job gets held or cancelled by the scheduler.

Some Maui configurations put a job into a hold state after several unsuccessful job start attempts. Such a hold can be released automatically by Maui after a certain period, defined by DEFERTIME if I recall correctly.

The fact remains that the job blocks resources and must be deleted by hand (a rough cleanup sketch is at the bottom of this mail).
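For reference, the defer behaviour is controlled by a handful of maui.cfg parameters. If memory serves, a minimal excerpt looks something like the following (the values here are illustrative, not recommendations; check your local config and the Maui docs):

    # maui.cfg excerpt -- example values only
    DEFERSTARTCOUNT  3        # failed start attempts before the job is deferred
    DEFERTIME        1:00:00  # how long a deferred job waits before another try
    DEFERCOUNT       24       # defers allowed before the job goes into batch hold

With settings like these the flapping is throttled rather than fixed: the job still retries after every DEFERTIME window until DEFERCOUNT is exhausted and it lands in batch hold.

> Not sure, I only recently changed it on one of the CreamCEs and having just
> looked we have got jobs from that CreamCE in W state.
>
> I haven't checked if these were submitted before I made the change or
> whether they failed because the proxy was removed.
>
> I need to give the change a little longer, then if there are still jobs
> appearing from the CE in W state do some more analysis.
>
> Yours,
> Chris.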
>> -----Original Message-----
>> From: Massimo Sgaravatto [mailto:[log in to unmask]]
>> Sent: 22 February 2012 11:41
>> To: LHC Computer Grid - Rollout
>> Cc: Brew, Chris (STFC,RAL,PPD)
>> Subject: Re: [LCG-ROLLOUT] Torque/CREAM race condition at job completion?
>>
>> On 02/22/2012 12:29 PM, Chris Brew wrote:
>>> Hi Jeff,
>>>
>>> I cannot take the credit, we've discussed this quite a bit in GridPP.
>>>
>>> There's at least one ticket in the system about it:
>>>
>>> https://ggus.eu/tech/ticket_show.php?ticket=72506
>>>
>>> I've recently tried setting 'delegation_purge_rate="-1"'[1] on one of our
>>> CreamCEs to see if that helps. Initial indications are that it doesn't.
>>
>> You mean that the proxy is cleaned even when delegation_purge_rate is -1,
>> or that the proxy is not cleaned but you still see the Q->W->Q-> ... problem?
>>
>>> Yours,
>>> Chris.
>>>
>>> [1] http://grid.pd.infn.it/cream/field.php?n=Main.HowToConfigureTheProxyPurger
>>>
>>>> -----Original Message-----
>>>> From: LHC Computer Grid - Rollout [mailto:LCG-[log in to unmask]]
>>>> On Behalf Of Jeff Templon
>>>> Sent: 22 February 2012 11:18
>>>> To: [log in to unmask]
>>>> Subject: Re: [LCG-ROLLOUT] Torque/CREAM race condition at job completion?
>>>>
>>>> Hi,
>>>>
>>>> On Feb 22, 2012, at 11:29 , Chris Brew wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I believe this is what happens when a job's proxy expires while the job
>>>>> is in the queue.
>>>>>
>>>>> AFAICT, the CREAM proxy cleanup deletes the expired proxy from the
>>>>> CREAM proxy cache but does not cancel the job. So the job happily sits
>>>>> in the queue until Torque tries to run it, at which point it tries to
>>>>> stage in the proxy and of course fails.
>>>>>
>>>>> I think it then puts the job into the waiting (W) state for a while
>>>>> before requeuing it, and it then just oscillates between Q and W until
>>>>> you delete it.
>>>>
>>>> We see exactly this: oscillations between state W and state Q, continuing
>>>> forever until the job is deleted. Thanks for the explanation. Chris, do
>>>> you want to submit the GGUS ticket on this one on behalf of the rest of us?
>>>>
>>>> JT
>>>>
>>>>> It's annoying but you're not losing jobs that would have run.
>>>>>
>>>>> Yours,
>>>>> Chris.
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: LHC Computer Grid - Rollout [mailto:LCG-[log in to unmask]]
>>>>>> On Behalf Of Jeff Templon
>>>>>> Sent: 22 February 2012 09:50
>>>>>> To: [log in to unmask]
>>>>>> Subject: Re: [LCG-ROLLOUT] Torque/CREAM race condition at job completion?
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> we see these messages here too at Nikhef (giving up after 4 attempts)
>>>>>> and I cannot link them to jobs that have actually executed, which is
>>>>>> quite strange.
>>>>>>
>>>>>> This would indicate that somehow Torque does not know that the job has
>>>>>> landed on a worker node, because there are no "start" or "end" records
>>>>>> that reference the WN in question in the Torque server logs. However,
>>>>>> the job has somehow landed on a worker node, because the error message
>>>>>> 'unable to copy file' is coming from the mom on a worker node.
>>>>>>
>>>>>> This may be related to something we see here happen from time to time.
>>>>>> A job is in state "Q", but somehow it has a worker node assigned to it.
>>>>>> We usually delete these jobs when we see them, as there is no way we've
>>>>>> found to 'unassign' the worker node.
>>>>>>
>>>>>> JT
>>>>>>
>>>>>> On Feb 22, 2012, at 09:20 , Gila Arrondo Miguel Angel wrote:
>>>>>>
>>>>>>> Hi Leslie,
>>>>>>>
>>>>>>> With a very similar setup (Moab/Torque on 1 host and CREAM-CE on
>>>>>>> another) we've seen an error somehow close to what you describe here.
>>>>>>> In our /var/log/messages we find tons of entries like this:
>>>>>>>
>>>>>>> Feb 22 09:10:24 wn202 pbs_mom: LOG_ERROR::sys_copy, command
>>>>>>> '/usr/bin/scp -rpB err_cre02_780422084_StandardError
>>>>>>> [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardError'
>>>>>>> failed with status=1, giving up after 4 attempts
>>>>>>>
>>>>>>> Feb 22 09:10:24 wn202 pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy
>>>>>>> file err_cre02_780422084_StandardError to
>>>>>>> [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardError
>>>>>>>
>>>>>>> Feb 22 09:10:29 wn202 pbs_mom: LOG_ERROR::sys_copy, command
>>>>>>> '/usr/bin/scp -rpB out_cre02_780422084_StandardOutput
>>>>>>> [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardOutput'
>>>>>>> failed with status=1, giving up after 4 attempts
>>>>>>>
>>>>>>> Have you checked whether these failed transfers are from cancelled
>>>>>>> jobs? In our experience, it was always the case.
>>>>>>>
>>>>>>> We've also looked at ways to mitigate this annoying verbosity, but no
>>>>>>> luck so far. The only option that we can think of is to stop using scp
>>>>>>> for copies and move the sandbox to a shared area with the WNs, so you
>>>>>>> use regular cp ($usecp directive) and these errors are hidden. But, of
>>>>>>> course, this approach has its own disadvantages as well.
>>>>>>>
>>>>>>> Does anyone else have a better idea?
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Miguel
>>>>>>>
>>>>>>> --
>>>>>>> Miguel Gila
>>>>>>> CSCS Swiss National Supercomputing Centre HPC Solutions
>>>>>>> Via Cantonale, Galleria 2 | CH-6928 Manno | Switzerland
>>>>>>> Miguel.gila [at] cscs.ch
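On Miguel's $usecp point: assuming the CREAM sandbox area were exported to the WNs (e.g. over NFS), the change would be a single line in mom_priv/config on each WN. A sketch, with a made-up hostname and mount point for a hypothetical site:

    # /var/spool/torque/mom_priv/config -- example only
    # map scp destinations on the CE host to a locally mounted path
    $usecp cream-ce.example.org:/opt/glite/var/cream_sandbox /mnt/cream_sandbox

The mom then does a plain cp into the mounted path instead of spawning scp, which is how the repeated scp failures (and the log noise) get hidden in Miguel's scheme, at the cost of running and securing a shared filesystem.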
>>>>>>> From: Leslie Groer <[log in to unmask]<mailto:[log in to unmask]>>
>>>>>>> Reply-To: LHC Computer Grid - Rollout <[log in to unmask]<mailto:[log in to unmask]>>
>>>>>>> Date: Thu, 16 Feb 2012 14:43:56 -0500
>>>>>>> To: <[log in to unmask]<mailto:LCG-[log in to unmask]>>
>>>>>>> Subject: [LCG-ROLLOUT] Torque/CREAM race condition at job completion?
>>>>>>>
>>>>>>> scp: /opt/glite/var/cream_sandbox/atlasprd/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_production_Capability_NULL_prdatl15/61/CREAM619112035/StandardOutput: No such file or directory
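As for the by-hand deletion mentioned at the top, here is a rough sketch of the cleanup, assuming the standard Torque qstat layout where the state is the fifth column; always eyeball the list before feeding it to qdel:

    # list jobs currently stuck in W state
    qstat | awk '$5 == "W" {print $1}'

    # after checking the list, delete them
    qstat | awk '$5 == "W" {print $1}' | xargs -r qdel

Since the jobs oscillate between W and Q, a given job may happen to be in Q at the moment you look, so you may need to run this more than once.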