Hi, I think I can guess what happened, but to be sure I would need the CREAM logs (along with an ID for a problematic job). Can you please provide this info in a GGUS ticket?

Thanks,
Massimo

On 02/22/2012 12:12 PM, Lukasz Flis wrote:
> Hi,
>
> We have been observing this behaviour for some months at CYFRONET-LCG2.
>
> We use cp instead of scp, but the result is the same: jobs hang in the
> W state, unable to complete the file stage-in procedure because the
> source files have been deleted by CREAM:
>
> Feb 22 12:04:42 <daemon.err> n3-1-2.local pbs_mom[]:
> LOG_ERROR::sys_copy, command '/bin/cp -rp
> /home/grid/sandbox/cream02/alicesgm/_O_GermanGrid_OU_DESY_CN_Antoni_Cyz_alice_Role_lcgadmin_Capability_NULL_sgmalice002/33/CREAM336569863/CREAM336569863_jobWrapper.sh
> CREAM336569863_jobWrapper.sh.35802.21066.1326919965' failed with
> status=1, giving up after 4 attempts
> Feb 22 12:04:42 <daemon.err> n3-1-2.local pbs_mom[]:
> LOG_ERROR::req_cpyfile, Unable to copy file
> /home/grid/sandbox/cream02/alicesgm/_O_GermanGrid_OU_DESY_CN_Antoni_Cyz_alice_Role_lcgadmin_Capability_NULL_sgmalice002/33/CREAM336569863/CREAM336569863_jobWrapper.sh
> to CREAM336569863_jobWrapper.sh.35802.21066.1326919965
> Feb 22 12:04:42 <daemon.err> n3-1-2.local pbs_mom[]:
> LOG_ERROR::req_cpyfile, Unable to copy file
> /home/grid/sandbox/cream02/alicesgm/_O_GermanGrid_OU_DESY_CN_Antoni_Cyz_alice_Role_lcgadmin_Capability_NULL_sgmalice002/33/CREAM336569863/CREAM336569863_jobWrapper.sh
> to CREAM336569863_jobWrapper.sh.35802.21066.1326919965
> *** error from copy
> /bin/cp: cannot stat
> `/home/grid/sandbox/cream02/alicesgm/_O_GermanGrid_OU_DESY_CN_Antoni_Cyz_alice_Role_lcgadmin_Capability_NULL_sgmalice002/33/CREAM336569863/CREAM336569863_jobWrapper.sh':
> No such file or directory
> *** end error output
>
> It looks like not only the proxy file is missing but the whole job
> sandbox. We haven't yet figured out why CREAM removes those files
> before the job is actually launched.
>
> Cheers
> --
> Lukasz Flis
>
>> I believe this is what happens when a job's proxy expires while the
>> job is in the queue.
>>
>> AFAICT, the CREAM proxy cleanup deletes the expired proxy from the
>> CREAM proxy cache but does not cancel the job. So the job happily sits
>> in the queue until torque tries to run it, at which point it tries to
>> stage in the proxy and of course fails.
>>
>> I think it then puts the job into the waiting (W) state for a while
>> before requeuing it, and it then just oscillates between Q and W until
>> you delete it.
>>
>> It's annoying, but you're not losing jobs that would have run.
>>
>> Yours,
>> Chris.
>>
>>> -----Original Message-----
>>> From: LHC Computer Grid - Rollout [mailto:[log in to unmask]]
>>> On Behalf Of Jeff Templon
>>> Sent: 22 February 2012 09:50
>>> To: [log in to unmask]
>>> Subject: Re: [LCG-ROLLOUT] Torque/CREAM race condition at job
>>> completion?
>>>
>>> Hi,
>>>
>>> We see these messages here at Nikhef too ("giving up after 4
>>> attempts"), and I cannot link them to jobs that have actually
>>> executed, which is quite strange.
>>>
>>> This would indicate that somehow torque does not know that the job
>>> has landed on a worker node, because there are no "start" or "end"
>>> records that reference the WN in question in the torque server logs.
>>> However, the job has somehow landed on a worker node, because the
>>> error message 'unable to copy file' is coming from the mom on a
>>> worker node.
>>>
>>> This may be related to something we see happen here from time to
>>> time: a job is in state "Q", but somehow it has a worker node
>>> assigned to it. We usually delete these jobs when we see them, as we
>>> have found no way to 'unassign' the worker node.
>>>
>>> JT
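If Chris's diagnosis above is right, such jobs will never run once CREAM has purged the proxy, so the only remedy is to delete them. A minimal cleanup sketch, assuming stock Torque qselect/qdel on the server host; treating every W-state job as stuck is an assumption of this sketch, not something established in the thread:

    #!/bin/sh
    # Sketch: delete Torque jobs parked in the W (waiting) state.
    # Assumption: on this site, W-state jobs are ones whose CREAM
    # sandbox/proxy is already gone, so they only oscillate Q <-> W.
    # A real script should also check job age and ownership, since W
    # can be legitimate (e.g. a job with a scheduled start time).
    for job in $(qselect -s W); do
        echo "deleting stuck job $job"
        qdel "$job"
    done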
>>> On Feb 22, 2012, at 09:20, Gila Arrondo Miguel Angel wrote:
>>>
>>>> Hi Leslie,
>>>>
>>>> With a very similar setup (Moab/Torque on one host and the CREAM-CE
>>>> on another) we've seen an error somewhat close to what you describe
>>>> here. In our /var/log/messages we find tons of entries like this:
>>>>
>>>> Feb 22 09:10:24 wn202 pbs_mom: LOG_ERROR::sys_copy, command
>>>> '/usr/bin/scp -rpB err_cre02_780422084_StandardError
>>>> [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardError'
>>>> failed with status=1, giving up after 4 attempts
>>>>
>>>> Feb 22 09:10:24 wn202 pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy
>>>> file err_cre02_780422084_StandardError to
>>>> [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardError
>>>>
>>>> Feb 22 09:10:29 wn202 pbs_mom: LOG_ERROR::sys_copy, command
>>>> '/usr/bin/scp -rpB out_cre02_780422084_StandardOutput
>>>> [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardOutput'
>>>> failed with status=1, giving up after 4 attempts
>>>>
>>>> Have you checked whether these failed transfers are from cancelled
>>>> jobs? In our experience, that was always the case.
>>>>
>>>> We've also looked at ways to mitigate this annoying verbosity, but no
>>>> luck so far. The only option we can think of is to stop using scp for
>>>> the copies and move the sandbox to an area shared with the WNs, so
>>>> that regular cp is used (the $usecp directive) and these errors are
>>>> hidden. But, of course, this approach has its own disadvantages as
>>>> well.
>>>>
>>>> Does anyone else have a better idea?
>>>>
>>>> Cheers,
>>>> Miguel
>>>>
>>>> --
>>>> Miguel Gila
>>>> CSCS Swiss National Supercomputing Centre, HPC Solutions
>>>> Via Cantonale, Galleria 2 | CH-6928 Manno | Switzerland
>>>> Miguel.gila [at] cscs.ch
>>>>
>>>> From: Leslie Groer <[log in to unmask]>
>>>> Reply-To: LHC Computer Grid - Rollout <[log in to unmask]>
>>>> Date: Thu, 16 Feb 2012 14:43:56 -0500
>>>> To: <[log in to unmask]>
>>>> Subject: [LCG-ROLLOUT] Torque/CREAM race condition at job completion?
>>>>
>>>> scp:
>>>> /opt/glite/var/cream_sandbox/atlasprd/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_production_Capability_NULL_prdatl15/61/CREAM619112035/StandardOutput:
>>>> No such file or directory
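For reference, the $usecp workaround Miguel mentions is a one-line change in each worker node's pbs_mom configuration. A minimal sketch, assuming the CREAM sandbox directory is exported to the WNs over a shared filesystem; the hostname and paths below are made-up examples, not values from this thread:

    # /var/spool/torque/mom_priv/config on each WN (restart pbs_mom afterwards)
    # Tell pbs_mom that destinations on the CE host are reachable via a
    # local path, so stage-in/stage-out uses cp instead of scp.
    $usecp cream02.example.org:/cream_localsandbox /cream_localsandbox

As Miguel notes, this only hides the scp error messages; it does not fix the underlying cleanup race, and it requires the sandbox area to actually be shared with the worker nodes.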