Hi

I think I can guess what happened, but to be sure I would need the CREAM logs
(along with the ID of a problematic job).

Can you please provide this info in a GGUS ticket?

Thanks, Massimo




On 02/22/2012 12:12 PM, Lukasz Flis wrote:
> Hi,
>
> We have been observing this behaviour for some months at CYFRONET-LCG2.
>
> We use cp instead of scp, but the result is the same: jobs hang in the
> W state, unable to complete the file stage-in procedure because the source
> files have been deleted by CREAM:
>
> Feb 22 12:04:42 <daemon.err> n3-1-2.local pbs_mom[]:
> LOG_ERROR::sys_copy, command '/bin/cp -rp
> /home/grid/sandbox/cream02/alicesgm/_O_GermanGrid_OU_DESY_CN_Antoni_Cyz_alice_Role_lcgadmin_Capability_NULL_sgmalice002/33/CREAM336569863/CREAM336569863_jobWrapper.sh
> CREAM336569863_jobWrapper.sh.35802.21066.1326919965' failed with
> status=1, giving up after 4 attempts
> Feb 22 12:04:42 <daemon.err> n3-1-2.local pbs_mom[]:
> LOG_ERROR::req_cpyfile, Unable to copy file
> /home/grid/sandbox/cream02/alicesgm/_O_GermanGrid_OU_DESY_CN_Antoni_Cyz_alice_Role_lcgadmin_Capability_NULL_sgmalice002/33/CREAM336569863/CREAM336569863_jobWrapper.sh
> to CREAM336569863_jobWrapper.sh.35802.21066.1326919965
> Feb 22 12:04:42 <daemon.err> n3-1-2.local pbs_mom[]:
> LOG_ERROR::req_cpyfile, Unable to copy file
> /home/grid/sandbox/cream02/alicesgm/_O_GermanGrid_OU_DESY_CN_Antoni_Cyz_alice_Role_lcgadmin_Capability_NULL_sgmalice002/33/CREAM336569863/CREAM336569863_jobWrapper.sh
> to CREAM336569863_jobWrapper.sh.35802.21066.1326919965 *** error from
> copy /bin/cp: cannot stat
> `/home/grid/sandbox/cream02/alicesgm/_O_GermanGrid_OU_DESY_CN_Antoni_Cyz_alice_Role_lcgadmin_Capability_NULL_sgmalice002/33/CREAM336569863/CREAM336569863_jobWrapper.sh':
> No such file or directory *** end error output
>
> It looks like not only the proxy file is missing, but the whole job sandbox.
> We haven't yet figured out why CREAM is removing those files before the
> job is actually launched.
>
> Cheers
> --
> Lukasz Flis
>
>>
>> I believe this is what happens when a job's proxy expires while the job
>> is in the queue.
>>
>> AFAICT, the CREAM proxy cleanup deletes the expired proxy from the CREAM
>> proxy cache but does not cancel the job. So the job happily sits in the
>> queue until Torque tries to run it, at which point it tries to stage in
>> the proxy and of course fails.
>>
>> I think it then puts the job into the waiting (W) state for a while before
>> requeuing it, and it then just oscillates between Q and W until you delete
>> it.
>>
>> It's annoying but you're not losing jobs that would have run.
>>
>> Yours,
>> Chris.
>>
>>> -----Original Message-----
>>> From: LHC Computer Grid - Rollout [mailto:[log in to unmask]]
>>> On Behalf Of Jeff Templon
>>> Sent: 22 February 2012 09:50
>>> To: [log in to unmask]
>>> Subject: Re: [LCG-ROLLOUT] Torque/CREAM race condition at job
>>> completion?
>>>
>>> Hi,
>>>
>>> we see these messages here too at Nikhef ("giving up after 4 attempts"),
>>> and I cannot link them to jobs that have actually executed, which is
>>> quite strange.
>>>
>>> This would indicate that somehow Torque does not know that the job has
>>> landed on a worker node, because there are no "start" or "end" records
>>> that reference the WN in question in the Torque server logs. However,
>>> the job has somehow landed on a worker node, because the error message
>>> 'unable to copy file' is coming from the pbs_mom on a worker node.
>>>
>>> This may be related to something we see here happen from time to time.
>>> A job is in state "Q", but somehow it has a worker node assigned to it.
>>> We usually delete these jobs when we see them, as there is no way we've
>>> found to 'unassign' the worker node.
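>>>
>>> A rough way to spot such jobs from the Torque server might be something
>>> like the following (an untested sketch; it assumes the stale assignment
>>> actually shows up as an exec_host attribute in 'qstat -f', which may not
>>> always be the case):
>>>
>>>   # list jobs still in state Q that already carry an exec_host assignment
>>>   for j in $(qstat | awk 'NR>2 && $5=="Q" {print $1}'); do
>>>       qstat -f "$j" | grep -q "exec_host" && echo "$j is queued but has a node assigned"
>>>   done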
>>>
>>>
>>> JT
>>>
>>> On Feb 22, 2012, at 09:20, Gila Arrondo Miguel Angel wrote:
>>>
>>>> Hi Leslie,
>>>>
>>>> With a very similar setup (Moab/Torque on one host and CREAM-CE on
>>>> another) we've seen an error somewhat similar to what you describe here.
>>>> In our /var/log/messages we find tons of entries like this:
>>>>
>>>> Feb 22 09:10:24 wn202 pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB err_cre02_780422084_StandardError [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardError' failed with status=1, giving up after 4 attempts
>>>>
>>>> Feb 22 09:10:24 wn202 pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy file err_cre02_780422084_StandardError to [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardError
>>>>
>>>> Feb 22 09:10:29 wn202 pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB out_cre02_780422084_StandardOutput [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardOutput' failed with status=1, giving up after 4 attempts
>>>>
>>>> Have you checked whether these failed transfers are from cancelled
>>>> jobs? In our experience, that has always been the case.
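>>>>
>>>> One quick way to check on the Torque server is to look up the job's
>>>> history with tracejob, e.g. (here <torque_jobid> is just a placeholder
>>>> for the numeric PBS job id, and the exact log strings differ between
>>>> Torque versions):
>>>>
>>>>   tracejob -n 3 <torque_jobid> | egrep -i 'delete|dequeuing|Exit_status'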
>>>>
>>>> We've also looked at ways to mitigate this annoying verbosity, but no
>>>> luck so far. The only option we can think of is to stop using scp for
>>>> the copies and move the sandbox to an area shared with the WNs, so that
>>>> regular cp is used (the $usecp directive) and these errors are hidden.
>>>> But, of course, this approach has its own disadvantages as well.
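>>>>
>>>> For reference, with the sandbox exported to the WNs over a shared
>>>> filesystem, the pbs_mom configuration on each WN would only need a
>>>> mapping roughly like the one below (the CE hostname and mount point are
>>>> placeholders for this sketch); pbs_mom then uses plain cp instead of scp
>>>> for destinations that match the mapping:
>>>>
>>>>   # mom_priv/config on each WN
>>>>   $usecp creamce.example.org:/opt/glite/var/cream_sandbox /opt/glite/var/cream_sandbox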
>>>>
>>>> Does anyone else have a better idea?
>>>>
>>>> Cheers,
>>>> Miguel
>>>>
>>>>
>>>>
>>>> --
>>>> Miguel Gila
>>>> CSCS Swiss National Supercomputing Centre | HPC Solutions
>>>> Via Cantonale, Galleria 2 | CH-6928 Manno | Switzerland
>>>> Miguel.gila [at] cscs.ch
>>>>
>>>> From: Leslie Groer
>>>> <[log in to unmask]<mailto:[log in to unmask]>>
>>>> Reply-To: LHC Computer Grid - Rollout
>>>> <[log in to unmask]<mailto:[log in to unmask]>>
>>>> Date: Thu, 16 Feb 2012 14:43:56 -0500
>>>> To:<[log in to unmask]<mailto:[log in to unmask]>>
>>>> Subject: [LCG-ROLLOUT] Torque/CREAM race condition at job completion?
>>>>
>>>> scp: /opt/glite/var/cream_sandbox/atlasprd/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_production_Capability_NULL_prdatl15/61/CREAM619112035/StandardOutput: No such file or directory