Hi,
We see these messages here too at Nikhef ("giving up after 4 attempts"), and I cannot link them to jobs that have actually executed, which is quite strange.
This would indicate that Torque somehow does not know the job has landed on a worker node, because there are no "start" or "end" records referencing the WN in question in the Torque server logs. However, the job must have reached a worker node, because the 'Unable to copy file' error message comes from the pbs_mom on a worker node.
This may be related to something we see happen here from time to time: a job is in state "Q", yet it somehow has a worker node assigned to it. We usually delete these jobs when we see them, as we have found no way to 'unassign' the worker node.
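One way to spot these stuck jobs is to scan `qstat -f` output for jobs whose job_state is still "Q" but which already carry an exec_host. A rough sketch (the job ids and host names in the sample input are fabricated for illustration; the awk field positions assume the standard `attr = value` layout of `qstat -f`):

```shell
# Sample of what `qstat -f` might print for two queued jobs,
# one of which is stuck with an exec_host already assigned:
qstat_f_sample='Job Id: 12345.torque.example.org
    job_state = Q
    exec_host = wn202/0

Job Id: 12346.torque.example.org
    job_state = Q
'

# In real use, replace the printf with:  qstat -f | awk ...
printf '%s\n' "$qstat_f_sample" | awk '
  /^Job Id:/    { if (state == "Q" && host != "") print id   # flush previous job
                  id = $3; state = ""; host = "" }
  /job_state =/ { state = $3 }
  /exec_host =/ { host = $3 }
  END           { if (state == "Q" && host != "") print id } # flush last job
'
# prints: 12345.torque.example.org
# each printed job id could then be removed with: qdel <jobid>
```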
JT
On Feb 22, 2012, at 09:20 , Gila Arrondo Miguel Angel wrote:
> Hi Leslie,
>
> With a very similar setup (Moab/Torque on one host and the CREAM-CE on another), we've seen an error quite close to what you describe here. In our /var/log/messages we find tons of entries like this:
>
> Feb 22 09:10:24 wn202 pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB err_cre02_780422084_StandardError [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardError' failed with status=1, giving up after 4 attempts
>
> Feb 22 09:10:24 wn202 pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy file err_cre02_780422084_StandardError to [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardError
>
> Feb 22 09:10:29 wn202 pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB out_cre02_780422084_StandardOutput [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardOutput' failed with status=1, giving up after 4 attempts
>
> Have you checked whether these failed transfers are from cancelled jobs? In our experience, it was always the case.
>
> We've also looked at ways to mitigate this annoying verbosity, but with no luck so far. The only option we can think of is to stop using scp for the copies and move the sandbox to a shared area visible from the WNs, so that a regular cp is used (via the $usecp directive) and these errors disappear. But, of course, this approach has its own disadvantages as well.
>
> Does anyone else have a better idea?
>
> Cheers,
> Miguel
>
>
>
> --
> Miguel Gila
> CSCS Swiss National Supercomputing Centre
> HPC Solutions
> Via Cantonale, Galleria 2 | CH-6928 Manno | Switzerland
> Miguel.gila [at] cscs.ch
>
> From: Leslie Groer <[log in to unmask]>
> Reply-To: LHC Computer Grid - Rollout <[log in to unmask]>
> Date: Thu, 16 Feb 2012 14:43:56 -0500
> To: <[log in to unmask]>
> Subject: [LCG-ROLLOUT] Torque/CREAM race condition at job completion?
>
> scp: /opt/glite/var/cream_sandbox/atlasprd/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_production_Capability_NULL_prdatl15/61/CREAM619112035/StandardOutput: No such file or directory
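For reference, the $usecp workaround Miguel mentions above is configured in each worker node's mom_priv/config. A hypothetical sketch, assuming the CE host is cre02 and the CREAM sandbox area is mounted on the WNs at the same path (both names are illustrative):

```
# /var/spool/torque/mom_priv/config on each WN (paths are illustrative)
# Map the scp destination on the CE host to the locally mounted path,
# so pbs_mom copies output files with cp instead of scp:
$usecp cre02.example.org:/cream_localsandbox /cream_localsandbox
```

pbs_mom must be restarted on the WNs for the change to take effect, and of course this only helps if the sandbox area really is shared.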