Hi Leslie,
With a very similar setup (Moab/Torque on one host and CREAM-CE on another), we've seen an error quite close to what you describe here. In our /var/log/messages we find tons of entries like this:
Feb 22 09:10:24 wn202 pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB err_cre02_780422084_StandardError [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardError' failed with status=1, giving up after 4 attempts
Feb 22 09:10:24 wn202 pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy file err_cre02_780422084_StandardError to [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardError
Feb 22 09:10:29 wn202 pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB out_cre02_780422084_StandardOutput [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardOutput' failed with status=1, giving up after 4 attempts
Have you checked whether these failed transfers are from cancelled jobs? In our experience, it was always the case.
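In case it helps with that check, here is a rough Python sketch (not something we run in production) that pulls the CREAM job directory names out of the failed sys_copy entries, so the list can be compared against whatever record of cancelled jobs you have. The regular expression is an assumption based only on the log lines above, and the function name is made up for illustration.

```python
import re

# Assumption: failed copies are logged by pbs_mom as "LOG_ERROR::sys_copy, command
# '<scp ...>' failed", with the destination path ending in .../CREAM<digits>/<file>.
FAILED_COPY = re.compile(
    r"pbs_mom: LOG_ERROR::sys_copy, command '.*?/(CREAM\d+)/[^/']+' failed"
)


def failed_copy_job_ids(lines):
    """Return the sorted, de-duplicated CREAM job directory names with failed copies."""
    ids = set()
    for line in lines:
        match = FAILED_COPY.search(line)
        if match:
            ids.add(match.group(1))
    return sorted(ids)
```

You could feed it the log with something like failed_copy_job_ids(open("/var/log/messages", errors="replace")) on the WN and then look up each ID on the CE side.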
We've also looked at ways to mitigate this annoying verbosity, but no luck so far. The only option we can think of is to stop using scp for the copies and move the sandbox to an area shared with the WNs, so that pbs_mom uses regular cp ($usecp directive) and these errors are hidden. But, of course, this approach has its own disadvantages as well.
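For reference, a $usecp line in the pbs_mom config on each WN would look roughly like the fragment below. This is only a sketch: the hostname pattern is a placeholder, and it assumes the sandbox is mounted on the WNs under the same /cream_localsandbox path that the CE uses.

```
# mom_priv/config on the WNs (hostname pattern and mount point are assumptions)
# Syntax: $usecp <host pattern>:<path prefix on that host> <local path prefix>
$usecp *.example.org:/cream_localsandbox /cream_localsandbox
```

With a matching entry, pbs_mom copies to the local path with cp instead of calling scp for destinations under that prefix.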
Does anyone else have a better idea?
Cheers,
Miguel
--
Miguel Gila
CSCS Swiss National Supercomputing Centre
HPC Solutions
Via Cantonale, Galleria 2 | CH-6928 Manno | Switzerland
Miguel.gila [at] cscs.ch
From: Leslie Groer <[log in to unmask]>
Reply-To: LHC Computer Grid - Rollout <[log in to unmask]>
Date: Thu, 16 Feb 2012 14:43:56 -0500
To: <[log in to unmask]>
Subject: [LCG-ROLLOUT] Torque/CREAM race condition at job completion?
scp: /opt/glite/var/cream_sandbox/atlasprd/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_production_Capability_NULL_prdatl15/61/CREAM619112035/StandardOutput: No such file or directory