Hi,
I believe this is what happens when a job's proxy expires while the job is
in the queue.
AFAICT, the CREAM proxy cleanup deletes the expired proxy from the CREAM
proxy cache but does not cancel the job. So the job happily sits in the
queue until Torque tries to run it, at which point it tries to stage in
the proxy and of course fails.
I think Torque then puts the job into waiting (W) state for a while before
requeuing it, and it then just oscillates between Q and W until you delete
it.
It's annoying but you're not losing jobs that would have run.
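
For what it's worth, a rough way to spot these stuck jobs on the Torque
server is sketched below (untested; it assumes the default qstat column
layout, where the fifth column is the job state):

  # list jobs currently sitting in the waiting (W) state
  qstat | awk 'NR > 2 && $5 == "W" {print $1}'

  # once you've confirmed the job's proxy really has expired, delete it
  # (<job_id> is a placeholder for an ID from the listing above)
  qdel <job_id>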
Yours,
Chris.
> -----Original Message-----
> From: LHC Computer Grid - Rollout [mailto:[log in to unmask]]
> On Behalf Of Jeff Templon
> Sent: 22 February 2012 09:50
> To: [log in to unmask]
> Subject: Re: [LCG-ROLLOUT] Torque/CREAM race condition at job completion?
>
> Hi,
>
> we see these messages here too at Nikhef ("giving up after 4 attempts")
> and I cannot link them to jobs that have actually executed, which is
> quite strange.
>
> This would indicate that somehow Torque does not know that the job has
> landed on a worker node, because there are no "start" or "end" records
> that reference the WN in question in the Torque server logs. However,
> the job has somehow landed on a worker node, because the error message
> 'unable to copy file' is coming from the mom on a worker node.
>
> This may be related to something we see here happen from time to time.
> A job is in state "Q", but somehow it has a worker node assigned to it.
> We usually delete these jobs when we see them, as there is no way we've
> found to 'unassign' the worker node.
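>
> A rough way to spot them (a sketch only, untested; it assumes the
> default qstat column layout, where the fifth column is the job state):
>
>   # flag queued jobs that already have a worker node assigned
>   for j in $(qstat | awk 'NR > 2 && $5 == "Q" {print $1}'); do
>       qstat -f "$j" | grep -q exec_host && echo "$j: queued but assigned"
>   done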
>
>
> JT
>
> On Feb 22, 2012, at 09:20, Gila Arrondo Miguel Angel wrote:
>
> > Hi Leslie,
> >
> > With a very similar setup (Moab/Torque on one host and CREAM-CE on
> > another) we've seen an error somewhat similar to what you describe
> > here. In our /var/log/messages we find tons of entries like this:
> >
> > Feb 22 09:10:24 wn202 pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB err_cre02_780422084_StandardError [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardError' failed with status=1, giving up after 4 attempts
> >
> > Feb 22 09:10:24 wn202 pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy file err_cre02_780422084_StandardError to [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardError
> >
> > Feb 22 09:10:29 wn202 pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB out_cre02_780422084_StandardOutput [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardOutput' failed with status=1, giving up after 4 attempts
> >
> > Have you checked whether these failed transfers are from cancelled
> > jobs? In our experience, it was always the case.
> >
> > We've also looked at ways to mitigate this annoying verbosity, but no
> > luck so far. The only option that we can think of is to stop using scp
> > for copies and move the sandbox to a shared area with the WNs, so that
> > you can use regular cp (the $usecp directive) and these errors are
> > hidden. But, of course, this approach has its own disadvantages as
> > well.
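> >
> > For reference, a $usecp mapping in the pbs_mom config (typically
> > mom_priv/config) would look something like the sketch below; the
> > hostname and shared-area path are hypothetical, adjust for your site:
> >
> >   # tell pbs_mom to use plain cp instead of scp when the destination
> >   # matches this host:path prefix, rewriting it to the shared mount
> >   $usecp cream-ce.example.org:/cream_localsandbox /shared/cream_localsandbox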
> >
> > Does anyone else have a better idea?
> >
> > Cheers,
> > Miguel
> >
> >
> >
> > --
> > Miguel Gila
> > CSCS Swiss National Supercomputing Centre | HPC Solutions
> > Via Cantonale, Galleria 2 | CH-6928 Manno | Switzerland
> > Miguel.gila [at] cscs.ch
> >
> > From: Leslie Groer <[log in to unmask]>
> > Reply-To: LHC Computer Grid - Rollout <[log in to unmask]>
> > Date: Thu, 16 Feb 2012 14:43:56 -0500
> > To: <[log in to unmask]>
> > Subject: [LCG-ROLLOUT] Torque/CREAM race condition at job completion?
> >
> > scp: /opt/glite/var/cream_sandbox/atlasprd/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_production_Capability_NULL_prdatl15/61/CREAM619112035/StandardOutput: No such file or directory
|