On Mon, 28 Nov 2005 [log in to unmask] wrote:
> On Mon, 28 Nov 2005, Valery Mitsyn wrote:
>
> > >>> all possible cases on the Wiki with no result. About 10%
> > >>> of the jobs ended with "cannot get JobWrapper output...".
> > >>> All of them were long-running jobs, walltime > 24 hours.
> > >>
> > >>
> > >> Are you sure they were not killed by the batch system?
> > >
> > > Are you sure those jobs actually had a long-lived proxy on myproxy.cern.ch?
> > >
> >
> > From the Torque point of view, all jobs finished successfully.
> >
> > I'm not absolutely sure that the myproxy.cern.ch server was involved
> > or that it was a long-lived proxy; I'm guessing so because
> > there were jobs from LHCb DC which ended with this error.
> > WNs at my site spend 25+ hours on the LHCb jobs
> > and some time on the CMS installation process too.
>
> On Nov. 8 we looked into a CMS job that failed at your site: the job was
> submitted OK, but then it disappeared from the batch system, which is taken
To clarify: "disappeared" means that "qstat" does not show the job (any more).
So, if qstat sometimes fails or times out, it would explain the problem.
> to mean that the job has finished. The RB then found that the job exit
> status had not been communicated to the RB via either of two different methods,
> hence the error message "Cannot read JobWrapper output, both from Condor and
> from Maradona". The job was in fact still running, and later tried to copy
> its output sandbox back to the RB, which failed because the job directory
> had already been removed. I suppose this problem may also have happened
> to the LHCb jobs. In any case, if you have no idea how the batch system
> came to behave like this, next time a job fails with the infamous message,
> please send us the EDG job ID and we will look into it.
>
>
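To illustrate the point about qstat: a status check that treats "qstat did
not list the job" and "qstat itself failed or timed out" as the same thing
will declare a running job finished. Below is a minimal sketch of a more
cautious check. It is not the code the RB or the job manager actually uses;
the function names, the threshold of three consecutive misses, and matching
on the numeric part of the job ID are all assumptions made for illustration.

```python
import subprocess


def query_qstat(timeout_s=30):
    """Run plain `qstat`; return its stdout, or None if the query itself failed."""
    try:
        proc = subprocess.run(
            ["qstat"], capture_output=True, text=True, timeout=timeout_s
        )
    except (OSError, subprocess.TimeoutExpired):
        return None  # qstat missing, not runnable, or timed out
    if proc.returncode != 0:
        return None  # qstat reported an error: no information about the job
    return proc.stdout


def job_state(job_id, listing):
    """Classify one qstat listing as 'present', 'absent' or 'unknown'."""
    if listing is None:
        return "unknown"  # a failed query must not count as "job is gone"
    wanted = job_id.split(".")[0]  # assumption: compare the numeric part only
    for line in listing.splitlines():
        fields = line.split()
        if fields and fields[0].split(".")[0] == wanted:
            return "present"
    return "absent"


def confirmed_gone(job_id, listings, required_misses=3):
    """True only after `required_misses` consecutive successful listings lack the job."""
    misses = 0
    for listing in listings:
        state = job_state(job_id, listing)
        if state == "absent":
            misses += 1
            if misses >= required_misses:
                return True
        elif state == "present":
            misses = 0
        # 'unknown' neither counts as a miss nor resets the counter
    return False
```

With this, a run of failed queries, e.g. confirmed_gone("1234.ce", [None, None, None]),
returns False, so a qstat timeout alone would never be taken to mean that
the job has finished.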