On Mon, 28 Nov 2005, Valery Mitsyn wrote:
> >>> all possible cases on the Wiki with no result. About 10%
> >>> or so of the jobs ended with "cannot get JobWrapper output...".
> >>> All of them were long-running jobs, walltime > 24 hours.
> >>
> >>
> >> Are you sure they were not killed by the batch system?
> >
> > Are you sure those jobs actually had a long-lived proxy on myproxy.cern.ch?
> >
>
> From the torque point of view, all jobs have finished successfully.
>
> I'm not absolutely sure that the myproxy.cern.ch server was involved
> or that it was a long-lived proxy; I'm guessing so because
> there were jobs from the LHCb DC which ended with this error.
> WNs at my site spend 25+ hours on jobs for LHCb,
> and some time on the CMS installation process too.

On Nov. 8 we looked into a CMS job that failed at your site: the job was
submitted OK, but then it disappeared from the batch system, which is taken
to mean that the job has finished. The RB then found that the job exit
status had not been communicated to the RB via either of two different methods,
hence the error message "Cannot read JobWrapper output, both from Condor and
from Maradona". The job was in fact still running, and later tried to copy
its output sandbox back to the RB, which failed because the job directory
had already been removed. I suppose this problem may also have happened
to the LHCb jobs. In any case, if you have no idea how the batch system
came to behave like this, next time a job fails with the infamous message,
please send us the EDG job ID and we will look into it.