On Wed, 30 Nov 2005, Valery Mitsyn wrote:
> On Mon, 28 Nov 2005, Maarten Litmaath, CERN wrote:
>
> > On Mon, 28 Nov 2005 [log in to unmask] wrote:
> >
> >> On Mon, 28 Nov 2005, Valery Mitsyn wrote:
> >>
> >>>>>> all possible cases on the Wiki with no result. About 10%
> >>>>>> or so of the jobs ended with "cannot get JobWrapper output...".
> >>>>>> All of them were long-running jobs, walltime > 24 hours.
> >>>>>
> >>>>>
> >>>>> Are you sure they were not killed by the batch system?
> >>>>
> >>>> Are you sure those jobs actually had a long-lived proxy on myproxy.cern.ch?
> >>>>
> >>>
> >>> From the torque point of view all jobs have finished successfully.
> >>>
> >>> I'm not absolutely sure that the myproxy.cern.ch server was involved
> >>> and that it was a long-lived proxy; I'm guessing so because
> >>> there were jobs from the LHCb DC which ended with this error.
> >>> WNs at my site will spend 25+ hours on the jobs for LHCb
> >>> and some time on the CMS installation process too.
> >>
> >> On Nov. 8 we looked into a CMS job that failed at your site: the job was
> >> submitted OK, but then it disappeared from the batch system, which is taken
> >
> > To clarify: "disappeared" means that "qstat" does not show the job (any more).
> > So, if qstat sometimes fails or times out, it would explain the problem.
>
> Good idea! I think I have it tracked down. I've installed a new
> version of torque; they have changed the qstat -f output format (a bit ;-)).
> Instead of "Job Id:" in the first line for every job, they print "job:"
> now. I've edited lcgpbs.pm and things are going much better now. Another
> solution would be to revert to pbs.pm in globus.
> He-he, we'll see...
Good job! :-)
Feel free to open a bug in Savannah about this. It seems easy to have the
job manager support both syntaxes.
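
A minimal sketch of accepting both header spellings, written in Python
purely for illustration: the real job manager is the Perl module
lcgpbs.pm, whose code is not shown in this thread. The function name
job_ids and the sample job IDs are hypothetical, and the sketch assumes
the header line starts in column 0 while attribute lines are indented.

```python
import re

# Accept both first-line spellings mentioned in this thread:
# "Job Id:" (older output) and "job:" (newer torque output).
JOB_HEADER = re.compile(r"^(?:Job Id|job):\s*(\S+)\s*$")


def job_ids(qstat_output):
    """Return the job IDs found in `qstat -f`-style text (hypothetical helper)."""
    ids = []
    for line in qstat_output.splitlines():
        match = JOB_HEADER.match(line)
        if match:
            ids.append(match.group(1))
    return ids


if __name__ == "__main__":
    # Made-up sample output, not captured from a real batch system.
    old_style = "Job Id: 1234.ce.example.org\n    job_state = R\n"
    new_style = "job: 1235.ce.example.org\n    job_state = R\n"
    print(job_ids(old_style) + job_ids(new_style))
```

Anchoring the pattern at the start of the line keeps indented attributes
such as "job_state" from being mistaken for a job header.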
> "Job Id:" in the LCG version , in torque-2.0.0p3
> >
> >> to mean that the job has finished. The RB then found that the job exit
> >> status had not been communicated to the RB via either of two different methods,
> >> hence the error message "Cannot read JobWrapper output, both from Condor and
> >> from Maradona". The job was in fact still running, and later tried to copy
> >> its output sandbox back to the RB, which failed because the job directory
> >> had already been removed. I suppose this problem may also have happened
> >> to the LHCb jobs. In any case, if you have no idea how the batch system
> >> came to behave like this, next time a job fails with the infamous message,
> >> please send us the EDG job ID and we will look into it.
> >>
> >>
> >
>
>