On Mon, 28 Nov 2005, Maarten Litmaath, CERN wrote:
> On Mon, 28 Nov 2005, Valery Mitsyn wrote:
>
>>>>> all possible cases on the Wiki with no result. About 10%
>>>>> or so of the jobs ended with "cannot get JobWrapper output...".
>>>>> All of them were long-running jobs, walltime > 24 hours.
>>>>
>>>>
>>>> Are you sure they were not killed by the batch system?
>>>
>>> Are you sure those jobs actually had a long-lived proxy on myproxy.cern.ch?
>>>
>>
>> From the torque point of view all jobs finished successfully.
>>
>> I'm not absolutely sure that the myproxy.cern.ch server was involved
>> and that it was a long-lived proxy; I'm guessing so because
>> there were jobs from LHCb DC which ended with this error.
>> WNs at my site will spend 25+ hours on the jobs for LHCb
>> and some time on the CMS installation process too.
>
> On Nov. 8 we looked into a CMS job that failed at your site: the job was
> submitted OK, but then it disappeared from the batch system, which is taken
> to mean that the job has finished. The RB then found that the job exit
> status had not been communicated to the RB via either of two different methods,
> hence the error message "Cannot read JobWrapper output, both from Condor and
> from Maradona". The job was in fact still running, and later tried to copy
> its output sandbox back to the RB, which failed because the job directory
> had already been removed. I suppose this problem may also have happened
> to the LHCb jobs. In any case, if you have no idea how the batch system
> came to behave like this, next time a job fails with the infamous message,
> please send us the EDG job ID and we will look into it.
>
Many thanks for digging into my problem.
As for the strange behaviour of torque at my site early in November:
there was a problem with our Internet link, which led to named
and ntpd crashing on the LAN. The link should be better now.
OK, I'll see how things go from now on.
--
Best regards,
Valery Mitsyn