On Thu, 3 May 2007 [log in to unmask] wrote:
> On Thu, 3 May 2007, Maarten Litmaath, CERN wrote:
>
> > Note that it is OK to retry a few hours for an output sandbox:
> > the WMS could be temporarily unreachable e.g. b/c of maintenance
> > of some network component.
> >
> > > If the InputSandbox can't be staged on the first and maybe second try, then
> > > the job should just be aborted, since it has not yet started any
> > > computations, so aborting such a job would be no great loss.
> >
> > Correct. Please open a bug in Savannah.
>
> On second thought I am not so sure the current behavior is wrong:
> also for the input sandbox it gives up after about 5 hours.
>
> The idea is that a job should not give up too quickly,
> once it managed to get to the head of the queue!
>
> The exact numbers to be used may be debated.
>
Maybe the JobWrapper could be changed to checkpoint itself if it fails to
stage in the input sandbox or, later, fails to stage out the output sandbox.
PBS would then rerun it later from the checkpointed stage.
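
A rough sketch of that idea, in Python rather than the JobWrapper's own
code. Everything here is hypothetical: run_stage, resume_point, the JSON
checkpoint file and the 5-minute retry delay are illustrative assumptions,
not part of the actual JobWrapper or of PBS. Only the roughly 5-hour retry
budget comes from the discussion above.

```python
import json
import time
from pathlib import Path


def run_stage(name, action, checkpoint, max_seconds=5 * 3600, delay=300,
              sleep=time.sleep, clock=time.monotonic):
    """Retry action() until it succeeds or the time budget runs out.

    Returns True on success. When giving up, writes the stage name to the
    checkpoint file and returns False, so the job can be requeued and
    resumed at this stage instead of being aborted.
    """
    deadline = clock() + max_seconds
    while True:
        try:
            action()  # hypothetical sandbox transfer; raises OSError on failure
            return True
        except OSError:
            # Give up if another wait would take us past the deadline.
            if clock() + delay > deadline:
                Path(checkpoint).write_text(json.dumps({"resume_at": name}))
                return False
            sleep(delay)


def resume_point(checkpoint):
    """Return the stage recorded by an earlier run, or None if there is none."""
    path = Path(checkpoint)
    if path.exists():
        return json.loads(path.read_text())["resume_at"]
    return None
```

On a rerun the wrapper would call resume_point() first and skip any stage
that had already completed; how the batch system is told to requeue the job
is left open here.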
Alex