On Fri, 4 May 2007, Alexander Piavka wrote:
> > Maybe we should not take this behavior as "likely to happen often,
> > so therefore we need a staging service on the glite-CE".
> > Such a service is non-trivial to write and has non-trivial consequences:
> > suddenly the CE becomes a data server,
>
> The output sandbox can be stored temporarily on the closeSE
> instead of being stored temporarily on the gCE. It should then not be a
> big overhead for the gCE to periodically try to globus-url-copy from the
> closeSE to the WMS. And anyway this extra load would occur only when there
> is a problem copying to the WMS, but it would avoid wasting WN resources.
It makes the job state model more complicated. The batch system claims
the job is gone, but the output sandbox has not arrived: currently that
is considered a fatal error, whereas we would need to allow for a new
component running on the CE that could make things right. How much time
should we allow for that? Until the proxy expires? Could be a long time.
Meanwhile the user has no clue if the job will actually succeed or fail.
I am not saying it is impossible to design and implement such a facility,
but that it will have many unexpected consequences and complications,
which I argue outweigh the potential benefits.
> > whereas until now the CE already
> > can get overloaded bringing simple WMS job wrappers into the batch system!
> >
> > > > Note that it is OK to retry a few hours for an output sandbox:
> > > > the WMS could be temporarily unreachable e.g. b/c of maintenance
> > > > of some network component.
> > > But due to this the whole (or a significant part of the) batch
> > > sub-cluster will be wasted.
> >
> > It should not happen often, and the job gives up after about 5 hours.
>
> Maybe:
> __file_tx_retry_count=6 # will be set from an environment variable
> could be set to 2 for SAM tests only
Yes, that parameter should be part of the JDL. Please open a bug. :-)
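
To illustrate how the retry count relates to the "about 5 hours" above,
here is a rough sketch, not the actual job wrapper code. It assumes the
first wait is 300 seconds and that each subsequent wait doubles; both the
initial delay and the doubling are assumptions for illustration, as are the
function name total_wait and the environment variable FILE_TX_RETRY_COUNT.

```shell
#!/bin/sh
# Hypothetical sketch, not the real WMS job wrapper.
# Assumption: the first wait is 300 s and every following wait doubles.

# Print the total number of seconds spent waiting across all retries.
#   $1 = number of retries
#   $2 = initial delay in seconds (optional, defaults to 300)
total_wait() {
    retries=$1
    delay=${2:-300}
    total=0
    i=0
    while [ "$i" -lt "$retries" ]; do
        total=$((total + delay))
        delay=$((delay * 2))
        i=$((i + 1))
    done
    echo "$total"
}

# Retry count taken from a hypothetical environment variable, default 6.
retry_count=${FILE_TX_RETRY_COUNT:-6}
echo "retries=$retry_count total_wait=$(total_wait "$retry_count")s"
```

Under those assumptions, 6 retries add up to 18900 seconds (5.25 hours),
while 2 retries would give up after 900 seconds (15 minutes), which is the
kind of difference a per-job setting for SAM tests would make.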