On Thu, 3 May 2007 [log in to unmask] wrote:

> Hi Alex,
>
> > On Thu, 3 May 2007 [log in to unmask] wrote:
> >
> > > On Thu, 3 May 2007, Alexander Piavka wrote:
> > >
> > > >  I think that if the OutputSandbox can't be staged out back to the WMS on the first
> > > > try, the gCE should take care of this instead of the JobWrapper on the WN.
> > >
> > > That would be an extra complication to work around a problem that
> > > should not be present in the first place.  We need to understand
> > > why globus-url-copy is hanging and fix that.
> >  But globus-url-copy is not hanging, it just fails to connect to the WMS or
> > something similar. There is no way to avoid that. And when this happens
> > with a widely used WMS, a huge amount of CPU time is wasted, so after all
> > it is probably worth the extra complications to gain more scalability. As
> > for the current situation: looking at the SEE ROC gCE SAM tests, it looks like
> > all sites are currently having this problem, since they all have a mix of
> > failed and successful SAM tests.
>
> We have started looking into those failures.  We noticed that they
> all refer to rb108, whereas half of the SAM glite-CE tests use rb118.
> We do not know the cause yet, but something started going wrong on Sunday.
> Maybe we should not take this behavior as "likely to happen often,
> so therefore we need a staging service on the glite-CE".
> Such a service is non-trivial to write and has non-trivial consequences:
> suddenly the CE becomes a data server,

The output sandbox could be stored temporarily on the closeSE
instead of storing it temporarily on the gCE. It should not then be a big
overhead for the gCE to periodically retry the globus-url-copy from the closeSE to
the WMS. And anyway, this overhead would only occur when there is a problem
copying to the WMS, but it would avoid wasting WN resources.
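
A very rough sketch of how such a periodic retry on the gCE could look (the host
names, paths, retry count and timings below are only illustrative assumptions,
not actual gLite configuration):

  # Hypothetical stageout retry on the gCE: the WN has already copied the
  # output sandbox to the closeSE, and the gCE periodically tries to
  # forward it to the WMS with globus-url-copy.
  SRC="gsiftp://close-se.example.org/vo/osb/job123_output.tar.gz"
  DST="gsiftp://rb108.example.org/var/glite/SandboxDir/job123/output.tar.gz"

  for attempt in 1 2 3 4 5 6; do
      globus-url-copy "$SRC" "$DST" && exit 0   # transfer succeeded, stop retrying
      sleep 3600                                # wait an hour before the next attempt
  done
  exit 1                                        # still failing after ~6 hours

This way the retries cost the gCE almost nothing while the WN slot is already freed.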

> whereas until now the CE already
> can get overloaded bringing simple WMS job wrappers into the batch system!
>
> > > Note that it is OK to retry a few hours for an output sandbox:
> > > the WMS could be temporarily unreachable e.g. b/c of maintenance
> > > of some network component.
> >  But due to this the whole (or a significant part of the) batch sub-cluster will
> > be wasted.
>
> It should not happen often, and the job gives up after about 5 hours.

 Maybe:
__file_tx_retry_count=6 # will be set from an environment variable
could be set to 2 for SAM tests only.
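
Just to illustrate the idea, that retry count could drive a loop in the job
wrapper along these lines (the loop structure, sleep time and the OSB variable
names are my assumptions, not the actual JobWrapper code):

  # __file_tx_retry_count comes from the environment: e.g. 6 normally, 2 for SAM tests
  __file_tx_retry_count=${__file_tx_retry_count:-6}

  attempt=1
  # $OSB_LOCAL and $OSB_WMS_URL are hypothetical placeholders for the local
  # sandbox archive and its gsiftp destination on the WMS.
  until globus-url-copy "$OSB_LOCAL" "$OSB_WMS_URL"; do
      [ "$attempt" -ge "$__file_tx_retry_count" ] && break   # give up after N attempts
      attempt=$((attempt + 1))
      sleep 3000   # ~50 minutes between attempts: 6 attempts span roughly 5 hours
  done

With a lower count for SAM tests the WN would be released much sooner when the
WMS is unreachable, without changing the behaviour for normal jobs.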

 Alex