On Thu, 3 May 2007 [log in to unmask] wrote:

> Hi Alex,
>
> > On Thu, 3 May 2007 [log in to unmask] wrote:
> >
> > > On Thu, 3 May 2007, Alexander Piavka wrote:
> > >
> > > > I think that if OutputSandbox can't be staged out back to WMS on the first
> > > > try, the gCE should take care of this instead of JobWrapper on WN.
> > >
> > > That would be an extra complication to work around a problem that
> > > should not be present in the first place. We need to understand
> > > why globus-url-copy is hanging and fix that.
> > But globus-url-copy is not hanging, it just fails to connect to WMS or
> > something similar. There is no way to avoid that. And when this happens
> > with a widely used WMS, a huge amount of CPU time is wasted, so after all
> > it is probably worth the extra complications to gain more scalability. As
> > for the current situation: looking at SEE ROC gCE SAM tests, it looks like
> > all sites are currently having such a problem, since they all have a mix of
> > failed and successful SAM tests.
>
> We have started looking into those failures. We noticed that they
> all refer to rb108, whereas half of the SAM glite-CE tests use rb118.
> We do not know the cause yet, but something started going wrong on Sunday.
> Maybe we should not take this behavior as "likely to happen often,
> so therefore we need a staging service on the glite-CE".
> Such a service is non-trivial to write and has non-trivial consequences:
> suddenly the CE becomes a data server,

The output sandbox can be stored temporarily on the closeSE instead of
on the gCE. It should then not be a big overhead for the gCE to
periodically try to globus-url-copy from the closeSE to the WMS. And anyway
this overload would happen only when there is a problem copying to the WMS,
but it would avoid wasting WN resources.

> whereas until now the CE already
> can get overloaded bringing simple WMS job wrappers into the batch system!
>
> > > Note that it is OK to retry a few hours for an output sandbox:
> > > the WMS could be temporarily unreachable e.g. b/c of maintenance
> > > of some network component.
> > But due to this the whole (or a significant part of the) batch sub-cluster will
> > be wasted.
>
> It should not happen often, and the job gives up after about 5 hours.

Maybe:
 __file_tx_retry_count=6 # will be set from an environment variable
could be set to 2 for SAM tests only.

 Alex
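
The retry-count suggestion above could look roughly like the sketch below. It is only an illustration under stated assumptions: the environment variable name FILE_TX_RETRY_COUNT, the helper retry_transfer and the RETRY_SLEEP pause are hypothetical and are not taken from the actual JobWrapper, whose real retry and back-off logic may differ.

```shell
#!/bin/sh
# Hypothetical sketch: FILE_TX_RETRY_COUNT, RETRY_SLEEP and retry_transfer
# are illustrative names, not taken from the real JobWrapper.

# Default to 6 attempts unless the environment overrides it
# (e.g. FILE_TX_RETRY_COUNT=2 exported for SAM test jobs only).
__file_tx_retry_count=${FILE_TX_RETRY_COUNT:-6}

# Run a transfer command until it succeeds or the attempts run out.
# Usage: retry_transfer <command> [args...]
retry_transfer() {
    attempt=1
    while [ "$attempt" -le "$__file_tx_retry_count" ]; do
        # Success on any attempt ends the loop immediately.
        if "$@"; then
            return 0
        fi
        echo "transfer attempt $attempt of $__file_tx_retry_count failed" >&2
        # Pause between attempts (assumed fixed delay), but not after the last one.
        if [ "$attempt" -lt "$__file_tx_retry_count" ]; then
            sleep "${RETRY_SLEEP:-60}"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}
```

With something like this, a SAM job would export FILE_TX_RETRY_COUNT=2 before the wrapper starts and give up after two failed attempts, while ordinary jobs would keep the default of 6. The transfer command itself (for example a globus-url-copy call with its source and destination URLs) is passed as the arguments to retry_transfer.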