Hi Maarten,

On Thu, 3 May 2007 [log in to unmask] wrote:

> On Thu, 3 May 2007, Alexander Piavka wrote:
>
> > I think that if OutputSandbox can't be stageout back to WMS on the first
> > try, the gCE should take care of this instead of JobWrapper on WN.
>
> That would be an extra complication to work around a problem that
> should not be present in the first place. We need to understand
> why globus-url-copy is hanging and fix that.

But globus-url-copy is not hanging, it just fails to connect to the
WMS, or something similar. There is no way to avoid that. And when this
happens with a widely used WMS, a huge amount of CPU time is wasted, so
after all it is probably worth the extra complication to gain more
scalability. As for the current situation: looking at the SEE ROC gCE
SAM tests, it looks like all sites currently have this problem, since
they all show a mix of failed and successful SAM tests.

>
> Note that it is OK to retry a few hours for an output sandbox:
> the WMS could be temporarily unreachable e.g. b/c of maintenance
> of some network component.

But because of this, the whole batch sub-cluster (or a significant part
of it) will be wasted. If the gCE takes care of the stageout (only in
cases when the WN failed to do so on the first try), no WN resources
will be wasted. Otherwise, some workaround should be implemented in the
pbs/maui configuration to allow more job submissions to WNs with less
than $ideal_load.

>
> > If InputSandbox can't be stageout on first and maybe second try, then
> > the job would be just aborted, since it has not yet started to make any
> > computations, so it won't be much pity to abort such job.
>
> Correct. Please open a bug in Savannah.

Alex