JISCMail - LCG-ROLLOUT Archives

 Hi Maarten,

On Thu, 3 May 2007, Maarten wrote:

> On Thu, 3 May 2007, Alexander Piavka wrote:
>
> >  I think that if the OutputSandbox can't be staged out back to the WMS on the
> > first try, the gCE should take care of this instead of the JobWrapper on the WN.
>
> That would be an extra complication to work around a problem that
> should not be present in the first place.  We need to understand
> why globus-url-copy is hanging and fix that.
 But globus-url-copy is not hanging, it just fails to connect to the WMS or
something similar. There is no way to avoid that. And when this happens
with a widely used WMS, a huge amount of CPU time is wasted, so after all
it is probably worth the extra complication to gain more scalability. As
for the current situation: looking at the SEE ROC gCE SAM tests, it looks like
all sites currently have this problem, since they all show a mix of
failed and successful SAM tests.

>
> Note that it is OK to retry a few hours for an output sandbox:
> the WMS could be temporarily unreachable e.g. b/c of maintenance
> of some network component.
 But due to this the whole (or a significant part of the) batch sub-cluster will
be wasted. If the gCE takes care of the stageout (only in
cases when the WN failed to do so on the first try), no WN resources will be
wasted. Otherwise some workaround should be implemented in the pbs/maui
configuration to allow more job submissions to WNs with less than $ideal_load.
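
 As a rough sketch of the behaviour I mean (Python used only for
illustration; run_transfer, stage_out and the hand_over_to_ce callback are
hypothetical names, not existing JobWrapper or gCE code): the WN tries the
transfer a bounded number of times, and if that fails it hands the sandbox
over and frees the slot instead of retrying for hours.

```python
import subprocess
from typing import Callable, Sequence


def run_transfer(cmd: Sequence[str], timeout_s: float) -> bool:
    """Run one transfer command (e.g. a globus-url-copy invocation).

    A timeout, a missing executable, or a non-zero exit code all count
    as a failed attempt.
    """
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        return False


def stage_out(try_transfer: Callable[[], bool],
              hand_over_to_ce: Callable[[], None],
              max_tries: int = 1) -> str:
    """Try the stageout on the WN at most max_tries times.

    If every attempt fails, call hand_over_to_ce() (hypothetical hook:
    the CE would keep retrying) and return at once so the WN slot is freed.
    """
    for _ in range(max_tries):
        if try_transfer():
            return "staged-out"
    hand_over_to_ce()
    return "handed-over"
```

 With max_tries=1 this matches the proposal above: one attempt on the WN,
then the CE takes over. How the CE would actually receive and retry the
sandbox is left open here.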

>
> > If the InputSandbox can't be staged in on the first and maybe second try, then
> > the job should just be aborted, since it has not yet started any
> > computation, so it is no great loss to abort such a job.
>
> Correct.  Please open a bug in Savannah.

 Alex