The stageout issues take me back to the early days of LEP when CSF was
developing (yes, OK, I am old). Nodes started another job when the
previous one entered its output phase and was copying output across the
network. I don't think WMS can do much to push another job. This issue
is bigger than just staging the sandbox back to WMS: jobs often need to
send their data elsewhere, and not everyone has async solutions for this.
The options I see are:-
a) local batch system - LSF can start more jobs when CPU load drops.
Obviously a risk if jobs stall at the start when staging in. What can
other batch systems do?
b) Pilot jobs - obviously they can know enough to start another job at
the appropriate time but launching payloads other than serially
introduces opportunities for interference and difficulties in cleaning
up.
John
> -----Original Message-----
> From: Testbed Support for GridPP member institutes [mailto:TB-
> [log in to unmask]] On Behalf Of Graeme Stewart
> Sent: 19 October 2008 11:01
> To: [log in to unmask]
> Subject: Re: [Fwd: Jobs idling on transfers..]
>
> On Sun, Oct 19, 2008 at 11:04 AM, Coles, J (Jeremy)
> <[log in to unmask]> wrote:
> > Hi Graeme
> >
> >>> Which VO are the jobs running under?
> >
> >>Unless I'm mistaken Kostas has pulled out code from the RB/WMS job
> >>epilogue wrapper. So the VO is not really relevant.
> >
> > I think it is relevant from a user education standpoint, rather than
> > simply one of catching inefficient jobs at the batch system.
>
> No it's not. If it's user education that would be teaching them "don't
> use the WMS, it's rubbish and it can't get your job outputs back to
> you..."
>
> :-)
>
> Graeme
>
> --
> Dr Graeme Stewart http://www.physics.gla.ac.uk/~graeme/
> Department of Physics and Astronomy, University of Glasgow, Scotland