On Wed, Jan 05, 2005 at 02:42:15PM -0000 or thereabouts, Ian Stokes-Rees wrote:
> Hi everyone,
>
> Dimitris Zilaskos wrote:
> > I have forced one of the jobs to rerun and placed it on hold, leaving
> > one CPU free for other tasks. Is there any better solution?
>
> This is not a great solution for us (LHCb) -- although I've cc'ed the
> experts who may correct me. Our jobs are not atomic and self-contained.
> They interact with a job database and once they start we expect them
> to finish, or to let the job database know they have failed and are
> aborting. We have a "sweeper" process which looks for stale jobs that
> have mysteriously died, but (I believe) this then creates work for
> someone to manually restart/resubmit each such job with its original
> parameters.
>
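[For illustration: a minimal sketch of such a sweeper, assuming a
hypothetical jobs table with job_id, state and last_heartbeat columns --
the real LHCb job database schema is not shown in this thread.]

    import sqlite3
    import time

    STALE_AFTER = 6 * 3600   # seconds without a heartbeat before a job counts as stale

    def sweep(db_path="jobs.db"):
        """Mark long-silent 'running' jobs as stale so an operator or an
        auto-resubmitter can pick them up."""
        conn = sqlite3.connect(db_path)
        cutoff = time.time() - STALE_AFTER
        cur = conn.execute(
            "UPDATE jobs SET state = 'stale' "
            "WHERE state = 'running' AND last_heartbeat < ?",
            (cutoff,))
        conn.commit()
        print("marked %d job(s) stale" % cur.rowcount)
        conn.close()

    if __name__ == "__main__":
        sweep()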
> The jobs should time out by themselves so they can report that they
> have failed, which will trigger an auto-resubmission. I understand this
> is not desirable from a site's point of view, as it means nodes may sit
> idle while they wait for a blocked transfer to complete (network
> outage, power outage, overloaded node, node reboot, node crash, etc.)
> or for the job to terminate on timeout.
Hi Ian,
And it is not desirable from your point of view either: your allocation
here is based on wall time, so you will be burning through it while jobs
sit waiting. At the moment this is a non-issue, since LHCb is still the
only group submitting jobs fast enough to fill the queue; you are
currently using 30 times your allocation at RAL anyway. If another group
can submit fast enough, you will be squashed down.
Writing a sweeper to replicate to CERN any files that are not already
there would seem a sensible thing to do?
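[For illustration: a rough sketch of that sweeper's core loop, where
list-replicas and replicate are hypothetical stand-ins for whatever
catalogue query and transfer commands the site actually uses.]

    import subprocess

    CERN_SE = "castorgrid.cern.ch"   # hypothetical name for the CERN storage element

    def replicas(lfn):
        """Return the storage elements currently holding lfn
        (via a hypothetical catalogue query command)."""
        out = subprocess.check_output(["list-replicas", lfn])
        return out.decode().split()

    def sweep_to_cern(lfns):
        """Copy to CERN every file that is not already there
        (via a hypothetical transfer command)."""
        for lfn in lfns:
            if CERN_SE not in replicas(lfn):
                subprocess.check_call(["replicate", lfn, CERN_SE])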
Steve
>
> I suppose it would be nice if jobs could receive the equivalent of (or
> exactly) a TERM signal, be given a minute or two to tidy up as best
> they can and exit on their own, and only then have a KILL issued if
> they have not.
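[For illustration: a minimal sketch of the job-side half of that, where
the report_failure stub stands in for whatever call the real job would
make to tell the job database it is aborting.]

    import signal
    import sys
    import time

    def report_failure(reason):
        # Stand-in for the real job-database update.
        sys.stderr.write("job failed: %s\n" % reason)

    def on_term(signum, frame):
        # Tidy up as best we can before the KILL arrives:
        # report the failure, then exit cleanly ourselves.
        report_failure("terminated by batch system")
        sys.exit(1)

    signal.signal(signal.SIGTERM, on_term)

    while True:           # stands in for the real payload
        time.sleep(10)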
>
> Ian
>
> --
> Ian Stokes-Rees [log in to unmask]
> Particle Physics, Oxford http://grid.physics.ox.ac.uk/~stokes
--
Steve Traylen
[log in to unmask]
http://www.gridpp.ac.uk/