Hi everyone,
Dimitris Zilaskos wrote:
> I have forced one of the jobs to rerun and placed it on hold, leaving
> one CPU free for other tasks. Is there any better solution?
This is not a great solution for us (LHCb) -- although I've cc'ed the
experts who may correct me. Our jobs are not atomic and self-contained.
They interact with a job database and once they start we expect them
to finish, or to let the job database know they have failed and are
aborting. We have a "sweeper" process which looks for stale jobs which
have mysteriously died, but (I believe) this then creates work for
someone to manually restart/resubmit that job with the given parameters.
The jobs should time out themselves so they can report that they have
failed, which will start an auto-resubmission. I understand this is not
desirable from a site's point of view as it means nodes may sit idle
while they wait for a blocked transfer to complete (network outage,
power outage, overloaded node, node reboot, node crash, etc.), or to
terminate due to timeout.
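To make the self-timeout idea concrete, here is a minimal Python sketch.
It is not our actual job wrapper: run_step_with_timeout and the
report_failure callback (standing in for whatever tells the job database
about the failure) are hypothetical names, and I am assuming the blocking
step can be run as a child process.

```python
import subprocess


def run_step_with_timeout(cmd, timeout_s, report_failure):
    """Run one job step as a child process with a deadline.

    report_failure is a hypothetical callback standing in for the call
    that tells the job database the job has failed.
    Returns True if the step finished cleanly, False otherwise.
    """
    try:
        # subprocess.run kills the child and raises TimeoutExpired
        # once timeout_s seconds have passed.
        subprocess.run(cmd, timeout=timeout_s, check=True)
    except subprocess.TimeoutExpired:
        report_failure(f"step timed out after {timeout_s}s")
        return False
    except subprocess.CalledProcessError as err:
        report_failure(f"step exited with status {err.returncode}")
        return False
    return True
```

The point of the sketch is only that the job itself notices the stall
and reports it, rather than leaving it to the sweeper to find later.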
I suppose it would be nice if jobs could get the equivalent of (or
exactly) a TERM signal and then be given a minute or two to tidy up as
best they can and exit on their own; if they do not, a KILL could be
issued.
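A rough sketch of what the job side of this could look like, assuming a
Python wrapper running on a POSIX node. The names run_job,
TerminationRequested and report_failure are hypothetical, and
report_failure again stands in for the call that updates the job
database.

```python
import signal


class TerminationRequested(Exception):
    """Raised inside the job when a TERM signal arrives."""


def _on_term(signum, frame):
    # Turn the signal into an exception so the job can unwind normally.
    raise TerminationRequested()


def run_job(work, report_failure):
    """Run work(); if TERM arrives, report the failure before exiting.

    Must be called from the main thread, since Python only allows
    signal handlers to be installed there.
    Returns 0 on success and 1 if the job was asked to terminate.
    """
    previous = signal.signal(signal.SIGTERM, _on_term)
    try:
        work()
        return 0
    except TerminationRequested:
        # Tidy up within the grace period: here, just report the failure.
        report_failure("terminated by site")
        return 1
    finally:
        # Restore whatever handler was installed before.
        signal.signal(signal.SIGTERM, previous)
```

On the site side, the matching behaviour would be to send TERM, wait for
the grace period, and only then send KILL if the process is still there.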
Ian
--
Ian Stokes-Rees
Particle Physics, Oxford http://grid.physics.ox.ac.uk/~stokes