Ian Stokes-Rees wrote:
>
>
> This is not a great solution for us (LHCb) -- although I've cc'ed the
> experts who may correct me. Our jobs are not atomic and self-contained.
> They interact with a job database and once they start we expect them
> to finish, or to let the job database know they have failed and are
> aborting. We have a "sweeper" process which looks for stale jobs which
> have mysteriously died, but (I believe) this then creates work for
> someone to manually restart/resubmit that job with the given parameters.
>
> The jobs should time out themselves so they can report that they have
> failed which will start an auto-resubmission. I understand this is not
> desirable from a site's point of view as it means nodes may sit idle
> while they wait for a blocked transfer to complete (network outage,
> power outage, overloaded node, node reboot, node crash, etc.), or to
> terminate due to timeout.
>
> I suppose it would be nice if jobs could get the equivalent of (or
> exactly) a TERM signal and then be given a minute or two to tidy up as
> best possible and exit themselves, and if not a KILL could be issued.
>
I plan to let the rest of the jobs time out by themselves. So should I
rather use qsig -s SIGTERM on the one job that I forced to rerun?
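
To make the "TERM first, KILL only if needed" idea above concrete, here is
a minimal sketch in Python of what the site side could look like. It is
only an illustration under assumptions: the function name stop_with_grace
and the grace_seconds parameter are made up for this example, it works on a
local child process rather than a batch job, and it is POSIX-only. A real
batch system would deliver the signals itself (e.g. via qsig).

```python
import signal
import subprocess
import sys


def stop_with_grace(proc, grace_seconds=120.0):
    """Ask a process to stop with SIGTERM; escalate to SIGKILL on timeout.

    Hypothetical helper for illustration only.
    Returns "TERM" if the process exited within the grace period,
    or "KILL" if it had to be killed.
    """
    # Polite request: the job can catch this, report its failure, and exit.
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace_seconds)
        return "TERM"
    except subprocess.TimeoutExpired:
        # Grace period is over: SIGKILL cannot be caught or ignored.
        proc.kill()
        proc.wait()
        return "KILL"


if __name__ == "__main__":
    # Stand-in for a job that tidies up on TERM (here it just exits with 3).
    job_code = (
        "import signal, sys, time\n"
        "signal.signal(signal.SIGTERM, lambda s, f: sys.exit(3))\n"
        "print('ready', flush=True)\n"
        "time.sleep(60)\n"
    )
    job = subprocess.Popen([sys.executable, "-c", job_code],
                           stdout=subprocess.PIPE)
    job.stdout.readline()  # wait until the handler is installed
    print(stop_with_grace(job, grace_seconds=10), job.returncode)
```

The point of the grace period is that the job gets a chance to tell the job
database it is aborting, which is what triggers the auto-resubmission
instead of leaving a stale entry for the sweeper.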
Best regards,
--
============================================================================
Dimitris Zilaskos
Department of Physics @ Aristotle University of Thessaloniki, Greece
PGP key : http://tassadar.physics.auth.gr/~dzila/pgp_public_key.asc
http://egnatia.ee.auth.gr/~dzila/pgp_public_key.asc
MD5sum : de2bd8f73d545f0e4caf3096894ad83f pgp_public_key.asc
============================================================================