Dimitris Zilaskos wrote:
>> I suppose it would be nice if jobs could get the equivalent of (or
>> exactly) a TERM signal and then be given a minute or two to tidy up as
>> best possible and exit themselves, and if not a KILL could be issued.
>
> I plan to let the rest of the jobs time out by themselves. So should I
> rather use qsig -s SIGTERM on the one job that I forced to rerun?
In general, I am sure it is best to continue as you are now -- stop and
resubmit the job. For "most people" that is probably the right thing to
do. But I think LHCb still makes up about 90% of all LCG jobs, and from
our perspective it is *not* the best thing to do: our sweeper process
will probably see only the total elapsed time, made up of:
  - The job started and ran for 18 hours.
  - The job stalled for 12 hours.
  - The job was resubmitted and restarted from scratch to run another
    18 hours.
In total, this process will take 48 hours, and the sweeper *may* abort
the LCG job after 24 hours if it has not completed by then, not knowing
that it has been resubmitted and restarted. There are also all the
issues with proxy certificate expiry.
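To make the arithmetic concrete, here is a rough sketch in Python. The
durations are the ones from the example above; the assumption that the
sweeper compares nothing but total elapsed time against a fixed 24 hour
limit is a simplification, and the function name is made up for
illustration, not taken from our actual code.

```python
from datetime import timedelta

# Durations from the example above.
first_run = timedelta(hours=18)
stalled = timedelta(hours=12)
rerun = timedelta(hours=18)

# Assumption: the sweeper only looks at total elapsed time and has no
# idea the site resubmitted the job.
SWEEPER_LIMIT = timedelta(hours=24)


def sweeper_would_abort(elapsed, limit=SWEEPER_LIMIT):
    """Hypothetical check: True once elapsed time exceeds the limit."""
    return elapsed > limit


elapsed_at_restart = first_run + stalled
total = elapsed_at_restart + rerun

print("hours elapsed when the rerun starts:",
      elapsed_at_restart.total_seconds() / 3600)
print("hours elapsed when the rerun finishes:",
      total.total_seconds() / 3600)
print("sweeper may abort before the rerun even starts:",
      sweeper_would_abort(elapsed_at_restart))
```

Under that assumption the job is already past the limit by the time the
rerun begins, so the second 18 hours may well be wasted.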
We automatically submit and re-submit jobs to LCG, so just aborting any
problematic LHCb jobs is better than a site resubmitting them. "Best"
is just to let them time out by themselves. Aborting means extra work
for us, and aborted jobs don't feed any output back, which makes it
impossible for us to see why a job stalled (this is a big pet peeve of
ours for LCG -- aborted jobs should still send back the output sandbox,
or as much of it as is available, for debugging purposes).
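For what it is worth, the "TERM, then a grace period, then KILL" idea
quoted at the top looks roughly like this for an ordinary local process
on a POSIX system. This is only a sketch using Python's subprocess
module; it is not how the batch system delivers signals, and the
function name and the grace period are illustrative.

```python
import signal
import subprocess
import sys


def terminate_gracefully(proc, grace_seconds):
    """Send SIGTERM, wait up to grace_seconds, then SIGKILL if needed."""
    proc.send_signal(signal.SIGTERM)
    try:
        # Give the job a chance to tidy up and exit on its own.
        proc.wait(timeout=grace_seconds)
        return "exited after TERM"
    except subprocess.TimeoutExpired:
        # The job ignored TERM or is stuck: force it.
        proc.kill()
        proc.wait()
        return "killed"


if __name__ == "__main__":
    # Stand-in for a stalled job: a child that just sleeps.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    print(terminate_gracefully(child, grace_seconds=5))
```

A job that installs its own TERM handler could use the grace period to
write out whatever partial output it has before exiting.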
HTH,
Ian.
--
Ian Stokes-Rees
Particle Physics, Oxford http://www-pnp.physics.ox.ac.uk/~stokes