It will depend on your local batch system.
LSF has a command, brequeue, which allows a job that has recently
completed to be re-submitted as a new job.
There are some potential issues with jobs which perform persistent
updates, such as to databases, since they may be restarted after
completing the database update but before completing the CPU
processing. Some jobs should be run once and once only.
The approach we've used at CERN for potentially dangerous maintenance
windows is:
- stop new job execution, so that no new jobs will be started during
the maintenance
- suspend running jobs, so that any services which are affected will
not get new requests
- requeue any failed jobs afterwards
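
As a rough illustration only, the three steps above could be scripted along the following lines for LSF. This is a sketch under assumptions, not the procedure we actually run: the helper names (plan, run) are made up for this example, the LSF command arguments shown (badmin qinact/qact, bstop, bresume, brequeue -e) should be checked against your local LSF version, and the list of failed job IDs is assumed to come from your own bookkeeping. By default nothing is executed; the commands are only printed.

```python
import shlex
import subprocess


def plan(phase, failed_job_ids=()):
    """Return the list of LSF commands (as argv lists) for one phase.

    phase is "before" or "after" the maintenance window. The command
    arguments are assumptions to verify against your LSF installation.
    """
    if phase == "before":
        return [
            ["badmin", "qinact", "all"],   # stop dispatching new jobs
            ["bstop", "-u", "all", "0"],   # suspend jobs of all users
        ]
    if phase == "after":
        commands = [["bresume", "-u", "all", "0"]]  # resume suspended jobs
        # Requeue only the jobs known to have failed (exited).
        commands += [["brequeue", "-e", str(j)] for j in failed_job_ids]
        commands.append(["badmin", "qact", "all"])  # allow dispatch again
        return commands
    raise ValueError("phase must be 'before' or 'after'")


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run(plan("before"))                            # dry run, prints only
    run(plan("after", failed_job_ids=[101, 102]))  # hypothetical job IDs
```

Jobs which must run once and once only should be left out of the list of job IDs passed to the requeue step.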
Some batch systems provide a checkpointing option. However, this is
very difficult to use for most programs which are not pure 'compute'
applications, since it requires linking with special libraries and
avoiding some sensitive system calls.
Tim
Antun Balaz wrote:
>Hi,
>our site AEGIS01-PHY-SCL is again up, after UPS upgrade.
>
>Thinking about UPSes, curious question: is there some way to save already
>running jobs if I know that power outage is coming? By saving I assume
>either to:
>
>1) save current jobs so that they can be restarted after power is on again,
>or
>
>2) save what is needed so that the jobs can be started from the beginning
>(same parameters, same JobID etc.), without user intervention.
>
>Regards, Antun
>
>-----
>Web: http://scl.phy.bg.ac.yu/
>
>Phone: +381 11 3160260, Ext. 152
>Fax: +381 11 3162190
>
>Scientific Computing Laboratory
>Institute of Physics, Belgrade
>Serbia and Montenegro
>-----
>
>