It will depend on your local batch system.

LSF has a command, brequeue, which allows a job which has recently 
completed to be re-submitted. 

There are some potential issues with jobs which perform persistent 
updates, such as to databases, since they may be restarted after 
completing the database update but before completing the CPU 
processing, in which case the update would be applied a second time. 
Some jobs should be run once and once only.
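
One way to make such a job safe to requeue is to record, in the same 
transaction as the update itself, that the update has been applied. 
The sketch below is only an illustration, not something taken from a 
real job: it uses SQLite from the Python standard library, and the 
table names (applied_jobs, totals), the function names and the job key 
are all hypothetical. The key would need to be something stable chosen 
at submission time.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with hypothetical demo tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE applied_jobs (job_key TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE totals (id INTEGER PRIMARY KEY, value INTEGER)")
    conn.execute("INSERT INTO totals VALUES (1, 0)")
    conn.commit()
    return conn


def apply_once(conn, job_key, delta):
    """Apply the update for job_key at most once; return True if applied."""
    try:
        # One transaction: commits if both statements succeed,
        # rolls back if either raises.
        with conn:
            # The primary key rejects a second insert of the same job_key.
            conn.execute(
                "INSERT INTO applied_jobs (job_key) VALUES (?)", (job_key,)
            )
            conn.execute(
                "UPDATE totals SET value = value + ? WHERE id = 1", (delta,)
            )
        return True
    except sqlite3.IntegrityError:
        # The update was already recorded, so a restarted job skips it.
        return False
```

A restarted job calling apply_once() with the same key gets False back 
and can go straight to the remaining CPU processing.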

The approach we've used at CERN for potentially dangerous maintenance 
windows is:

   - stop new job execution, so that no new jobs will be started 
     during the maintenance
   - suspend running jobs, so that any services which are affected 
     will not get new requests
   - requeue any failed jobs
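
For LSF, the three steps above might be scripted roughly as follows. 
This is a sketch under assumptions rather than the procedure actually 
used: apart from brequeue, the choice of commands (badmin qinact/qact, 
bstop, bresume) and their options is my mapping of the steps and should 
be checked against the local LSF version. The function names and the 
list of failed job IDs are hypothetical. By default it only prints the 
commands.

```python
import subprocess


def maintenance_plan(phase, failed_job_ids=()):
    """Return the LSF commands (assumed mapping) for one phase."""
    if phase == "before":
        return [
            ["badmin", "qinact", "all"],    # stop dispatching new jobs
            ["bstop", "-u", "all", "0"],    # suspend all running jobs
        ]
    if phase == "after":
        plan = [
            ["bresume", "-u", "all", "0"],  # resume the suspended jobs
            ["badmin", "qact", "all"],      # allow new jobs to start
        ]
        # Requeue each job the operator has identified as failed.
        plan += [["brequeue", "-e", str(j)] for j in failed_job_ids]
        return plan
    raise ValueError("phase must be 'before' or 'after'")


def run_plan(phase, failed_job_ids=(), dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    plan = maintenance_plan(phase, failed_job_ids)
    for cmd in plan:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
    return plan
```

Running run_plan("before") ahead of the window and 
run_plan("after", failed_job_ids=[...]) afterwards shows what would be 
executed; passing dry_run=False runs the commands as an LSF 
administrator.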

Some batch systems provide a checkpointing option.  However, this is 
very difficult to use for most programs which are not pure 'compute' 
applications, since it requires linking with special libraries and 
avoiding certain sensitive system calls.

Tim

Antun Balaz wrote:

>Hi,
>our site AEGIS01-PHY-SCL is again up, after UPS upgrade.
>
>Thinking about UPSes, curious question: is there some way to save already 
>running jobs if I know that power outage is coming? By saving I assume 
>either to:
>
>1) save current jobs so that they can be restarted after power is on again, 
>or 
>
>2) save what is needed so that the jobs can be started from the beginning 
>(same parameters, same JobID etc.), without user intervention.
>
>Regards, Antun
>
>-----
>E-mail: [log in to unmask]
>Web: http://scl.phy.bg.ac.yu/
>
>Phone: +381 11 3160260, Ext. 152
>Fax: +381 11 3162190
>
>Scientific Computing Laboratory
>Institute of Physics, Belgrade
>Serbia and Montenegro
>-----
>  
>