Hi,
I am using the torque/maui combination. If a poweroff is imminent, can I
apply the following procedure?
[say we have only one queue (thequeue), and one running job in it (with job
ID equal to JOBID)]
CE:
qdisable thequeue
mjobctl -R JOBID
All nodes:
restart (including poweroff)
CE:
qenable thequeue
Will the job JOBID survive and start execution again from the beginning, and
will the user be able to fetch the output using edg-job-get-output once the
job is completed?
I would not like to try this on (still) alive jobs :))
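To make the CE steps above concrete, here is a minimal dry-run sketch of them as a shell script. It only prints the commands unless DRY_RUN=0 is set. The commands themselves (qdisable, mjobctl -R, qenable) are the ones listed above; the helper names (run, before_poweroff, after_poweron) and the QUEUE and DRY_RUN variables are made up for illustration and are not part of torque or maui. Whether the requeued job really restarts cleanly is still the open question.

```shell
#!/bin/sh
# Dry-run sketch of the CE-side steps. Nothing is executed unless DRY_RUN=0.
QUEUE="${QUEUE:-thequeue}"   # example queue name from the message
DRY_RUN="${DRY_RUN:-1}"      # hypothetical switch: 1 = only print commands

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# CE, before the poweroff: disable the queue so it stops accepting
# new jobs, then apply mjobctl -R to the running job (ID passed as $1).
before_poweroff() {
    run qdisable "$QUEUE"
    run mjobctl -R "$1"
}

# CE, after all nodes have been restarted: enable the queue again.
after_poweron() {
    run qenable "$QUEUE"
}
```

Usage would be "before_poweroff JOBID" on the CE, then the restart of all nodes, then "after_poweron".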
Thanks, Antun
-----
E-mail: [log in to unmask]
Web: http://scl.phy.bg.ac.yu/
Phone: +381 11 3160260, Ext. 152
Fax: +381 11 3162190
Scientific Computing Laboratory
Institute of Physics, Belgrade
Serbia and Montenegro
-----
---------- Original Message -----------
From: Tim Bell <[log in to unmask]>
To: [log in to unmask]
Sent: Tue, 27 Sep 2005 21:32:14 +0200
Subject: Re: [LCG-ROLLOUT] Saving jobs while running?
> It will depend on your local batch system.
>
> LSF has a command brequeue which allows a command which has recently
> completed to be re-submitted as a new job.
>
> There are some potential issues with jobs which perform persistent
> updates such as databases since they may be restarted after
> completing the database update but without completing the CPU
> processing. Some jobs should be run once and once only.
>
> The approach we've used at CERN for potentially dangerous
> maintenance windows is:
>
> - stop new job execution so during the maintenance no new jobs
>   will be started
> - suspend running jobs so that any services which are affected
>   will not get new requests
> - requeue any failed jobs
>
> Some batch systems provide a checkpointing option. However, this is
> very difficult to use for most programs which are not pure 'compute'
> applications since they require linking with special libraries and
> not using some sensitive system calls.
>
> Tim
>
> Antun Balaz wrote:
>
> >Hi,
> >our site AEGIS01-PHY-SCL is again up, after UPS upgrade.
> >
> >Thinking about UPSes, a curious question: is there some way to save already
> >running jobs if I know that a power outage is coming? By saving I mean
> >either to:
> >
> >1) save current jobs so that they can be restarted after power is on
> >again, or
> >
> >2) save what is needed so that the jobs can be started from the beginning
> >(same parameters, same JobID etc.), without user intervention.
> >
> >Regards, Antun
------- End of Original Message -------