On Mon, Sep 15, 2008 at 05:06:09PM +0100, Santanu Das wrote:
> Hi Steve,
> >Deleting the jobs is okay and then when the support requests come let them
> >know what happened but its best if you contact the users in question
> >and try and find out
> >why they are not respecting your GlueCEStateStatus first.
> >
> That's what I did but that forces the site to reschedule the work plan,
> which sometimes brings other inconveniences in. I put out "down" and
> "Closed" last night for a day, assuming the site will be free in the
> morning, which did not happen. Now, we are actually
>
> 1. down for nothing for the entire day,
> 2. wasn't able to do the thing I was supposed to do today
> 3. plus I need to log another down time
> 4. and reschedule my to do list.
>
In addition to draining the queues (i.e., running "set queue <queue>
enabled = False" in Torque's qmgr) well in advance, we also place a
reservation in Maui (check the "setres" command) so that no job will be
running on the worker nodes at the time the downtime is scheduled.
The downside is that this is almost too effective: since all jobs come
from the WMS/RB with no time requirements, they inherit the default
time limit of the queue. With 72-hour queues, as in the YAIM default,
if you schedule a downtime for Monday morning, no new jobs will be
started from Friday morning onward... (We have a separate queue for
"ops" with a 1-hour time limit, which ensures SAM jobs can run until
the start of the downtime.)
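
To make the arithmetic concrete, here is a rough sketch in Python of when
each queue stops starting jobs. It only assumes that the scheduler will not
start a job whose walltime limit overlaps the reservation; the function
name and the Monday 09:00 downtime are made up for illustration.

```python
from datetime import datetime, timedelta


def last_safe_start(downtime_start: datetime, walltime_limit_hours: float) -> datetime:
    """Latest time a job can start and still fit before the reservation.

    Assumes a job is only started if its full walltime limit ends
    before the reservation begins (illustrative model, not Maui code).
    """
    return downtime_start - timedelta(hours=walltime_limit_hours)


# Hypothetical downtime: Monday 22 Sep 2008 at 09:00.
downtime = datetime(2008, 9, 22, 9, 0)

# Default 72-hour queue: nothing new starts after Friday 09:00.
default_queue_cutoff = last_safe_start(downtime, 72)

# 1-hour "ops" queue: SAM jobs can still start until Monday 08:00.
ops_queue_cutoff = last_safe_start(downtime, 1)

print(default_queue_cutoff.isoformat())  # 2008-09-19T09:00:00
print(ops_queue_cutoff.isoformat())      # 2008-09-22T08:00:00
```

So a short walltime limit on a queue is what keeps it usable right up to
the reservation.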
Regards,
Riccardo
--
Riccardo Murri
CSCS - Swiss National Supercomputing Centre
http://cscs.ch/
tel: +41 91 610 8204
fax: +41 91 610 8282