Hi Winnie,

There's a file on the CEs called cluster.state and, at least for our setup (i.e. with SGE), setting this to Draining will do the job. At the same time, you need to declare a downtime for the CEs in the GOCDB.

After you have done both (and the downtime has started), you kill all jobs in the queue. The LHC experiments will automatically resubmit elsewhere; for ILC you might want to consider asking.

Then just wait until the rest of the jobs finish, and you are home free.

As for nagios testing of the site-bdii, that's a bit tricky: by construction there's only one site bdii, so a spare one doesn't quite make sense. Plus, at least some of the security tests only run once or twice a day, so you might be waiting for a long time. Just stick it in and hope for the best :-D

Cheers,
Daniela


On 9 May 2014 14:05, Winnie Lacesso <[log in to unmask]> wrote:
Happy Friday!

Seeking more advice.

We have 2 CREAM-CEs that need to drain of jobs & be emi-3 SL6 rebuilt
(yes, late, sorry sorry)

Both have hundreds of jobs queued on them in long/med queues.

1. So first is to disable long/med submission & see if the queued jobs
will run, finish, thus drain the CE of long/med jobs (hopefully) "fast
enough".

Is disabling long/med job submission advertised in the GOC-DB by changing
CE status from PRODUCTION to, erm, NOT? (appears to be YES/NO only!)

Or, yaim-conf/services/glite-creamce on each contains
CREAM_CE_STATE=Production
Without having to rerun yaim, that value appears to be in
/etc/glite-ce-glue2/glite-ce-glue2.conf
and /var/lib/bdii/gip/ldif/static-file-CE.ldif (once per queue).
Happy to change these by hand, the files are not dynamically generated.

So, is a valid value "DRAINING"? - think so, I recall lcg-rollout Jan 2014
mentioned this:
> Hi Lukasz,
> I see that the queues are publishing the following value:
> GlueCEStateStatus: Draining
Question is, will CMS & LHCb job submission frameworks automatically note
that & not try, or will they try, complain / ticket us, till it's pointed
out... (maybe should contact them...)
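
For anyone editing the static LDIF by hand as described above, here is a minimal sketch of the substitution, assuming the file carries one "GlueCEStateStatus: <value>" line per queue. The helper name set_ce_state_status and the sample text are hypothetical and not part of CREAM, yaim, or the BDII tooling. It only touches the GLUE 1 attribute quoted above, not the GLUE 2 configuration file.

```python
import re
from typing import Tuple

# Hypothetical helper, not part of any grid middleware.
# Matches a whole "GlueCEStateStatus: <value>" line; group 1 keeps the
# attribute name and its original spacing.
_STATUS_LINE = re.compile(r"^(GlueCEStateStatus:[ \t]*)\S+[ \t]*$", re.MULTILINE)


def set_ce_state_status(ldif_text: str, new_status: str = "Draining") -> Tuple[str, int]:
    """Return (updated_text, number_of_status_lines_changed)."""
    return _STATUS_LINE.subn(lambda m: m.group(1) + new_status, ldif_text)


if __name__ == "__main__":
    # Assumed sample content; a real static-file-CE.ldif has many more attributes.
    sample = (
        "GlueCEName: long\n"
        "GlueCEStateStatus: Production\n"
        "GlueCEName: medium\n"
        "GlueCEStateStatus: Production\n"
    )
    updated, changed = set_ce_state_status(sample)
    print(updated, end="")
    print("status lines changed:", changed)
```

The function works on text rather than on the file itself, so the result can be reviewed (and the original backed up) before anything under /var/lib/bdii is overwritten.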


2. It may take too long to allow the queued jobs to finish (want to get
the upgrade done ASAP). Is it better / acceptable to contact the VOs with
many queued jobs (ILC, CMS & LHCb) & ask them to cancel the hundreds of
queued jobs?

Always grateful for advice!

Winnie Lacesso / Linux & Solaris Systems Administrator
HH Wills Physics Laboratory, Tyndall Avenue, Bristol, BS8 1TL, UK
University of Bristol



--
Sent from the pit of despair

-----------------------------------------------------------
[log in to unmask]
HEP Group/Physics Dep
Imperial College
London, SW7 2BW
Tel: +44-(0)20-75947810
http://www.hep.ph.ic.ac.uk/~dbauer/