Hi all,
While reviewing my notes, I found an action dating back a couple of
weeks to document a process for using CE downtimes. The requirements are:
define/edit a maintenance period,
allow CE to drain prior to maintenance,
stop monitoring from happening during maintenance,
stop unwanted jobs from arriving during maintenance,
allow wanted jobs (tests) to arrive and
terminate maintenance period when maintenance is over.
The GOCDB provides functionality for much of this, but does not stop all
the unwanted jobs, nor does it provide a way to deal with tests of the
(new) CE. There are additional methods at our disposal; for example, we can:
set the BDII output to indicate that the CE is draining,
disable the CE from accepting jobs and
take the CE out of the site's BDII transmissions.
In the current version of CREAM, the first two (disable the CE and set
it to draining) are combined in one tool, glite-ce-disable-submissions,
although it may be possible to decouple them. There are a few ways of
combining these methods to realise the requirements to a greater or
lesser extent. Below are a few pros and cons of the most obvious
approaches. These remarks relate to the system as it is now, not what it
perhaps should be. They are presented in order of “least control” to
“most control”.
It would be good to discuss these proposals at an operations meeting.
Unfortunately, I'm off tomorrow for a long-standing appointment. Perhaps
someone could start the discussion and let me know the outcome.
Many thanks,
Steve
:------------------- APPROACH TRADE-OFFS ------------------------
:------------------------------------------
Option 1: Just put the downtime in the GOCDB
Pros: Very easy to do.
Cons: The monitoring system and (I believe) some submission frameworks
heed the GOCDB downtimes, but (AFAIK) the WMSs pay no heed to them. Thus
jobs continue to be transmitted to non-operational CEs, with chaotic
results. There is no way to deal with your test jobs.
:------------------------------------------
Option 2: Put downtime in GOCDB and use glite-ce-disable-submissions to
disable CE and set it to draining.
Pros: Still easy. The monitoring system and all submission methods heed
the downtimes and/or the glite-ce-disable-submissions command.
Cons: When the CE comes back up after a build, there could be a race
condition unless special measures are taken to make it come up in a
“glite-ce-disable-submissions” state. The race condition could cause the
CE to toggle on/off as testing proceeds, with chaotic results. There is
still no way to deal with your test jobs, i.e. allow test jobs in while
rejecting all others.
:------------------------------------------
Option 3: Put downtime in GOCDB and take the CE out of the site BDII
transmissions (optionally, use
“glite-ce-disable-submissions/enable-submissions” for finer granularity).
Pros: Full control. The monitoring system and all submission methods
heed the downtimes and/or removal of the BDII data. The race condition
is totally eliminated – it doesn't matter whether
“glite-ce-enable/disable-submissions” commands are issued during the
outage. Tests can be conducted (perhaps using
“glite-ce-disable-submissions/enable-submissions”) while the CE is out
of the BDII transmissions, because the WMS can be commanded to send the
test jobs to your specific CE, side-stepping the site's BDII information.
Cons: Needs vi
:------------------------------------------
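As an aid to the discussion, the trade-offs above can be restated as a small lookup. This is only a sketch: the property names and the helper least_control_option are made up for illustration, and each option lists only the properties the notes above explicitly claim for it (as far as I know), not anything verified against a live CE.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class Option:
    name: str
    claimed: FrozenSet[str]  # properties the notes above claim for this option


# Hypothetical property labels, invented for this sketch.
MONITORING = "monitoring_heeds_downtime"
NO_UNWANTED = "unwanted_jobs_stopped"
TESTS_OK = "test_jobs_admitted"
NO_RACE = "no_restart_race"

# Ordered from "least control" to "most control", as in the notes.
OPTIONS = [
    Option("Option 1: GOCDB downtime only",
           frozenset({MONITORING})),
    Option("Option 2: GOCDB downtime + glite-ce-disable-submissions",
           frozenset({MONITORING, NO_UNWANTED})),
    Option("Option 3: GOCDB downtime + CE removed from site BDII",
           frozenset({MONITORING, NO_UNWANTED, TESTS_OK, NO_RACE})),
]


def least_control_option(required: Iterable[str]) -> Optional[Option]:
    """Return the first (least-control) option claiming every required property."""
    needed = frozenset(required)
    for option in OPTIONS:
        if needed <= option.claimed:
            return option
    return None  # none of the three options claims all of these


if __name__ == "__main__":
    choice = least_control_option({NO_UNWANTED, TESTS_OK})
    print(choice.name if choice else "no option fits")
```

Asking for both "unwanted jobs stopped" and "test jobs admitted" selects Option 3, which matches the reasoning above: only removing the CE from the BDII transmissions leaves a way to send tests directly to the CE.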
--
Steve Jones
System Administrator office: 220
High Energy Physics Division tel (int): 42334
Oliver Lodge Laboratory tel (ext): +44 (0)151 794 2334
University of Liverpool http://www.liv.ac.uk/physics/hep/