Hi,
This is somewhat related to the question of software updates, but it
deals with more dramatic updates, such as those requiring a reboot or
perhaps even a full re-install.
Once we get good at installing LCG software (or as it becomes easier to
install) and also manage to reduce the number of "head" nodes required
for an LCG site then perhaps the following system would work well:
1. Have two sets of head nodes. These would also be configured as
worker nodes. One set is "active", the other is "secondary".
2. Initially the site uses just the "active" head nodes, and the
"secondary" ones run as normal worker nodes.
3. When an update requires a reboot or some dramatic change which makes
old WNs incompatible with the new system, take down the "secondary" head
nodes (once they finish their active jobs), and install the new software
on them.
4. Now, as WNs finish their active jobs, they can be taken offline,
re-installed, and re-booted, this time under the control of the
"secondary" head nodes.
5. In this way the cluster would "swing" from one set of head nodes on
the old software to a new set of head nodes with the new software.
6. It might be possible to deal "cleanly" with job queues. Ideally, the
queues could be transferred to the new system (though in practice I have
no idea how easy or hard this is). Otherwise, if the queue wasn't too
deep, it would be possible to close the queue on the old site and
allocate a fraction of the overall resources to remain on the old system
in order to drain the queue. The remaining nodes could start to move
across to the new system as soon as they finish their active jobs. The
nodes held back on the old system could then move across as soon as the
"flush queue" was empty and they had finished their active jobs.
7. When this is done, the "active" and "secondary" head nodes can swap
roles, and the new "secondary" head nodes can have their software updated
(both head node and worker node) and then come online as worker nodes.
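To make steps 3 to 5 a bit more concrete, here is a minimal Python sketch
of the "swing" as a toy simulation. It is only an illustration under
made-up assumptions: the WorkerNode class, the state names, the idea that
a re-install takes exactly one tick, and the job lengths are all
hypothetical and have nothing to do with real LCG tooling.

```python
from dataclasses import dataclass


@dataclass
class WorkerNode:
    name: str
    busy_ticks: int       # ticks until the node's active job finishes
    state: str = "old"    # "old", "reinstalling", or "new"


def swing_step(nodes):
    """Advance the cluster by one tick of the swing migration."""
    for node in nodes:
        if node.state == "reinstalling":
            # Re-install finished: reboot under the new head nodes.
            node.state = "new"
        elif node.state == "old":
            if node.busy_ticks > 0:
                node.busy_ticks -= 1          # keep running the active job
            else:
                node.state = "reinstalling"   # idle, so take it offline


def swing(nodes):
    """Tick until every node is on the new system; return per-tick counts."""
    history = []
    while any(n.state != "new" for n in nodes):
        swing_step(nodes)
        history.append({s: sum(n.state == s for n in nodes)
                        for s in ("old", "reinstalling", "new")})
    return history


# Hypothetical cluster: four WNs whose active jobs end at different times.
nodes = [WorkerNode(f"wn{i}", busy_ticks=i) for i in range(4)]
history = swing(nodes)
for tick, counts in enumerate(history, start=1):
    print(tick, counts)
```

With staggered job lengths like these, only one node is offline at any
given tick, while the rest keep working on either the old or the new
system.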
I would have thought that all "physical" sites could actually publish
two "logical" sites, and whichever is the "secondary" site could simply
advertise that it is not accepting jobs (unless it was in the process of
"swinging" and was ramping up with new WNs and therefore accepting
jobs). This would also provide some redundancy if the main head nodes went down.
Anyway, just my thoughts on how to keep resources up and running. It
doesn't seem very nice to have to entirely flush a queue, have the whole
site down for a day (or likely more) while the software update takes
place, and perhaps only then discover problems and further delays. That
approach means "best case" site utilisation would look something like
(get ready for the ASCII art):
-------------             -----------------
             \           /
              \         /
               \_______/
instead of:
-------------   -------------------------
             \ /
              X
_____________/ \_________________________
and that isn't taking into account the effect of problems with the
deployment and configuration of the new software which the proposed
mechanism would (probably) isolate to at most a handful of nodes -- the
first to be installed.
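As a back-of-the-envelope comparison of the two pictures, the following
Python sketch estimates the node-hours lost under each approach. It is a
toy model only: the function names and all the numbers are made up for
illustration, it assumes nodes drain and ramp up at a uniform rate, and
it ignores both the downtime of the head nodes themselves and any nodes
held back to drain the old queue.

```python
def lost_full_shutdown(n_nodes, drain_h, down_h, ramp_h):
    """Node-hours lost when the whole site drains, goes down, then ramps up.

    Assumption: nodes go idle (and later come back) at a uniform rate, so
    on average half the cluster is idle during the drain and the ramp-up.
    """
    return n_nodes * (drain_h / 2 + down_h + ramp_h / 2)


def lost_swing(n_nodes, reinstall_h):
    """Node-hours lost when each node is offline only for its own re-install."""
    return n_nodes * reinstall_h


# Made-up numbers, purely for illustration.
full_loss = lost_full_shutdown(n_nodes=100, drain_h=12, down_h=24, ramp_h=4)
swing_loss = lost_swing(n_nodes=100, reinstall_h=1)
print(f"full shutdown: {full_loss:.0f} node-hours lost")
print(f"swing:         {swing_loss:.0f} node-hours lost")
```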
Anyway, for all I know, people are already doing that.
Cheers,
Ian
--
Ian Stokes-Rees
Particle Physics, Oxford http://www-pnp.physics.ox.ac.uk/~stokes