Hi,
On the large end-user (500+ node) desktop systems I've looked after, we
developed two separate sets of tools:
* a set of maintenance scripts that get executed on system boot and
periodically via cron
* a remote control tool for immediate ad-hoc changes to sets of
machines.
The maintenance scripts are responsible for taking a box which has (at
least) a minimal bootstrap image (i.e. a filesystem, init, glibc, perl,
and other basics) and configuring it for use. This includes package
installation/upgrade, user account creation, kerberos configuration,
service configuration etc. Configuration data is stored in a central
PostgreSQL database or on-disk files exported via NFS.
Package management used to be handled by a locally developed tool, but
this was replaced with apt-rpm when it became available/stable. We maintain
a local package repository and add updates when we've tested them and
are satisfied they work.
The maint scripts, whilst responsible for most of the original
installation, are also used to maintain the current state of
already-installed machines -- checking that local passwd and group files
are up-to-date, installing any available package updates, etc.
The maint scripts can be as simple or as complicated as you like.
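
To make the "install and maintain with the same scripts" idea concrete,
here is a minimal Python sketch. It assumes each maintenance task is a
pair of functions, a check and a fix, so that running the whole set on a
freshly bootstrapped box or on an up-to-date one is equally safe. The
names (Step, run_maintenance) and the toy account data are hypothetical
and are not taken from the scripts described above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    is_satisfied: Callable[[], bool]  # True if the machine already matches the desired state
    apply: Callable[[], None]         # brings the machine into the desired state


def run_maintenance(steps: List[Step]) -> List[str]:
    """Apply only the steps whose check fails; return the names of steps that changed something."""
    changed = []
    for step in steps:
        if not step.is_satisfied():
            step.apply()
            changed.append(step.name)
    return changed


# Toy stand-in for a host: the set of accounts present locally (hypothetical data).
local_accounts = {"root", "alice"}
wanted_accounts = {"root", "alice", "bob"}

steps = [
    Step(
        name="accounts",
        is_satisfied=lambda: wanted_accounts <= local_accounts,
        apply=lambda: local_accounts.update(wanted_accounts),
    ),
]

print(run_maintenance(steps))  # first run: the "accounts" step fires
print(run_maintenance(steps))  # second run: nothing left to do
```

Because every step checks before it acts, the same run can be triggered
at boot, from cron, or on demand without special-casing new machines.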
Our remote control tool is also useful, but fills a different niche --
its function is to execute any arbitrary command on any arbitrary subset
of hosts quickly and securely via SSH. User authentication is automated
using Kerberos (even as root), but public/private keypairs should also
work nicely. My tool invokes commands across hosts in parallel, which
is useful for large numbers of hosts.
(The script I wrote to do this is available online under the GPL; see
http://www.doc.ic.ac.uk/~dwm/Code/auto-checkout/remote--main/remote.pl)
This tool is very useful when you need to make a single operational
change quickly; for example, if:
* The air-con's failed in the undergrad lab and you need to shut down all
the even-numbered machines to reduce the temperature. (`remote
'lab.*[02468]$'`)
* There's an urgent security fix that needs to be rolled out, but the
maint scripts won't fire until tomorrow. (`remote . maint --scr=apt`)
* You want to find out where a user is physically sitting because
they're late for an exam (`remote . "ps auxf > ~/tmp/`hostname`"`)
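
For anyone wanting to build something similar, the select-then-fan-out
pattern behind the examples above can be sketched roughly as follows in
Python. This is not remote.pl: the function names, the in-memory host
list, the use of an unanchored regex search for matching, and the worker
count are all assumptions made for illustration. It also assumes ssh can
authenticate without prompting (Kerberos tickets or keys), which is why
BatchMode is set.

```python
import re
import subprocess
from concurrent.futures import ThreadPoolExecutor


def select_hosts(pattern, hosts):
    """Return the hosts whose name matches the regex (unanchored search)."""
    rx = re.compile(pattern)
    return [h for h in hosts if rx.search(h)]


def ssh_command(host, command):
    """Build the ssh argv; BatchMode makes ssh fail instead of prompting."""
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10", host, command]


def run_on_host(host, command, runner=subprocess.run):
    """Run one command on one host; return (host, exit status, stdout)."""
    result = runner(ssh_command(host, command), capture_output=True, text=True)
    return host, result.returncode, result.stdout


def run_everywhere(pattern, command, hosts, max_workers=32, runner=subprocess.run):
    """Run the command on every matching host in parallel, one thread per connection."""
    targets = select_hosts(pattern, hosts)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps results in the same order as the target list.
        return list(pool.map(lambda h: run_on_host(h, command, runner), targets))


if __name__ == "__main__":
    # Hypothetical host names; a real tool would read these from a host database.
    all_hosts = ["lab01", "lab02", "lab03", "lab04", "server02"]
    print(select_hosts(r"lab.*[02468]$", all_hosts))
```

Threads are enough here because each worker spends its time waiting on
an ssh child process; the runner parameter exists only so the fan-out
logic can be exercised without touching the network.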
In addition to all of this, they're also looking at live monitoring
tools, watchdogs, that are responsible for continuous local environment
checking. Things like monitoring the CPU temperature, checking that NFS
mounts are intact and valid, etc. But this is just an idea in the works
at the moment.
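
As an illustration of the NFS part of such a watchdog, here is a small
Python sketch. It assumes a Linux host where the mount table is readable
from /proc/mounts, and the list of expected mount points is hypothetical.
It only detects mounts that are missing altogether; spotting a stale or
hung mount would need an extra probe (for example a stat with a timeout),
which is not shown.

```python
def nfs_mounts(mounts_text):
    """Parse /proc/mounts-style text; return the set of NFS mount points."""
    found = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields are: device, mount point, filesystem type, options, dump, pass.
        if len(fields) >= 3 and fields[2] in ("nfs", "nfs4"):
            found.add(fields[1])
    return found


def missing_mounts(expected, mounts_text):
    """Return the expected mount points that are not currently NFS-mounted."""
    return sorted(set(expected) - nfs_mounts(mounts_text))


def check_host(expected, mounts_path="/proc/mounts"):
    """Read the live mount table and report which expected mounts are absent."""
    with open(mounts_path) as fh:
        return missing_mounts(expected, fh.read())


if __name__ == "__main__":
    # Hypothetical expectation list; a real watchdog would load this from config.
    try:
        print(check_host(["/home", "/vol/packages"]))
    except OSError as exc:
        print("cannot read mount table:", exc)
```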
I don't look after our current production farms, but I understand our
admin uses similar ideas -- boot from network server by default,
standard install scripts that set up/refresh the local environment,
remote update of userlists/gridmapfiles via SSH, etc.
Hope this helps..
Cheers,
David
--
David McBride
Department of Computing, Imperial College, London