>> I am just looking around to find what is the commonly used VM
>> Hypervisor in GridPP group (i.e Xen, KVM or etc)
> KVM without a shadow of a doubt. It works, it's easy, and it's
> in SL5 as standard issue. You'd need a good positive reason to
> go for anything else these days.
I agree with this. I would however add that while KVM is good,
in other contexts I have had good experiences with Xen with a
paravirtualized kernel, as paravirtualization can have reduced
overheads compared to full virtualization (even with AMD or
Intel hardware virtualization assist).
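As a quick check, whether a host CPU offers the hardware assist
at all can be seen from the standard Linux cpuinfo flags (a
trivial sketch):

  # vmx = Intel VT-x, svm = AMD-V; no match means no HW assist,
  # which makes Xen paravirtualization even more attractive
  egrep 'vmx|svm' /proc/cpuinfo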
Many WLCG host types are network and disk intensive, while VMs
work better for memory- and CPU-oriented workloads that do not
involve device virtualization. Xen also allows moving running
VM images between hosts, which may be useful (it was designed
to implement the XenoServers "cloud").
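For example, with the classic Xen toolstack moving a running
guest is a one-liner (a sketch; the guest and target host names
are made up, and the target must have xend relocation enabled):

  # live-migrate the running guest 'cream01' to 'xenhost02'
  xm migrate --live cream01 xenhost02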
As to other full VM systems, a pet peeve is that having had to
deal with a system based on VMware Server 2.0/GSX I found that
I really dislike it (buggy, big limitations, high overheads, no
longer maintained much). There was little else available when
it was installed, though. I had better previous experiences
with VMware ESX, but I find the VMware system-maintenance
infrastructure annoying to use (I preferred editing ".vmx"
files directly), but then I don't particularly like GUIs.
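To illustrate, editing a ".vmx" by hand amounts to maintaining
a handful of key = value lines like these (an illustrative
fragment, not a complete config; names and sizes are made up):

  displayName = "ce02"
  memsize = "2048"                      # guest RAM in MiB
  guestOS = "rhel5"
  scsi0:0.fileName = "ce02-disk0.vmdk"  # backing disk image
  ethernet0.present = "TRUE"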
As to VM overheads with VMware GSX, here for example is a year
of CPU overhead *inside* a GSX VM for an LCG CE (the host has
additional overheads on top of that):
http://ganglia.dur.scotgrid.ac.uk/ganglia/graph.php?g=cpu_report&z=large&c=Grid%20Servers&h=ce02.dur.scotgrid.ac.uk&m=&r=year&s=descending&hc=4&st=1316641054
However, given the relatively small number of hosts and host
types in a T2, I would strongly prefer for a new setup to just
buy a number of smaller, low power-draw real machines. Much,
much simpler to deal with, less buggy, and with far lower
overheads; and nearly all WLCG host types are far from CPU
intensive (and most are not even RAM bound). But then I am very
skeptical as to the usefulness of VM setups in general (while
they can be very useful in special cases), so perhaps this is
just my prejudice.
While for a WLCG site the middleware people heavily discourage
running multiple host types on the same physical host, the
number of host types is not really that huge, and there are
hardware products that put 2 or even 4 real machines in a 1U
pizzabox (a form factor driven by webhosting).
An alternative that I considered was switching to something
else like http://linux-vserver.org/ which virtualizes or
partitions user space into contexts/containers/zones (for those
unfamiliar: a kind of generalized 'chroot'). It has nugatory or
even negative overheads, far fewer opportunities for bugs, and
seems to map very well onto WLCG host type issues, and I have
had very good experiences with it in the past (it even supports
running many different distributions on the same host, as long
as they can share the running kernel).
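For the curious, with the util-vserver tools creating and
running a guest context is roughly this cheap (a sketch; the
guest name, distribution and addresses are made up):

  # build a new guest context from packages, then start it
  vserver cream01 build -m yum --hostname cream01.example.org \
      --interface eth0:192.0.2.10/24 -- -d centos5
  vserver cream01 start
  vserver cream01 enter   # get a shell inside the context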
>> Does any use any tool to manage these VM [ ... ]
> We use the libvirt/virsh/Virtual Machine Manager tools. [ ... ]
> definitely don't want to be starting kvm itself on the
> commandline manually - the libvirt stuff is the right level of
> indirection.
I think that VMs are really simple things, and one needs
management tools only when there are very many of them. When
dealing with half a dozen VMs it seemed easier to just edit the
VM configuration files by hand and start and stop the VM
processes manually too.
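Either way the mechanics are small; with libvirt the whole
lifecycle is roughly this (a sketch with a made-up guest name):

  # register a guest from its XML description, start it, check
  virsh define cream01.xml
  virsh start cream01
  virsh list --all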
>> For all this stuff, planning to use a machine with spec
>> Intel(R) Xeon(R) E5345 @ 2.33GHz, 8 cores, 16GB RAM. Just
>> wondering how many VMs can be set up on this machine
>> considering the instances we may run on it:
>> - two EMI CREAM
>> - sbdii
>> - apel
>> - argus
Sounds reasonable, and I had similar hosts running pretty
comfortably 4 VMs each. Some host types like the SBDII and APEL
are really small (the SBDII is one LDAP daemon serving a few
KiB of data in total, APEL is one Java log summarizer running
once a day). So I ended up running the SBDII on the UI, as that
seems one of the combinations for which I can't see any problem
in running two host types on the same host, and I was tempted
to do the same for APEL. I would guess that running Torque on
the same physical host as the CE might work well too, and I was
very tempted to do that, as in general the host types that do
not depend much on middleware libraries should be able to share
a host. To be confirmed :-).
> [ ... ] CREAM CEs, in our experience, will want about 6Gb of
> RAM each, so two of those, plus say 1GB for each of the others,
> totals 15GB, and leaves you a little over for the host OS.
That seems reasonable to me, but I found to my surprise that the
CREAM CE was lighter than the LCG one, and 2GiB seemed adequate:
http://ganglia.dur.scotgrid.ac.uk/ganglia/graph.php?g=mem_report&z=large&c=Grid%20Servers&h=cream02.dur.scotgrid.ac.uk&m=&r=year&s=descending&hc=4&st=1316641054
But like the LCG CE it seems to need restarting periodically
because its memory usage climbs with time (the same goes for a
few other daemon-based services, and I would restart servers
every 1-3 months "just in case" :->).
> [ ... ] pay some attention to the speed of the disks and the
> amount of random IO they can handle. A locally attached array
> of 15k SAS disks (i.e. a Dell R510 disk server) is one approach,
That's a very good point, and R510s are nice. One configuration
that I liked and that seemed good value was 2x 15k SAS plus 2x,
4x or 6x 10k SAS (or 10k/7.2k "enterprise SATA" with ERC). BTW
I much prefer using Linux MD to hardware RAID for several
reasons, among them the ability to move disks to boxes with
different hardware IO cards; also, I have had terrible
experiences with some 3ware cards, and other people have had
them with many other types of RAID card (as previously
mentioned on this list).
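The portability point is quite concrete: MD keeps its metadata
on the member disks themselves, so after moving them to a box
with a different controller reassembly is just (a sketch):

  # scan all disks for MD superblocks and reassemble arrays,
  # whatever controller the disks are now hanging off
  mdadm --examine --scan >> /etc/mdadm.conf
  mdadm --assemble --scan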
I had to deal with a setup with VM images on NFS, with several
virtual disks allocated as growable, and I found that it was
(very) painful on the non-grid side. On the grid side it worked
better, but it was still something that I would not have done.
One of the major issues was backups: with backups running,
that is tree-walking (rsync) inside the VMs, the peak load
(especially IOPS) climbs much higher than average, and the VM
overheads (rsync networking and rsync reading heavily) can be
huge.
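One mitigation I found useful is to rate-limit the backup
itself so its peaks do not dominate (a sketch with made-up
paths; --bwlimit is in KiB/s):

  # throttled backup from inside the VM, capping the IO and
  # network peaks at the cost of a longer backup window
  rsync -aH --bwlimit=10000 /var/ backuphost:/backups/ce02/var/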
> we have our VMs storage on an old 14 drive supermicro disk
> server, and that seems able to cope too (it's not running the
> VMs though, so all its memory is disk cache).
If that is a SAN with virtual disks allocated as chunks of the
SAN it seems viable to me; if it is a NAS (NFS) it seems a lot
less of a good idea.
If one has to use NFS, a workaround I liked is to put the
relevant subtree on NFS and mount it inside the VM, instead of
putting it inside a virtual disk image and accessing the
virtual image over NFS. This allows backing up the tree without
going through the VM overhead, and network VM overheads are
often less expensive than virtual disk ones (and there are
other reasons).
So where possible I had small virtual disk images (4GiB, so
relatively quick to back up or duplicate as a whole, after
quiescing the VM) containing just the OS, with all data mounted
via NFS, *even from the same host* (that is, the NFS server was
the VM host itself, and the traffic went over 'lo'). Not
optimal, but better, as the VM's virtual disk accesses are then
almost only syslogging.
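Concretely, that setup is just an export on the VM host plus a
mount inside the guest (a sketch; all names, paths and
addresses are made up):

  # on the VM host, in /etc/exports: export data to the guests
  /srv/vmdata  192.168.122.0/24(rw,sync,no_root_squash)

  # in the guest's /etc/fstab: mount it from the host over the
  # virtual network (which on the host side is effectively 'lo')
  vmhost:/srv/vmdata  /data  nfs  defaults,hard,intr  0 0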
> typical small server setup of two basic SATA disks in a RAID
> mirror though, you won't have enough IO capacity to go round,
> particularly for the CREAM CEs.
In my experience that actually sort of worked, choosing nice
disks and a nice compact layout, but in some cases it was only
just sufficient.
Also, on a host with a few GiB of RAM the whole dataset ends
up residing in memory. After all a CE may have a load
characterized by a job turnover rate of a few per second, and
manage perhaps a thousand jobs, so the total (active) data
needed to represent them should fit in a few GiB, and writes
should not be that frequent.
But there are indeed sources of disk arm contention, like OS
logging and, critically, the backups mentioned above, so a nice
RAID10 of 4 disks looks better to me than a RAID1 of 2 disks.
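Creating such an array with MD is a one-liner (a sketch, the
device names are made up):

  # 4-disk RAID10: roughly twice the random-IO capacity of a
  # 2-disk RAID1, which helps with seek-heavy loads like backups
  mdadm --create /dev/md0 --level=10 --raid-devices=4 \
      /dev/sd[abcd]1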
Even the DPM SE seems to require relatively small memory and
disk footprints (but then this site is not doing analysis):
http://ganglia.dur.scotgrid.ac.uk/ganglia/graph.php?g=mem_report&z=large&c=Grid%20Servers&h=se01.dur.scotgrid.ac.uk&m=&r=year&s=descending&hc=4&st=1316641624
but then while a big analysis site may have lots more files
registered in the DPM, the number of metadata queries against
the SE is really proportional to the number of jobs (and I
think that even analysis jobs open very few files), not to the
total number of files stored.