Ahh, there is nothing quite like a nice big cluster to bring any file
server to its knees.
My experience with cases like this is that the culprit is usually NFS,
or the disk file system being used on the RAID array. It is generally
NOT a bandwidth problem. Basically, I suspect the bottleneck is the
kernel on your file server trying to figure out where to put all these
blocks that are flying in from the cluster. One of the most expensive
things you can do to a file server is an NFS write. This is why the
"noatime" mount option exists: reading a file normally involves
updating the "access time" on that file, and that update is itself a
write operation. Alternatively, the disk file system itself (ext3?)
can also get bogged down by many simultaneous writes. XFS is supposed
to be less prone to this problem, but I have heard mixed reviews. In
addition, writing a large number of files simultaneously seems to be a
great way to fragment your file system. I don't know why this is so,
but I once used it as a protocol to create a small, heavily fragmented
file system for testing purposes! Once the file system is fragmented,
access to it just starts getting slow: no warnings, no CPU
utilization, just really, really long delays to get an "ls".
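If your mounts don't already use "noatime", you can try it out without
a reboot (the mount point here is just an example):

mount -o remount,noatime /home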
That is my guess. What I suggest doing is to come up with a series of
tests that "simulate" this event at various stages. That is, first use
the timeless classic unix "dd" command to generate a crunch of 2GB files
pseudo-simultaneously LOCALLY on the file server:
#!/bin/tcsh
set time=1
foreach file ( 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 )
    # 2GB per file, written in 1MB blocks; the "&" puts each dd in the
    # background so all 23 writes overlap, like jobs finishing at once
    dd if=/dev/zero bs=1M count=2048 of=/home/username/deleteme$file &
end
wait
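While the loop runs, it is worth watching the file server with
something like:

vmstat 5

If the "wa" (I/O wait) and "b" (blocked processes) columns go through
the roof while the CPU stays mostly idle, then the kernel is waiting
on the disks, and bandwidth is not your problem.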
If this hoses the home directories, then you know the cluster and
network have nothing to do with your problem. You can do a
quick-and-dirty benchmark of your RAID performance like this too.
Setting the "time" variable above makes tcsh print timing statistics,
and dd itself reports the bytes written and elapsed seconds when it
finishes. If you divide the total number of bytes by the total
wall-clock time, then you get the average write performance (for
example, 23 x 2GB = 46 GB written in 900 seconds works out to about
51 MB/s). It will be interesting to play with the number of files as
well as the division of the 2GB into blocks with the "bs" and "count"
options. Large block sizes will eat a lot of memory and small block
sizes will eat a lot of CPU. Setting your NFS block size to the
maximum (the "rsize" and "wsize" mount options in /etc/fstab) is
generally a good idea for scientific computing. Also, ordinary
Ethernet frames carry at most 1500 bytes of payload (the MTU). If
this packet size is a problem, then you might want to consider "jumbo
frames" on your cluster subnet.
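For example, a client-side /etc/fstab entry might look something like
this (the server and mount point names here are made up, and 32k is
typically the NFSv3 maximum on kernels of this vintage):

fileserver:/home  /home  nfs  rw,noatime,hard,intr,rsize=32768,wsize=32768  0 0

Jumbo frames are just a matter of raising the MTU (say, to 9000 bytes)
on every node, the file server, and the switch in between:

ifconfig eth0 mtu 9000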
However, I expect the best thing is to find a way to avoid a large
number of simultaneous NFS writes. Either use a different transfer
protocol (such as rcp or nc), or use some kind of "lock file" to prevent
your completing jobs from copying their files all at the same time.
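For example, each job's copy step could be wrapped in something like
this sketch (the lock directory and file names are made up; the trick
works because "mkdir" is atomic, even over NFS):

#!/bin/tcsh
# spin until we can grab the lock
while ( ! { mkdir /home/username/copy.lock >& /dev/null } )
    sleep 30
end
cp bigresult.tar.gz /home/username/
# release the lock so the next node can copy
rmdir /home/username/copy.lock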
HTH
-James Holton
MAD Scientist
Harry M. Greenblatt wrote:
> BS"D
>
> To those hardware oriented:
>
> We have a compute cluster with 23 nodes (dual socket, dual core
> Intel servers). Users run simulation jobs on the nodes from the head
> node. At the end of each simulation, a result file is compressed to
> 2GB, and copied to the file server for the cluster (not the head node)
> via NFS. Each node is connected via a Gigabit line to a switch. The
> file server has a 4-link aggregated Ethernet trunk (4Gb/s) to the
> switch. The file server also has two sockets, with Dual Core Xeon
> 2.1GHz CPUs and 4 GB of memory, running RH4. There are two RAID
> arrays (RAID 5), each consisting of 8x500GB SATA II WD server drives,
> with one file system on each. The RAID cards are AMCC 3ware 9550 and
> 9650SE (PCI-Express) with 256 MB of cache memory.
>
> When several (~10) jobs finish at once, and the nodes start copying
> the compressed file to the file server, the load on the file server
> gets very high (~10), and the users whose home directories are on the
> file server cannot work at their stations. Using nmon to locate the
> bottleneck, it appears that disk I/O is the problem. But the numbers
> being reported are a bit strange. It reports a throughput of only
> about 50MB/s, and claims the "disk" is 100% busy. These RAID cards
> should give throughput in the several hundred MB/s range, especially
> the 9650, which is rated at 600MB/s for RAID 6 writes (and we have RAID 5).
>
> 1) Is there a more friendly system load monitoring tool we can use?
>
> 2) The users may be able to stagger the output schedule of their
> jobs, but based on the numbers, we get the feeling the RAID arrays are
> not performing as they should. Any suggestions?
>
> Thanks
>
> Harry
>
>
> -------------------------------------------------------------------------
>
> Harry M. Greenblatt
>
> Staff Scientist
>
> Dept of Structural Biology [log in to unmask]
>
> Weizmann Institute of Science Phone: 972-8-934-3625
>
> Rehovot, 76100 Facsimile: 972-8-934-4159
>
> Israel
>
>
>