>> unless you're expecting very small files...
> Small files on the software area?
Indeed, I think that the NFS wire-block-size is a complicated issue, and that smaller is better than larger unless special cases apply.
In effect a larger block size is a read-ahead, and whether that is worthwhile depends a lot on access patterns (the sizes of the files involved) and on its cost in both latency and throughput. There are significant downsides to larger read-aheads over the wire:
* A size of 1024 fits well within a standard Ethernet frame;
  one of 4096 fits well within 3 frames; and 8192 fits well
  within a jumbo frame. All 3 match fairly well the memory
  allocation sizes within the kernel (1024 less so). None of
  this matters if there are no frame losses, no memory
  pressure, etc.
* Receiving a block is synchronous, that is the application
  can only read the first byte of the block once the last byte
  has been received. Especially with multi-threaded access
  patterns this can increase latency. A 32KiB block takes
  0.3-0.4ms at 1Gb/s, which may not matter much, but larger
  block sizes have proportionally longer latencies (around
  10ms for the maximum 1MiB block size); see the arithmetic
  after this list.
* Block sizes larger than a frame involve multiple frames and
  potentially retransmissions; this applies even to LANs where
  traffic patterns are significantly star-shaped, and
  switch/router buffer congestion causes losses with huge hits
  to latency.
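As a sanity check on the latency numbers above (rough figures, ignoring RPC and framing overheads):

  32KiB = 262,144 bits;   262,144b / 1Gb/s ≈ 0.26ms, or 0.3-0.4ms with overheads
  1MiB  = 8,388,608 bits; 8,388,608b / 1Gb/s ≈ 8.4ms, or roughly 10ms with overheads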
For disks a relatively large read-ahead is much less of a problem, as time-to-transmit and memory size don't matter that much (with PCIe, SATA and HyperTransport/QPI).
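For comparison, disk-side read-ahead on Linux can be tuned independently of any NFS block size, e.g. with 'blockdev' (a sketch; the device name is hypothetical, and the unit is 512-byte sectors):

# blockdev --getra /dev/sda
# blockdev --setra 4096 /dev/sda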
I think large wire-block-sizes optimize for the single-threaded, purely-sequential, large-file case, but are not good in general.
On balance I think that 4096 and 8192 are still good choices. I'd choose 4096 for standard Ethernet frames and 8192 for jumbo frames; the latter may be a bigger overall optimization than the specific NFS block size.
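In mount-option terms that would be something like this (server name and mount point hypothetical):

# mount -t nfs -o rsize=4096,wsize=4096 server:/export /mnt/sw
# mount -t nfs -o rsize=8192,wsize=8192 server:/export /mnt/sw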
BTW I just checked read/write purely-sequential transfer rates with various wire-block-sizes (on somewhat old hw and quiet systems and network) and I got (fairly repeatably), with a 'dd' block size of 1M (and using 'direct' and other precautions):
size (bytes):   1024   4096   8192  16384  32768  262144
read (MB/s):    21.1   60.9   74.6   83.2   91.1    99.5
write (MB/s):   11.6   29.6   37.0   46.0   48.9    49.4
Larger block sizes do improve the best-case sequential rates; but while I wouldn't use 1024, I would still use 4096 or 8192 in most cases, except where large sequential IO is going to dominate.
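The test would have been done with commands along these lines (a sketch, not the exact invocations; paths and server name hypothetical; 'direct' bypasses the client page cache so the wire-block-size actually shows in the numbers):

# mount -t nfs -o rsize=32768,wsize=32768 server:/export /mnt/test
# dd if=/dev/zero of=/mnt/test/t bs=1M count=1024 oflag=direct conv=fsync
# dd if=/mnt/test/t of=/dev/null bs=1M iflag=direct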
It looks like there is no read-ahead or write-behind (adaptive or not) in the Linux NFS client, and the block size entirely substitutes for them. The write rates are limited by the 'sync' option and the poor implementation of writing in the Linux NFS client:
http://www.sabi.co.uk/blog/0707jul.html#070701b
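For reference, 'sync' vs. 'async' is set per export on the server side (there is also a client-side 'sync' mount flag); a typical /etc/exports line would look like this (client range hypothetical):

/export  192.168.1.0/24(rw,sync,no_subtree_check)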
> Clearly never going to happen.
I imagine some ":-)" after this... "Just for fun" (while waiting for disk copies and backups to happen) I have had a look at a sw area here (I tend to worry more than most about full-'fsck' and backup-restore times) and the numbers are:
# ls /export/experimental-software/
atlas cdf dteam gridpp mice pheno scotgrid totalep
biomed cms dzero ilc ngs phenosgm supernemo zeus
camont compchem enmr lhcb ops planck supernemo.vo.eu-egee.org
# df -T -i /export/experimental-software/.
Filesystem Type Inodes IUsed IFree IUse% Mounted on
/dev/mapper/vg01-experimental--software
xfs 480197744 11849252 468348492 3% /export/experimental-software
# df -T -BG /export/experimental-software/.
Filesystem Type 1G-blocks Used Available Use% Mounted on
/dev/mapper/vg01-experimental--software
xfs 500G 389G 112G 78% /export/experimental-software
# find /export/experimental-software -type f | wc -l
9339746
# find /export/experimental-software -type f -size -8k | wc -l
7520502
That's ~12m inodes, of which ~9.3m are files (plus ~2.5m directories), holding ~390G, or around 44KB per file on average; but ~7.5m of those files are less than 8KiB. Which actually means that a 32768 NFS block size will rarely matter, as the file is almost always smaller than the block size.
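(The per-file average is just simple division, e.g. with 'bc':

# echo '389 * 1024^3 / 9339746' | bc
44721

that is ~44KB per file; dividing by all ~11.8m allocated inodes instead gives ~35KB.)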
Uhm, this might take many hours (probably days) to 'fsck' or to restore from backups. Not a nice prospect. I guess that a site would be essentially unavailable until the sw collection is back...
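Since the filesystem is XFS, the check would actually be 'xfs_repair'; to get a feel for the time, a no-modify run (with the filesystem unmounted; device name as in the transcript above) would be:

# umount /export/experimental-software
# time xfs_repair -n /dev/mapper/vg01-experimental--software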