Hi Storage Experts,
In the absence of any well-known procedure for burning in or stress
testing file servers, I thought I would try a naive approach and see
what happened. Now I have problems but don't know how they have arisen
or whether I am simply making unreasonable demands on the system.
My naive test procedure simply copies a lot of bytes from /dev/zero
onto multiple filesystems on our new RAID servers. On each RAID I
create one 60 TB partition, make it into an LVM physical volume, create
a volume group on top of that, and then divide it into six or so
logical volumes, creating an XFS filesystem on each. Then I start
writing to these in parallel as follows:
dd if=/dev/zero of=/mnt/data/temp1/testfile bs=1M &
dd if=/dev/zero of=/mnt/data/temp2/testfile bs=1M &
etc.
This does not make any allowance for possible file-size limits, but I
would have hoped at least for a graceful exit with a helpful error
message. Instead, one of the servers has stopped writing to the disks
and displays an impressive variety of errors in /var/log/messages,
starting with:
Mar 7 08:23:58 nfs2 kernel: aacraid: Host adapter abort request (0,0,1,0)
Mar 7 08:23:58 nfs2 kernel: aacraid: Host adapter abort request (0,0,1,0)
Mar 7 08:24:56 nfs2 last message repeated 188 times
Mar 7 08:24:56 nfs2 kernel: aacraid: Host adapter reset request. SCSI hang ?
Mar 7 08:24:56 nfs2 kernel: sd 0:0:1:0: SCSI error: return code = 0x08000002
Mar 7 08:24:56 nfs2 kernel: sdb: Current: sense key: Hardware Error
Mar 7 08:24:56 nfs2 kernel: Add. Sense: Internal target failure
Mar 7 08:24:56 nfs2 kernel:
Mar 7 08:24:56 nfs2 kernel: end_request: I/O error, dev sdb, sector 53707122737
Mar 7 08:24:56 nfs2 kernel: I/O error in filesystem ("dm-6") meta-data dev dm-6 block 0x28001a68f ("xlog_iodone") error 5 buf count 2048
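As a rough sanity check on where the failed request landed, the sector
number from the end_request line can be turned into a byte offset. This
assumes the kernel reports that number in 512-byte sectors, which I have
not verified for this setup:

```shell
# Assumption: the sector in the end_request line is in 512-byte units.
sector=53707122737
offset_bytes=$((sector * 512))

echo "byte offset into sdb: $offset_bytes"
# Integer division, so this rounds down to whole terabytes (10^12 bytes).
echo "approx. TB into sdb:  $((offset_bytes / 1000000000000))"
```

If that assumption holds and sdb is the 60 TB array, the failing sector
is somewhere in the middle of the device rather than at its very end.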
This is a SuperMicro server, running SL5, with an Adaptec RAID controller.
Any suggestions? My inclination is to try reconfiguring the RAID from
scratch and designing a test procedure that limits file sizes to, say,
1 TB, but if this is indicative of a real underlying problem then maybe
someone here can say so. One of the messages does say "Hardware Error",
but how conclusive is this?
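A size-limited version of the test might look something like the
following. This is only a sketch: run_bounded_test is a made-up helper
name, the temp1..tempN mount points are assumed to follow the paths used
above, and it relies on GNU dd for bs=1M and conv=fsync.

```shell
#!/bin/sh
# Hypothetical helper: write a fixed amount of zeros to each filesystem
# in parallel and report whether any of the dd writers failed.
#   $1 = directory holding the mount points (e.g. /mnt/data)
#   $2 = number of filesystems, assumed mounted as temp1..tempN
#   $3 = number of 1 MiB blocks per file (1048576 blocks = 1 TiB)
run_bounded_test() {
    root=$1
    nvols=$2
    blocks=$3
    pids=""
    failed=0

    i=1
    while [ "$i" -le "$nvols" ]; do
        # count= bounds the file size; conv=fsync makes dd flush the
        # file before exiting, so a write error is more likely to show
        # up in dd's exit status.
        dd if=/dev/zero of="$root/temp$i/testfile" bs=1M \
           count="$blocks" conv=fsync &
        pids="$pids $!"
        i=$((i + 1))
    done

    # Wait for each writer separately so that one failure is not hidden
    # by the others succeeding.
    for pid in $pids; do
        if ! wait "$pid"; then
            echo "dd writer (pid $pid) failed" >&2
            failed=1
        fi
    done
    return "$failed"
}

# Example (not run here): six filesystems, 1 TiB each.
# run_bounded_test /mnt/data 6 1048576
```

Checking each exit status at least gives a clear pass/fail per run,
which the open-ended dd commands could not.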
Cheers,
Ben
--
Dr Ben Waugh Tel. +44 (0)20 7679 7223
Dept of Physics and Astronomy Internal: 37223
University College London
London WC1E 6BT