Bristol Storage hardware update:
Replaced the SE's SCSI controller, Adaptec to LSI, in Aug '08; no change:
even under fairly light load it still logs SCSI errors & remounts the
filesystem read-only.
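A quick way to spot the read-only remount before users do (a minimal
sketch; the /storage mount point is a made-up placeholder, not the real
path here):

    import sys

    MOUNT_POINT = "/storage"   # hypothetical; substitute the array's mount

    # /proc/mounts fields: device, mountpoint, fstype, options, dump, pass
    for line in open("/proc/mounts"):
        dev, mnt, fstype, opts = line.split()[:4]
        if mnt == MOUNT_POINT:
            if "ro" in opts.split(","):
                print("%s (%s) has been remounted READ-ONLY" % (mnt, dev))
                sys.exit(1)
            print("%s (%s) is still read-write" % (mnt, dev))

Cron'ing that & mailing on a non-zero exit would give early warning.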
In Oct '08 I tried, under Transtec's instructions, to debug this 16-bay
Infortrend/EonStore Transtec RAID array. It turned out the array couldn't
even run their get-config Java code (the 8-bay RAID array from the same
manufacturer ran it fine). A Transtec tech suspected corruption in the RAID
array's embedded software. 3 disks failed in 3 months; Transtec agreed the
16-bay RAID array seemed faulty & sent us a replacement (mid-Dec).
Swapped it in early Jan '09; it saw the logical drives & partitions on the
disks. The LUNs were reconfigured to match Lancaster's & Birmingham's
config: instead of SCSI ids 0..5 each presenting LUN 0, it's now id 0
presenting LUNs 0..5. (I really don't think that config was the problem,
though.)
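For the record, the addressing can be eyeballed from the Linux side like
this (a sketch assuming a 2.6 kernel with sysfs; the path is standard but
the enumeration will obviously vary per host):

    import os

    # Device entries under /sys/bus/scsi/devices are named host:channel:id:lun.
    # Old config: ids 0..5, each presenting LUN 0.
    # New config: id 0 presenting LUNs 0..5.
    for name in sorted(os.listdir("/sys/bus/scsi/devices")):
        parts = name.split(":")
        if len(parts) == 4:    # skip non-device entries (hosts, targets)
            host, channel, scsi_id, lun = parts
            print("host %s chan %s id %s lun %s" % (host, channel, scsi_id, lun))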
Gentle xfer testing: not one SCSI error; further higher-volume xfer
testing is in progress, again with not one SCSI error so far. Now the
bottleneck is the 32-bit Streamline SE hardware (2 x 2.8GHz Xeon, 4GB RAM);
as Ewan pointed out, it normally seems to keep the 3.xGB mysql db in RAM,
then under high I/O it swaps itself nearly to death: load-avg climbs to
10, 15, 20, 40, 60... Previously the SE would then crash; now I stop CMS
xfers before a crash, since those crashes were probably what damaged the
ext3 fs previously.
I'm trying to work with CMS to find an FTS config that will keep the SE
load-avg under 10 (see the watchdog sketch below), as the hardware can
probably handle that level over sustained hours-long transfers; this had
some success in September, before SCSI errors would cause the SE to remount
the fs from the 16-bay RAID array read-only. (Zero problems with the 8-bay,
except both PSUs have died, at separate times.)
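Something like this is enough to watch for the danger zone (a minimal
sketch, not the actual tooling here; the threshold of 10 is from above, &
the "throttle" action is only a placeholder):

    import time

    LOAD_LIMIT = 10.0    # the SE hardware seems to cope below this

    def loadavg_1min():
        # First field of /proc/loadavg is the 1-minute load average.
        return float(open("/proc/loadavg").read().split()[0])

    def swap_free_kb():
        # Parse the SwapFree line from /proc/meminfo (value is in kB).
        for line in open("/proc/meminfo"):
            if line.startswith("SwapFree:"):
                return int(line.split()[1])
        return 0

    while True:
        load = loadavg_1min()
        print("load-avg %.2f, swap free %d kB" % (load, swap_free_kb()))
        if load > LOAD_LIMIT:
            # Placeholder only: this is where the CMS/FTS transfers would
            # get paused, before the SE swaps itself to death.
            print("WARNING: load over %.0f, throttle transfers" % LOAD_LIMIT)
        time.sleep(60)

The point is just to act on the load before swap runs out, rather than
after the crash.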
If GPFS (not yet in production) succeeds, this DPM server & the RAID arrays
will be retired; else the DPM server hardware will be upgraded.
BTW, the initial tally was 5 disk fails in 3 months, 3 of them in the same
slot. But inspecting those 2 same-slot replacement 'failed' disks (Seagate
ST3750330AS, ordered by us from Insight as on-site spares, since Transtec
wants each damaged disk back & only sends a replacement weeks later) showed
faulty manufacture: they have _two_ top plates. This makes the top screws
sit higher, & was quite possibly the reason the RAID array rejected them so
fast (ca. 2 weeks); testing the disks after removal from the RAID array
revealed only one minor sector error (checked along the lines of the sketch
below). Insight is taking the defective disks back & will send back
correctly-manufactured ones. Both were 2008-manufacture.
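For the sector check on a pulled disk, roughly this (a sketch; assumes
smartmontools is installed, needs root, & /dev/sdb is just where the disk
happened to land on the test box):

    import subprocess

    DEVICE = "/dev/sdb"    # wherever the pulled disk shows up on the test box

    # 'smartctl -A' (smartmontools) prints the drive's SMART attribute table.
    out = subprocess.Popen(["smartctl", "-A", DEVICE], stdout=subprocess.PIPE,
                           universal_newlines=True).communicate()[0]

    # Non-zero raw values on these attributes mean remapped/suspect sectors.
    for line in out.splitlines():
        for attr in ("Reallocated_Sector_Ct", "Current_Pending_Sector",
                     "Offline_Uncorrectable"):
            if attr in line:
                print(line.strip())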
But still: 3 disk fails in 3 different slots in 3 months was grounds for
rejecting the 16-bay RAID array as faulty. The replacement looks good so
far.
Research continues...