New SCSI cables: Adaptec cites only one VHDCI-VHDCI cable (Insight stocks it) at
ca. £30; 2m long, much longer than needed.
The expensive ones Insight sells are 2m & 3.7m!
Bris SE was hung by 8am, unresponsive, with a load of nearly 30; it had been
failing SAM tests since 4am. sar says I/O wait hit 90% at 5am. It was down to
14% by 8am, but %system held steady above 80%. gridftp....
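Those %system/%iowait numbers came from eyeballing sar output; a filter like the following picks out the bad intervals automatically. The sample lines and the 80% threshold are illustrative only (not the real lcgse01 data), and the column layout is assumed to be the old sysstat one: time, CPU, %user, %nice, %system, %iowait, %idle.

```shell
# Sample "sar -u"-style lines -- hypothetical values, not real lcgse01 data.
cat > /tmp/sar_sample.txt <<'EOF'
04:00:01 all 2.10 0.00 5.30 90.20 2.40
05:00:01 all 1.80 0.00 6.10 88.70 3.40
08:00:01 all 1.20 0.00 82.50 14.10 2.20
EOF

# Flag intervals where %system (field 5) or %iowait (field 6) exceeds 80.
awk '$5 > 80 || $6 > 80 { print $1, "system=" $5 "%", "iowait=" $6 "%" }' \
    /tmp/sar_sample.txt
```

Against the live box you would feed it real output, e.g. `sar -u -f /var/log/sa/saDD`, instead of the sample file.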
Took the opportunity to shut the SE & RAID arrays down and replace the SCSI
cables with identical never-used ones (less than 1m long). (Dr Newbold examined
them and said they were good quality, not junk.)
Didn't last long:
Aug 7 09:22:07 lcgse01 kernel: scsi1: Transmission error detected
Aug 7 09:22:07 lcgse01 kernel: >>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
Aug 7 09:22:07 lcgse01 kernel: scsi1: Dumping Card State at program address 0x19c Mode 0x11
Aug 7 09:22:07 lcgse01 kernel: Card was paused
etc etc.
These were interesting; they appeared in isolation, with little else around them in messages:
Aug 7 11:36:28 lcgse01 kernel: KERNEL: assertion (!sk->sk_forward_alloc) failed at net/core/stream.c (279)
Aug 7 11:36:28 lcgse01 kernel: KERNEL: assertion (!sk->sk_forward_alloc) failed at net/ipv4/af_inet.c (152)
Googling for those turns up only very old (2005/6) discussion, about the e1000
NIC driver. According to sar, this SE was nearly quiescent at that time.
And then again, once more when sar shows no real load:
Aug 7 12:57:51 lcgse01 kernel: scsi1: Transmission error detected
Aug 7 12:57:51 lcgse01 kernel: >>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
Aug 7 12:57:51 lcgse01 kernel: scsi1: Dumping Card State at program address 0x199 Mode 0x11
Aug 7 12:57:51 lcgse01 kernel: Card was paused
....
Aug 7 12:58:10 lcgse01 kernel: STACK: 0x1f2 0x0 0x0 0x0 0x0 0x0 0x0 0x0
Interestingly, it named only 5 of the 6 1.5TB partitions, and there was no
message about "remounting read-only". It's confirmed that the partitions are
still writeable. So, no reboot yet...
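The writeability check was done by hand; a repeatable version might look like this sketch, which just reads the mount options out of a file in /proc/mounts format. The mount points and devices below are made up for illustration, not the real lcgse01 layout.

```shell
# ro_check FILE MOUNTPOINT: print the first mount option (rw or ro) for
# MOUNTPOINT, reading FILE in /proc/mounts format
# ("device mountpoint fstype options dump pass").
ro_check() {
    awk -v mp="$2" '$2 == mp { split($4, o, ","); print o[1]; exit }' "$1"
}

# Hypothetical sample in /proc/mounts format:
cat > /tmp/mounts_sample.txt <<'EOF'
/dev/sdd1 /storage1 ext3 rw,data=ordered 0 0
/dev/sde1 /storage2 ext3 ro,data=ordered 0 0
EOF

ro_check /tmp/mounts_sample.txt /storage1   # prints "rw"
ro_check /tmp/mounts_sample.txt /storage2   # prints "ro"
```

Against the live box you would point it at /proc/mounts itself, and perhaps follow up with a touch/rm probe on each partition as belt and braces.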
Transtec says to ensure the Adaptec 39320A-R has the latest FW. Adaptec lists
the latest FW as 12 Jul 2004. How likely is it that the Adaptec card in this
server, delivered in spring 2005, has FW older than that?
(Will definitely check at the next reboot.)
> We had problems with Adaptec-cards in connection with the LSI-chipset of
> the PV610s, which have been resolved by the above suggestions. Some
> older cards may still misbehave, in this case I would recommend to go
> for an LSI22320 SCSI-card instead..
(Bris 16-bay is PV610S16R1C, Matt is that what Lancaster has?)
Also, Transtec "vaguely remembers*" that a mapping like this
lcgse01 kernel: Attached scsi disk sdd at scsi1, channel 0, id 0, lun 0
lcgse01 kernel: Attached scsi disk sde at scsi1, channel 0, id 1, lun 0
lcgse01 kernel: Attached scsi disk sdf at scsi1, channel 0, id 2, lun 0
lcgse01 kernel: Attached scsi disk sdg at scsi1, channel 0, id 3, lun 0
lcgse01 kernel: Attached scsi disk sdh at scsi1, channel 0, id 4, lun 0
lcgse01 kernel: Attached scsi disk sdi at scsi1, channel 0, id 5, lun 0
"causing lots of SCSI-errors, I would recommend to re-map all LUNs and map
them towards the same SCSI-ID..."
So it would be all scsi1, channel 0, id 1, lun 0..5
Matt, is that what Lancaster has? How do other sites configure their RAID
arrays?
* "Vaguely remembers" doesn't sound like very reliable vendor support.
What if doing that wrecks our storage? Moral of the story: the vendor does not
care; act accordingly.