Hi Henry
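The problem is transfers that are still being written when the filesystem goes read-only. In theory the client should get an error and clean up afterwards, but this is not always the case, so a partially written file can be left behind whose checksum no longer matches the source.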
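To illustrate the mechanism, here is a minimal Python sketch. It assumes Adler-32 as the checksum algorithm, and the function names and sample data are hypothetical. It shows that a copy whose write was cut short part-way through has a different checksum from the source, so any later comparison against the catalogue value reports a failure.

```python
import zlib

def adler32_hex(data: bytes) -> str:
    """Return the Adler-32 checksum of data as 8 lowercase hex digits."""
    return format(zlib.adler32(data) & 0xFFFFFFFF, "08x")

def file_adler32(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file from disk in chunks and return its Adler-32 as hex."""
    value = 1  # Adler-32 starting value
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return format(value & 0xFFFFFFFF, "08x")

# Hypothetical example: the full source file versus a copy whose write
# stopped half-way when the filesystem went read-only.
source = b"event data " * 1000
truncated = source[: len(source) // 2]

print(adler32_hex(source) == adler32_hex(truncated))  # False
```

Streaming in chunks keeps memory use flat for large files; a post-arrival check would compare `file_adler32(path)` with the checksum recorded for the transfer.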
Shaun
-----Original Message-----
From: Henry Nebrensky [mailto:[log in to unmask]]
Sent: 26 November 2012 08:58
To: De Witt, Shaun (STFC,RAL,SC)
Cc: [log in to unmask]
Subject: Re: Checksum checking
Hi,
I'm bemused at least, if not baffled: how does a read-only file system
cause a checksum failure?
Thanks
Henry
On Sun, 25 Nov 2012, Shaun De Witt wrote:
> Hi Chris
>
> Ignoring known issues likely to cause checksum failures (power outages, read-only file systems, network glitches, etc.), I estimate we see around 2-8 errors/wk across all disk servers. Of these, almost all are traceable either to a bug in the checksum algorithm used within CASTOR or to client problems, with the file later removed by the VO. The actual rate of 'file degradation' (inexplicable checksum errors, probably caused by bit-rot) is maybe 2-3 a year.
>
> NOTE: These figures are at best an educated guess. Gareth may have more definitive numbers.
>
> Shaun
> ________________________________________
> From: GRIDPP2: Deployment and support of SRM and local storage management [[log in to unmask]] on behalf of Christopher J. Walker [[log in to unmask]]
> Sent: 25 November 2012 18:07
> To: [log in to unmask]
> Subject: Checksum checking
>
> Whilst transfers are now checksummed, do sites routinely checksum files
> after arrival (I know RAL does, but not what they see)?
>
> How many errors do you see?
> When do you see these errors?
> What sort of errors are they?
>
> ATLAS routinely checksum files at sites before they are used by jobs. Do
> we know what sort of error rate they see?
--
Dr. Henry Nebrensky [log in to unmask]
http://people.brunel.ac.uk/~eesrjjn
"The opossum is a very sophisticated animal.
It doesn't even get up until 5 or 6 p.m."