On 26/06/12 22:18, Jens Jensen wrote:
> Oh and before you ask, I wrote it in C :-)
The Castor guys have something written in python - I'll dig out a copy
(Shawn deWitt gave it to me, IIRC). It also runs through and calculates
the checksum - you'd just need to modify it slightly to do a comparison
(oh, and worry about whether you keep leading zeros or not).
In fact I ran this on some storage at QMUL - and nearly had a heart
attack - due to the leading zeros.
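For the record, the trap is that adler32 is just a 32-bit integer: print
it with %x rather than %08x and any leading zeros vanish, so a string
comparison against a properly padded checksum fails for roughly one file
in sixteen. A minimal python sketch (names are mine):

    import zlib

    def adler32_hex(path, chunk_size=1 << 20):
        # Adler32 of a file, rendered as exactly 8 hex digits.
        value = 1  # adler32's initial value
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                value = zlib.adler32(chunk, value)
        # '%x' would silently drop leading zeros - '%08x' keeps them,
        # which is the difference that nearly stopped my heart at QMUL.
        return "%08x" % (value & 0xffffffff)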
> Thanks, Chris.
> On the subject of the "slow dump", this is something I played with at /home
> (ie /home/jens at home, so home squared) following my corrupted file blog
> post. I haven't got anything running yet, but it should only be
> a matter of finding some holiday time to get it running.
> My idea was I'd open a file and checksum it (using ADLER32 implemented in
> a separate compilation unit compiled with high optimisation), and the
> program would recurse through a given toplevel directory (like /home).
> Whenever it had checksummed a number of files whose combined filesizes
> are > N, or a single file of size > N (where N is, say, five megs), it would
> then sleep for a while.
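That recurse-and-sleep loop is only a few lines of python, incidentally -
here is a rough, untested sketch of what I think you're describing (the
names and numbers are mine):

    import os
    import time
    import zlib

    BATCH_BYTES = 5 * 1024 * 1024  # your N - say, five megs
    SLEEP_SECS = 1.0               # back-off between batches

    def slow_scan(top):
        # Walk 'top', yielding (path, adler32 as hex, size). Sleep once
        # BATCH_BYTES worth of data has been read - a single file bigger
        # than that also triggers a sleep once it is done.
        pending = 0
        for dirpath, dirnames, filenames in os.walk(top):
            for name in filenames:
                path = os.path.join(dirpath, name)
                value, size = 1, 0
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(1 << 20), b""):
                        value = zlib.adler32(chunk, value)
                        size += len(chunk)
                yield path, "%08x" % (value & 0xffffffff), size
                pending += size
                if pending >= BATCH_BYTES:
                    time.sleep(SLEEP_SECS)
                    pending = 0

(You can probably skip the separate highly-optimised compilation unit in
python - zlib's adler32 is C underneath anyway.)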
The devil is in the detail - data coming in at 1.5 Gbit/s means that you'd
need to checksum in parallel if the machine doing the checksumming only
has a 1 Gbit/s connection (realising that checksumming 1 day of input
data was going to take more than a day caused me to worry about it -
rough numbers below). Also, you might wish to randomly sample across disk servers -
but perhaps checking data from yesterday more rigorously (to make sure
it really did hit disk). But you are right, it's fundamentally not a
particularly difficult problem - and one we can and should easily solve.
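Rough numbers behind that worry, for the record:

    1.5 Gbit/s x 86400 s/day = ~130 Tbit = ~16 TB landing per day
    reading ~130 Tbit back at 1 Gbit/s = ~130,000 s = ~36 hours

So a single Gbit-attached checker falls about half a day further behind
for every day it runs - hence the parallelism.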
> For any checksum it'd query a database with (name, cksum, ctime, atime), where name is the relative pathname, ctime is the time the entry was created (first checksum), and atime is the most recent time it was checked. Conversely, a checker could work the other way, from the names in the database, to see if files have gone missing.
> It could of course be adapted to using RFIO...
> Before I finish writing it, does anyone know if anyone else has written such a tool? I was just hacking, but I won't put any serious effort into it if a tool already exists...
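On the database: sqlite would do for a first cut. Something like this
(untested, and the schema names are mine):

    import sqlite3
    import time

    conn = sqlite3.connect("cksums.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS cksum (
                        name  TEXT PRIMARY KEY, -- relative pathname
                        cksum TEXT,             -- 8 hex digits, zero-padded
                        ctime REAL,             -- first checksummed
                        atime REAL              -- most recently checked
                    )""")

    def record(name, cksum):
        # First sight: insert. Later sights: complain on a mismatch,
        # otherwise just bump atime.
        now = time.time()
        row = conn.execute("SELECT cksum FROM cksum WHERE name=?",
                           (name,)).fetchone()
        if row is None:
            conn.execute("INSERT INTO cksum VALUES (?,?,?,?)",
                         (name, cksum, now, now))
        elif row[0] != cksum:
            print("MISMATCH %s: got %s, expected %s" % (name, cksum, row[0]))
        else:
            conn.execute("UPDATE cksum SET atime=? WHERE name=?",
                         (now, name))
        conn.commit()

Your reverse checker is then just a SELECT over name plus an
os.path.exists() per row.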
> From: Christopher J.Walker [[log in to unmask]]
> Sent: 26 June 2012 16:00
> To: Jensen, Jens (STFC,RAL,ESC)
> Cc: [log in to unmask]
> Subject: Re: Agenda for tomorrow
> On 26/06/12 15:03, Jens Jensen wrote:
>> I am at CWI in the Netherlands and need to run soon to catch a rescheduled flight from sunny warm Amsterdam back to rainy old Blighty.
> I'll have to send my apologies - I've got another meeting I have to be
> in (and a minor Lustre upgrade to do).
>> Things to cover tomorrow (possibly more stuff to be added):
>> * That checksummy thing - does it make sense to syncat (incl checksum) regularly for VOs?
>> How much work would that be?
> That's two questions.
> I think it would make sense for sites to regularly (for some definition
> of regularly) checksum the data held on their storage and compare with
> the known checksum (stored in file metadata for StoRM). This will at
> least warn against silent corruption on disk. Sites can then file a GGUS
> ticket if they do see corruption.
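For StoRM the comparison could be as small as the sketch below. Health
warning: the attribute name is from memory, so check what your StoRM
version actually sets; os.getxattr also wants a reasonably recent python
on Linux (the xattr module does the same job otherwise).

    import os
    import zlib

    STORM_ATTR = "user.storm.checksum.adler32"  # from memory - verify!

    def verify(path):
        # Compare the on-disk adler32 against what StoRM recorded in
        # the extended attributes; zero-pad both sides to dodge the
        # leading-zeros trap.
        stored = os.getxattr(path, STORM_ATTR).decode().strip().lower()
        value = 1
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                value = zlib.adler32(chunk, value)
        return ("%08x" % (value & 0xffffffff)) == stored.zfill(8)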
> This is quite resource intensive - but could at least potentially be
> throttled depending on site load. It could also be intelligent and try
> to randomly sample files stored on different servers.
> A program already exists for Castor - it needs some adapting for StoRM -
> and presumably could be adapted for DPM and dCache too.
> I'd estimate it would take a week of my time to get something like this
> working solidly (I just need to find that week).
> The second question is should we produce syncat dumps regularly. Well,
> the main purpose of this is to do consistency checking against the LFC.
> The only reason for us to produce them regularly is that it is believed
> that doing it systematically through the SRM interface is too resource
> intensive.
> Quite frankly, at the moment producing these dumps by hand is incredibly
> resource intensive on my precious time. I think that producing a tool
> that can slowly go through an SRM would be a big step forward (perhaps
> one that produced syncat dumps) - even if it needed to be throttled and
> took a large amount of time - it would allow a VO to do this without
> site admin involvement - and that's the expensive thing IMHO. If it
> really is too resource intensive, then putting syncat dumps in a
> standard place would be a way forward - but whether we'd get agreement
> on doing this before the storage providers implement their syncing
> method I don't know.
> I'd estimate 2 weeks of someone's time (possibly more) to write a script
> that did srm calls and produced a syncat dump. It might also be
> interesting for this to be linked into ganga somehow.
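The shape of such a script is simple enough - the fiddly part is wiring
it to a real SRM client, which is why the listing helper below is left
hypothetical (and the one-line-per-file output is a stand-in, not the
actual syncat format):

    import time

    THROTTLE = 0.5  # seconds between SRM calls - tune to site load

    def list_directory(url):
        # Hypothetical helper: one SRM listing call, returning
        # (subdir_urls, [(file_url, size, checksum), ...]). Wire it
        # to lcg-ls/srmls/whatever client is to hand.
        raise NotImplementedError

    def dump(url, out):
        # Depth-first crawl, one SRM call per directory, sleeping
        # between calls so it can trundle along for days without
        # hurting anyone.
        time.sleep(THROTTLE)
        subdirs, files = list_directory(url)
        for f_url, size, cksum in files:
            out.write("%s %s %s\n" % (f_url, size, cksum))
        for sub in subdirs:
            dump(sub, out)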
> A script from ATLAS already exists under a free licence to compare LFC
> and syncat dumps.
>> * Storage-relevant stuff from OGF last week (not that much actually).