Oh and before you ask, I wrote it in C :-)
From: GRIDPP2: Deployment and support of SRM and local storage management [log in to unmask] on behalf of Jens Jensen [log in to unmask]
Sent: 26 June 2012 20:27
To: [log in to unmask]
Subject: Re: Agenda for tomorrow
On the subject of the "slow dump": this is something I played with at /home (i.e. /home/jens at home, so home squared) following my corrupted-file blog post. I haven't got anything running yet, but it should only take some holiday time to get it working.
My idea was that the program would open each file and checksum it (using ADLER32, implemented in a separate compilation unit compiled with high optimisation), recursing through a given top-level directory (like /home). Whenever it had checksummed a number of files whose combined filesizes exceed N, or a single file of size >N (where N is, say, five megs), it would sleep for a while.
For any checksum it'd query a database with (name, cksum, ctime, atime), where name is the relative pathname, ctime is the time the entry was created (first checksum), and atime is the most recent time it was checked. Conversely, a checker could work the other way, starting from the names in the database, to see whether files have gone missing.
It could of course be adapted to using RFIO...
Before I finish writing it, does anyone know whether anyone else has written such a tool? I was just hacking, but I won't put any serious effort into it if a tool already exists...
From: Christopher J. Walker [log in to unmask]
Sent: 26 June 2012 16:00
To: Jensen, Jens (STFC,RAL,ESC)
Cc: [log in to unmask]
Subject: Re: Agenda for tomorrow
On 26/06/12 15:03, Jens Jensen wrote:
> I am at CWI in the Netherlands and need to run soon to catch a rescheduled flight from sunny warm Amsterdam back to rainy old Blighty.
I'll have to send my apologies - I've got another meeting I have to be
in (and a minor Lustre upgrade to do).
> Things to cover tomorrow (possibly more stuff to be added):
> * That checksummy thing - does it make sense to syncat (incl checksum) regularly for VOs?
> How much work would that be?
That's two questions.
I think it would make sense for sites to regularly (for some definition
of regularly) checksum the data held on their storage and compare with
the known checksum (stored in file metadata for StoRM). This will at
least warn against silent corruption on disk. Sites can then file a GGUS
ticket if they do see corruption.
This is quite resource intensive - but could at least potentially be
throttled depending on site load. It could also be intelligent and try
to randomly sample files stored on different servers.
A program already exists for Castor - it needs some adapting for StoRM -
and presumably could be adapted for DPM and dCache too.
I'd estimate it would take a week of my time to get something like this
working solidly (I just need to find that week).
The second question is should we produce syncat dumps regularly. Well,
the main purpose of this is to do consistency checking against the LFC.
The only reason for us to produce them regularly is that it is believed
that doing it systematically through the SRM interface is too resource intensive.
Quite frankly, at the moment producing these dumps by hand is incredibly
resource intensive on my precious time. I think that producing a tool
that can slowly go through an SRM would be a big step forward (perhaps
one that produced syncat dumps) - even if it needed to be throttled and
took a large amount of time - it would allow a VO to do this without
site admin involvement - and that's the expensive thing IMHO. If it
really is too resource intensive, then putting syncat dumps in a
standard place would be a way forward - but whether we'd get agreement
on doing this before the storage providers implement their syncing
method I don't know.
I'd estimate 2 weeks of someone's time (possibly more) to write a script
that did srm calls and produced a syncat dump. It might also be
interesting for this to be linked into ganga somehow.
A script from ATLAS already exists under a free licence to compare LFC
and syncat dumps.
> * Storage-relevant stuff from OGF last week (not that much actually).