Hi,

>> SKA data will be highly dependent on the type of observing, but
>> initially (SKA Phase 1) we're looking at around 250TB/day,
>> depending on the level of raw data that is archived, and this is
>> an absolute maximum. The internal data format is currently under
>> discussion.
>
> That's more or less uncharted territory. When discussing data
> requirements for a series of lightsource beamlines (aggregate data
> rates potentially higher than CERN's) some years ago, someone
> pointed out to me that the astronomers were going to collect a lot
> more data still, and that this was a huge problem astronomers would
> have to deal with, because the sharp rise in detector specs thanks
> to silicon technology has impacted astronomy even more than other
> sciences.

The CMS event builder moves >100GB/s for processing in the HLT, and
sends a peak rate of 4GB/s (~350TB/day) to storage. I imagine the
ATLAS numbers are similar. These systems were working in 2008, and
their throughput will increase substantially in the next few years.
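For reference, the unit conversions behind those figures, as a quick
Python sketch (decimal units assumed throughout; the only inputs are
the numbers quoted in this thread):

    SECONDS_PER_DAY = 86_400

    # Sustained rate in GB/s -> daily volume in TB.
    def tb_per_day(gb_per_s):
        return gb_per_s * SECONDS_PER_DAY / 1_000

    print(tb_per_day(4.0))                # CMS peak to storage: ~346 TB/day
    print(250 * 1_000 / SECONDS_PER_DAY)  # SKA1 250TB/day max: ~2.9 GB/s

So the SKA Phase 1 archive rate is the same order of magnitude as
what a single LHC experiment already writes to storage.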
As I said, depending on the amount of pre-processing of the raw data,
this probably has more in common with a DAQ system than with the LHC
offline computing. The idea of using SRM for this brings me out in a
cold sweat.

I believe some current or near-future astronomy projects have larger
specifications still. Anyone know what LOFAR will move in the next
few years?

Dave

>
>> Overall the project is looking long term at anywhere between
>> 100PB -> 3EB per year for the second phase (but that is a while
>> away).

(See the P.S. below for what these figures come to as sustained
rates.)

>
> From what I understand of your type of science and my own
> preferences, I would do a totally distributed setup, with data
> storage next to each array component. The difficulty there is that
> at some point I guess you really want to correlate the data streams
> from the various components, and that is going to be an issue.
>
> It would be nice, I think, to avoid single huge data centres if
> possible, unless one has the same financial resources as the NSA,
> or a science that is naturally centralized (CERN, DNA sequencing).
>
> In particular, single huge storage pools are a big issue, and
> "cell"-style filesystems have somewhat narrow applicability
> domains, as for example SamS has mentioned.
>
> The crucial question I suppose you want to address is the main
> order in which to store data: by collecting instrument? By
> experiment? By time?
>
>> The initial structure is likely to be a large single archive at
>> the telescope, with distributed copies of the data at
>> continentally located regional Science/Engineering centres (so
>> pretty similar to T0-T1 of WLCG). Do you have (or know where I
>> may find) any numbers on the CERN T0 store?
>
> You can find a lot of information in various proceedings of the
> HEPiX workshops (the site reports and data centre discussions would
> be particularly relevant), and probably attending each of them
> would be quite important to your project. The people who do DNA
> sequencing also attend them; it is not just HEP.
>
> The two latest workshops:
>
> http://indico.cern.ch/contributionListDisplay.py?confId=160737
> http://indico.cern.ch/contributionListDisplay.py?confId=138424
>
> In general, the storage and site tracks may be most relevant:
>
> http://indico.cern.ch/sessionDisplay.py?sessionId=1&confId=160737#20120423
> http://indico.cern.ch/sessionDisplay.py?sessionId=4&confId=160737#20120424
>
> https://indico.cern.ch/sessionDisplay.py?sessionId=1&confId=138424#20111024
> https://indico.cern.ch/sessionDisplay.py?sessionId=8&confId=138424#20111027
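P.S. The Phase 2 figures quoted above, converted to sustained rates
in the same back-of-the-envelope way (decimal units, 365.25-day
year):

    SECONDS_PER_YEAR = 365.25 * 86_400      # ~3.16e7 s

    print(100e15 / SECONDS_PER_YEAR / 1e9)  # 100PB/year -> ~3.2 GB/s
    print(3e18 / SECONDS_PER_YEAR / 1e9)    # 3EB/year   -> ~95 GB/s

In other words, the top end of Phase 2 is roughly the >100GB/s the
CMS event builder moves internally today, but sustained around the
clock and written to an archive.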