Hi,

>> SKA data will be highly dependent on the type of observing, but
>> initially (SKA Phase 1) we're looking at around 250TB/day,
>> depending on the level of raw data that is archived, and this is
>> an absolute maximum. The internal data format is currently under
>> discussion.
>
> That's more or less uncharted territory. When discussing data
> requirements for a series of lightsource beamlines (aggregate data
> rates potentially higher than CERN's) some years ago, someone
> pointed out to me that the astronomers were going to collect a lot
> more data still, and that this was a huge problem astronomers would
> have to deal with, because the sharp rise in detector specs thanks
> to silicon technology has impacted astronomy even more than other
> sciences.

The CMS event builder moves >100GB/s for processing in the HLT, and
sends a peak rate of 4GB/s (~350TB/day) to storage. I imagine the
ATLAS numbers are similar. These systems were working in 2008, and
their throughput will increase substantially in the next few years.
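For reference, the unit conversions behind those figures, as a quick
Python sketch (decimal units assumed throughout; the only inputs are
the numbers quoted in this thread):

    SECONDS_PER_DAY = 86_400

    # Sustained rate in GB/s -> daily volume in TB.
    def tb_per_day(gb_per_s):
        return gb_per_s * SECONDS_PER_DAY / 1_000

    print(tb_per_day(4.0))                # CMS peak to storage: ~346 TB/day
    print(250 * 1_000 / SECONDS_PER_DAY)  # SKA1 250TB/day max: ~2.9 GB/s

So the SKA Phase 1 archive rate is the same order of magnitude as
what a single LHC experiment already writes to storage.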
As I said, depending on the amount of pre-processing of the raw data,
this probably has more in common with a DAQ system than with the LHC
offline computing. The idea of using SRM for this brings me out in a
cold sweat.

I believe some current or near-future astronomy projects have larger
specifications still. Anyone know what LOFAR will move in the next
few years?

Dave

>
>> Overall the project is looking long term at anywhere between
>> 100PB -> 3EB per year for the second phase (but that is a while
>> away).

(See the P.S. below for what these figures come to as sustained
rates.)

>
> From what I understand of your type of science and my own
> preferences, I would do a totally distributed setup, with data
> storage next to each array component. The difficulty there is that
> at some point I guess you really want to correlate the data streams
> from the various components, and that is going to be an issue.
>
> It would be nice, I think, to avoid single huge data centres if
> possible, unless one has the same financial resources as the NSA,
> or a science that is naturally centralized (CERN, DNA sequencing).
>
> In particular, single huge storage pools are a big issue, and
> "cell"-style filesystems have somewhat narrow applicability
> domains, as for example SamS has mentioned.
>
> The crucial question I suppose you want to address is the main
> order in which to store data: by collecting instrument? By
> experiment? By time?
>
>> The initial structure is likely to be a large single archive at
>> the telescope, with distributed copies of the data at
>> continentally located regional Science/Engineering centres (so
>> pretty similar to T0-T1 of WLCG). Do you have (or know where I
>> may find) any numbers on the CERN T0 store?
>
> You can find a lot of information in various proceedings of the
> HEPiX workshops (the site reports and data centre discussions would
> be particularly relevant), and probably attending each of them
> would be quite important to your project. The people who do DNA
> sequencing also attend them; it is not just HEP.
>
> The two latest workshops:
>
> http://indico.cern.ch/contributionListDisplay.py?confId=160737
> http://indico.cern.ch/contributionListDisplay.py?confId=138424
>
> In general, the storage and site tracks may be most relevant:
>
> http://indico.cern.ch/sessionDisplay.py?sessionId=1&confId=160737#20120423
> http://indico.cern.ch/sessionDisplay.py?sessionId=4&confId=160737#20120424
>
> https://indico.cern.ch/sessionDisplay.py?sessionId=1&confId=138424#20111024
> https://indico.cern.ch/sessionDisplay.py?sessionId=8&confId=138424#20111027
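P.S. The Phase 2 figures quoted above, converted to sustained rates
in the same back-of-the-envelope way (decimal units, 365.25-day
year):

    SECONDS_PER_YEAR = 365.25 * 86_400      # ~3.16e7 s

    print(100e15 / SECONDS_PER_YEAR / 1e9)  # 100PB/year -> ~3.2 GB/s
    print(3e18 / SECONDS_PER_YEAR / 1e9)    # 3EB/year   -> ~95 GB/s

In other words, the top end of Phase 2 is roughly the >100GB/s the
CMS event builder moves internally today, but sustained around the
clock and written to an archive.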