Dear Steffi

 

Thank you for asking about this. There is an implied requirement for an institutional data repository to address cases where a researcher has a need (generally based on a funder requirement) to formally archive data and (if possible) make the data publicly available, but has no ready means of doing this because:

 

a) in their discipline there are no external data centres (managed by funders or by other stakeholders for the disciplinary community) – not all disciplines are as well served as molecular biology;

b) there are data centres, but the data the researcher needs to archive to meet the funder’s requirement are out of scope of their collection policies;

c) there are data centres, but they do not meet required standards for sustainable long-term preservation or are otherwise unsuitable;

d) for other institutional or professional purposes there may be a requirement to preserve and publish data at the institution.

 

I think this covers most cases we anticipate. Probably the case of most immediate interest to many UK universities is where EPSRC requires long-term preservation (and where possible open publication) of data supporting research publications, as well as online publication of the related metadata. Organisations in receipt of EPSRC funding are expected to ensure their researchers can meet this requirement from 1 May 2015…

 

I hope this explains the purpose of my original question.

 

Regards

 

Robert

 

Dr Robert Darby

Research Data Management Project Manager

Research and Enterprise Development

The University of Reading
Tel: 0118 378 6161

 

From: Research Data Management discussion list [mailto:[log in to unmask]] On Behalf Of Stephanie Suhr
Sent: 13 January 2015 11:23
To: [log in to unmask]
Subject: Re: Data repository storage volumes and growth

 

Hi all,

 

Apologies for jumping back to the beginning in a way, but Robert writes “this is the storage requirement specifically for archiving/publishing data supporting published outputs […] where suitable external data centres cannot be used.”

                                                                                                  

Robert, I am just curious: why can external data centres (even “suitable” ones) not be used in this case?

 

Steffi

 

 

 

From: Research Data Management discussion list [mailto:[log in to unmask]] On Behalf Of Andy Turner
Sent: 13 January 2015 11:03
To: [log in to unmask]
Subject: Re: Data repository storage volumes and growth

 

Hi List,

 

Apologies if the attachment does not come through, but there should be an Excel spreadsheet attached to this message. I’ve pasted in the content too, so apologies if that is formatted horribly by email.

 

During work on the Research Data Leeds business case, a simple model of the University-wide volume, and projected volume, of research data to be archived was constructed. The model is based on the number of awards per year and on a simple model of the spread of data volumes across those awards (informed by liaison with a Research Data Management Working Group). Our enterprise architect specialising in research data has excerpted the relevant section from our overall business costing model and got permission for me to share it openly.

 

Repository data volume analysis - Business-as-usual projects

  Project output data range   %age of    Awards     Total data predicted per year (TB)
                              projects   per year   Min       Max        Mean
  10 TB to 100 TB             3%         10         97.43     974.25     535.84
  1 TB to 10 TB               7%         23         22.73     227.33     125.03
  0.1 TB to 1 TB              18%        58         5.85      58.46      32.15
  0.01 TB to 0.1 TB           32%        104        1.04      10.39      5.72
  0 TB to 0.01 TB             40%        130        0.00      1.30       0.65
  Grand total                                       127 TB    1,272 TB   699 TB

                                                        Min (TB)   Max (TB)   Mean (TB)
  Output data mastered in an external repository        -13        -127       -70
  Total Leeds archive data predicted per year
    at full influx rate                                 114        1,145      629
  New archive data from projects predicted in year 1    11         114        63
  New archive data from projects predicted in year 2    34         343        189

  Assumptions:
  Percentage of data mastered in an external repository: 10%
    (estimate based on talking to various faculties about their use of external repositories)
  Year 1 estimated percentage of full influx rate: 10%
    (because most projects already in flight will not use the service)
  Year 2 estimated percentage of full influx rate: 30%
    (because most projects already in flight will not use the service, but there will be more than in year 1)
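
In case it is useful, here is a minimal Python sketch that reproduces the arithmetic of the table. It is a reconstruction, not the spreadsheet itself: the total number of awards per year (roughly 325) is back-calculated from the figures shown, and the per-band minima/maxima assume every award in a band sits at the band’s lower/upper bound.

# Reconstruction of the banded volume model above (a sketch, not the
# original spreadsheet). TOTAL_AWARDS is inferred from the published figures.
TOTAL_AWARDS = 324.75  # awards/year; approximately 325, back-calculated

# (band lower bound in TB, band upper bound in TB, fraction of projects)
BANDS = [
    (10.0, 100.0, 0.03),
    (1.0,  10.0,  0.07),
    (0.1,  1.0,   0.18),
    (0.01, 0.1,   0.32),
    (0.0,  0.01,  0.40),
]

total_min = total_max = 0.0
for lo, hi, frac in BANDS:
    n = TOTAL_AWARDS * frac              # awards expected in this band
    band_min, band_max = n * lo, n * hi  # TB/yr if every award sits at lo / hi
    total_min += band_min
    total_max += band_max
    print(f"{lo:>5}-{hi:<6} TB: {n:6.1f} awards, min {band_min:7.2f}, "
          f"max {band_max:8.2f}, mean {(band_min + band_max) / 2:7.2f} TB")

total_mean = (total_min + total_max) / 2
print(f"Grand total: min {total_min:.0f}, max {total_max:.0f}, "
      f"mean {total_mean:.0f} TB")

# Downstream adjustments, matching the assumptions listed above
EXTERNAL = 0.10            # fraction of data mastered in an external repository
YEAR1, YEAR2 = 0.10, 0.30  # estimated fractions of the full influx rate
full_rate = total_mean * (1 - EXTERNAL)
print(f"Full influx (mean): {full_rate:.0f} TB/yr; "
      f"year 1: {full_rate * YEAR1:.0f} TB; year 2: {full_rate * YEAR2:.0f} TB")

This should reproduce the table figures to within rounding.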

 

 

I’ve already sent this information to Robert; he did some quick calculations after looking at the model and reckons that his figures are commensurate, based on a crude scaling.

 

I like the sort-of log10 approach to this estimation, and that min, max and mean values have been calculated to give an idea of the uncertainties and variance.

 

I might be allowed to share some further details of the Research Data Leeds business case, but as things stand I am not able to share the full details yet. This is partly because of strategic concerns to do with competition and procurement, and partly because of the uncertainties involved. More details are likely to be shared once the University has progressed further with implementation.

 

HTH

 

Andy
http://www.geog.leeds.ac.uk/people/a.turner/index.html
 

From: Research Data Management discussion list [mailto:[log in to unmask]] On Behalf Of Andy Turner
Sent: 08 January 2015 12:53
To: [log in to unmask]
Subject: Re: Data repository storage volumes and growth

 

Hi Robert, List,

 

I think this has been touched on before on this list (sorry, it may have been another list, or I may be mistaken, but there is some information about this somewhere)… The best I’ve found searching for a specific thread on this list is this one on “Research data quota takeup”:

https://www.jiscmail.ac.uk/cgi-bin/webadmin?A1=ind1410&L=RESEARCH-DATAMAN&D=0#28

 

This relates back in a way to Simon’s question:

https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=RESEARCH-DATAMAN;f3c2c500.1205

 

There is, I think, a power-law type of distribution to this: in your institution there will be some researchers/research groups with a large data production and storage requirement, and a lot of researchers with a relatively small storage requirement, but it can all add up to something significant.
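
To illustrate the point, here is a toy simulation, not institutional data: if per-group volumes follow a heavy-tailed lognormal distribution (all parameters below are made up for the sketch), a small share of groups ends up holding most of the total.

# Toy simulation of a heavy-tailed spread of storage needs across groups.
# N_GROUPS and the lognormal parameters are illustrative assumptions only.
import random

random.seed(1)
N_GROUPS = 300  # hypothetical number of active research groups
# per-group archive volume in TB; median ~0.05 TB with a long upper tail
volumes = [random.lognormvariate(-3.0, 2.0) for _ in range(N_GROUPS)]

volumes.sort(reverse=True)
total = sum(volumes)
top_decile = sum(volumes[:N_GROUPS // 10])
print(f"total {total:.0f} TB across {N_GROUPS} groups; "
      f"median group {volumes[N_GROUPS // 2]:.3f} TB; "
      f"top 10% of groups hold {100 * top_decile / total:.0f}% of the data")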

 

There is a big difference between storing sensitive data and storing data that can be made more openly available, so you might want to estimate these volumes separately.

 

A bit over a year ago the University of Leeds developed a business case probably similar to what you are doing. I could ask about sharing some details of this with you if you want.

 

Best wishes,

 

Andy
http://www.geog.leeds.ac.uk/people/a.turner/index.html
 

From: Research Data Management discussion list [mailto:[log in to unmask]] On Behalf Of Robert Darby
Sent: 08 January 2015 11:01
To: [log in to unmask]
Subject: Data repository storage volumes and growth

 

Hello

 

I am currently working with colleagues at the University of Reading on a business case for a research data repository, and we wanted to define some cost parameters for our archive storage requirement over the next five years. I am interested to know whether anybody has attempted to model expected archive storage volumes over a 3-5 year period, or, where services have already been established, whether anyone can share data about year-on-year growth in storage volumes.

 

To be clear: this is the storage requirement specifically for archiving/publishing data supporting published outputs in compliance with EPSRC and other public funders’ policies, where suitable external data centres cannot be used. Our business case will recommend implementing a service integrating EPrints and Arkivum, and we hope to begin implementation in early 2015. We are expecting to begin with a narrow compliance-focused data collection policy and that during the first year or two we will effectively be in a pilot phase with relatively low usage. It is assumed that as the service becomes more established the collection policy may broaden to include data arising from other research not funded by the big public funders and data from unfunded research.

 

I therefore assumed in years 1 and 2 a requirement for maybe 1-5 TB of storage, with a more steeply rising curve in years 3-5, reaching perhaps <100 TB by year 5. The general view of my colleagues is that this is far too low. But I’m willing to throw it in as a reference point to get things started…
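
For concreteness, a toy year-by-year projection of such a curve (the intake figures are placeholders chosen only to match the shape described above, not estimates):

# Toy projection of cumulative archive storage; annual intake figures are
# illustrative assumptions (1-5 TB early on, approaching 100 TB by year 5).
annual_intake_tb = [2, 4, 15, 30, 45]  # hypothetical new archive data per year

cumulative = 0
for year, intake in enumerate(annual_intake_tb, start=1):
    cumulative += intake
    print(f"Year {year}: +{intake:>2} TB new, {cumulative:>3} TB cumulative")
# -> year 5 cumulative is 96 TB, just under the <100 TB figure mentioned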

 

I realise there are so many variables in the mix that any meaningful numbers or comparisons between organisations are probably not possible, but I would be interested at least to have a sense of the scales of actual/projected storage others are working with. Does anybody out there have any relevant information they would be willing to share?

 

I should greatly appreciate any help!

 

Thank you

 

Robert

 

Dr Robert Darby

Research Data Management Project Manager

Research and Enterprise Development

The University of Reading
Whiteknights
Reading RG6 6AH
Tel: 0118 378 6161

[log in to unmask]