Hi Angus,

I’ve asked a colleague for clarification, but I might not get any. The detail is probably in one of the full business case documents, which I am not allowed to share at the moment. I could try to track it down if it is very important (I don’t think it is, but let me know if you really want me to find out for sure).

I think, but I’m not sure, that the figures reflect:

The spread of data volumes based on previous projects and research groups’ future projections. (Estimates based on partial information gleaned from the more engaged research groups.)

Archived data being data stored after a project’s funding period concludes.

Single copy volumes.

Additionally, I think the figures are for RCUK+ (RCUK, Horizon 2020, CRUK and Wellcome) funded projects only.

Multiple copies are generally wanted, but in some cases a single copy might be fine (where the data can be reproduced from source via relatively inexpensive computation). Two copies might also be fine for some data stored only on campus. I think with three copies we would pretty much always be storing the third copy off campus.
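To turn single-copy figures into raw storage, the volume is multiplied by the number of copies kept. A minimal Python sketch of that arithmetic, assuming every dataset in a tier keeps the same number of copies; the function names and the tier split in the test-style example are hypothetical and do not come from the business case:

```python
def raw_storage_tb(single_copy_tb, copies):
    """Raw storage (TB) needed when each dataset is kept `copies` times."""
    if copies < 1:
        raise ValueError("at least one copy is required")
    return single_copy_tb * copies


def mixed_storage_tb(tiers):
    """Sum raw storage over (single_copy_tb, copies) tiers (hypothetical split)."""
    return sum(raw_storage_tb(tb, n) for tb, n in tiers)


if __name__ == "__main__":
    # 63 TB is the year 1 mean single-copy figure from the model quoted in this thread.
    for n in (1, 2, 3):
        print(n, "copies:", raw_storage_tb(63, n), "TB")
```

With three copies, the 63 TB year 1 mean would need 189 TB of raw storage; a mix of one-, two- and three-copy data would fall somewhere in between.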

I’ll post back if I get confirmation/clarification on this.

HTH

Andy

From: Research Data Management discussion list On Behalf Of Angus Whyte
Sent: 13 January 2015 12:29
Subject: Re: Data repository storage volumes and growth

Hi Andy,

Thanks for sharing this information. It looks like a useful and straightforward approach. I wondered if you could shed any light on other assumptions:
- is the spread of data volumes based on previous projects, or the relevant PI/group projections about future projects?
- do the data ranges refer to all data collected in projects, or to some proportion of that expected to be retained at the end?
- does the 'total Leeds archive data' refer to a single or multiple archival copies, i.e. how many copies will you have in total?

Regards,

Angus

On 13/01/2015 11:02, Andy Turner wrote:

Hi List,

Apologies if the attachment does not come through, but there should be an Excel spreadsheet attached to this message. I’ve pasted in the content too, so apologies if that is formatted horribly by email.

During the work on the Research Data Leeds business case a simple model for the University-wide volume and projected volume of research data to be archived was constructed. The model is based on the number of awards per year and a simple model for the spread of data volumes across the awards (based on liaison with a Research Data Management Working Group). Our enterprise architect specialising in research data has excerpted the relevant section from our overall business costing model and got permission for me to share this openly.

Repository data volume analysis - Business-as-usual projects

Project output data range   % of projects   Number of awards per year   Total data predicted per year (TB)
                                                                        Min        Max        Mean
10 TB to 100 TB             3%              10                          97.43      974.25     535.84
1 TB to 10 TB               7%              23                          22.73      227.33     125.03
0.1 TB to 1 TB              18%             58                          5.85       58.46      32.15
0.01 TB to 0.1 TB           32%             104                         1.04       10.39      5.72
0 TB to 0.01 TB             40%             130                         0.00       1.30       .0 TB
Grand total                                                             127 TB     1,272 TB   699 TB

                                                                  Min (TB)   Max (TB)   Mean (TB)
Output data mastered in an external repository                    -13 TB     -127 TB    -70 TB
Total Leeds archive data predicted per year at full influx rate   114 TB     1,145 TB   629 TB
New archive data from projects predicted in year 1                11 TB      114 TB     63 TB
New archive data from projects predicted in year 2                34 TB      343 TB     189 TB

Assumptions

Percentage of data mastered in an external repository: 10% (Estimate based on talking to various faculties about their use of external repositories)
Year 1 estimated percentage of full influx rate: 10% (Because most projects already in flight will not use the service)
Year 2 estimated percentage of full influx rate: 30% (Because most projects already in flight will not use the service, but there will be more than in year 1)
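The arithmetic behind the table can be reproduced with a short script. What follows is a minimal Python sketch, not the actual Leeds spreadsheet. It assumes that each band's total is (share of projects) x (total awards per year) x (band bound), that the mean is the midpoint of min and max, and that there are about 324.75 awards per year. That award count is not stated anywhere in this thread; it is back-calculated from the table rows (for example 97.43 TB / (3% x 10 TB)). All names are illustrative.

```python
# Sketch of the banded volume model. TOTAL_AWARDS is inferred, not from the source.
TOTAL_AWARDS = 324.75  # assumption: back-calculated from the table rows

# (lower bound TB, upper bound TB, share of projects), as given in the table
BANDS = [
    (10.0, 100.0, 0.03),
    (1.0, 10.0, 0.07),
    (0.1, 1.0, 0.18),
    (0.01, 0.1, 0.32),
    (0.0, 0.01, 0.40),
]

EXTERNAL_SHARE = 0.10        # data mastered in an external repository
INFLUX = {1: 0.10, 2: 0.30}  # share of the full influx rate in years 1 and 2


def band_totals(low, high, share, awards=TOTAL_AWARDS):
    """Return (min, max, mean) TB per year for one band."""
    n = share * awards
    lo, hi = n * low, n * high
    return lo, hi, (lo + hi) / 2  # mean taken as the midpoint (assumption)


def model():
    """Return per-band rows, grand totals, archive totals and yearly influx."""
    rows = [band_totals(*band) for band in BANDS]
    grand = tuple(sum(col) for col in zip(*rows))
    archive = tuple(v * (1 - EXTERNAL_SHARE) for v in grand)
    yearly = {y: tuple(v * f for v in archive) for y, f in INFLUX.items()}
    return rows, grand, archive, yearly


if __name__ == "__main__":
    rows, grand, archive, yearly = model()
    print("Grand total (TB):", [round(v) for v in grand])
    print("Leeds archive at full influx (TB):", [round(v) for v in archive])
    for year, vals in yearly.items():
        print(f"Year {year} (TB):", [round(v) for v in vals])
```

Under these assumptions the rounded totals agree with the table (127, 1,272 and 699 TB before the external-repository deduction). The midpoint rule gives a mean of about 0.65 TB for the smallest band, where the table shows ".0 TB".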

I’ve already sent this information to Robert, who did some quick calculations after looking at the model and reckons that his figures are commensurate, based on a crude scaling.

I like the log10-style (order of magnitude) approach to this estimation, and that a min, max and mean have been calculated to give an idea of the uncertainties and variance.

I might be allowed to share some further details of the Research Data Leeds business cases, but as things are I am not to share the full details yet. This is partly because of strategic concerns to do with competition and procurement and because of the uncertainties involved. More details are likely to be shared when the University has progressed further with implementation.

HTH

Andy
http://www.geog.leeds.ac.uk/people/a.turner/index.html

From: Research Data Management discussion list On Behalf Of Andy Turner
Sent: 08 January 2015 12:53
Subject: Re: Data repository storage volumes and growth

Hi Robert, List,

I think this has been touched on before on this list (sorry, it may have been another list, or I may be mistaken, but there is some information somewhere about this). The best I’ve found searching for a specific thread on this list is this one on “Research data quota takeup”:

https://www.jiscmail.ac.uk/cgi-bin/webadmin?A1=ind1410&L=RESEARCH-DATAMAN&D=0#28

This relates back in a way to Simon’s question:

https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=RESEARCH-DATAMAN;f3c2c500.1205

There is, I think, a power-law type of distribution to this. In your institution there will be some researchers/research groups that have a large data production and storage requirement, and a lot of researchers with a relatively small storage requirement, but it can all add up to something significant.

There is a big difference between storing sensitive data and storing data that can be made more openly available, so you might want to try to estimate these volumes separately.

A bit over a year ago, the University of Leeds developed a business case that is probably similar to what you are doing. I could ask about sharing some details of this with you if you want.

Best wishes,

Andy
http://www.geog.leeds.ac.uk/people/a.turner/index.html

From: Research Data Management discussion list On Behalf Of Robert Darby
Sent: 08 January 2015 11:01
Subject: Data repository storage volumes and growth

Hello

I am currently working with colleagues at the University of Reading on a business case for a research data repository and we wanted to define some cost parameters for our archive storage requirement over the next five years. I am interested to know if anybody has attempted to model expected archive storage volumes over a 3-5 year period, or, where services have already been established, if anyone can share data about year-on-year growth in storage volumes.

To be clear: this is the storage requirement specifically for archiving/publishing data supporting published outputs in compliance with EPSRC and other public funders’ policies, where suitable external data centres cannot be used. Our business case will recommend implementing a service integrating EPrints and Arkivum, and we hope to begin implementation in early 2015. We are expecting to begin with a narrow compliance-focused data collection policy and that during the first year or two we will effectively be in a pilot phase with relatively low usage. It is assumed that as the service becomes more established the collection policy may broaden to include data arising from other research not funded by the big public funders and data from unfunded research.

I therefore assumed in years 1 and 2 a requirement for maybe 1-5 TB storage, with a more steeply rising curve in years 3-5 to <100TB by year 5. The general view of my colleagues is that this is far too low. But I’m willing to throw it in as a reference point to get things started…

I realise there are so many variables in the mix that any meaningful numbers or comparisons between organisations are probably not possible, but I would be interested at least to have a sense of the scales of actual/projected storage others are working with. Does anybody out there have any relevant information they would be willing to share?

I should greatly appreciate any help!

Thank you

Robert

Dr Robert Darby

Research Data Management Project Manager

Research and Enterprise Development

The University of Reading
Whiteknights
Reading RG6 6AH
Tel: 0118 378 6161

--

Dr Angus Whyte

Senior Institutional Support Officer

Digital Curation Centre

University of Edinburgh

Crichton St, Edinburgh EH8 9LE

+44-131-650-9980

The University of Edinburgh is a charitable body, registered in

Scotland, with registration number SC005336.
