Hi Andy,

Thanks again for sharing these estimates and for clarifying the points below. Anything further you can share on the list in due course would be useful, I'm sure. I appreciate it's difficult to share specifics of business cases. I notice some storage costs are being shared on www.curationexchange.org, by the way, if that's of interest.

On the specific estimates for Leeds, as you say there are a lot of uncertainties and the figures seem very sensitive to the assumptions made. At the risk of sounding glib, I just wanted to add the observation that being selective about what's archived, so that you exclude data that can be easily reproduced (if it can), could make a big difference, especially if focused on the 10% of projects you expect to produce more than 1 TB/year. That would cut costs, or allow you to keep additional copies and reduce risks for data of high value (e.g. where the consequences of loss are high).
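As a rough illustration of why the top 10% of projects matter so much, the sketch below takes the mean column of the Leeds table quoted further down this thread and works out what share of the predicted volume comes from the two largest bands. The variable names are made up for illustration, and it uses the published grand total rather than recomputing it.

```python
# Mean predicted volume per year (TB) for the two largest bands,
# copied from the Leeds table quoted later in this thread.
mean_tb_top_bands = {
    "10 TB to 100 TB (3% of projects)": 535.84,
    "1 TB to 10 TB (7% of projects)": 125.03,
}

# Published grand total of the mean column (TB).
grand_total_mean_tb = 699

# Share of the total mean volume produced by the top 10% of projects.
top_share = sum(mean_tb_top_bands.values()) / grand_total_mean_tb

print(f"Top 10% of projects: {top_share:.1%} of predicted volume")
```

On these figures, any selection policy applied to the largest projects acts on the great majority of the predicted volume.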

Also, presumably, as the projects at the top end of that range place exceptional demands on your storage, the costs can be charged to grants, and the proportion of the costs that can be recovered that way will increase in future years.

Anyway, thanks again. I've found your and Tim Banks's posts on this to the list really informative.

Regards, Angus


On 20/01/2015 12:19, Andy Turner wrote:

Hi Angus,

 

I’ve asked a colleague for clarification, but I might not get any. The detail is probably in one of the full business case documents, which I am not to share at the moment. I could try to find it to be sure if it is very important (I don’t think it is, but if you really want me to find out for sure, let me know).

 

I think, but I’m not sure, that the figures reflect:

- The spread of data volumes based on previous projects and research group future projections. (Estimates based on partial information gleaned from the more engaged research groups.)

- Archived data being data stored after a project’s funding period concludes.

- Single copy volumes.

 

Additionally I think the figures are also for RCUK+ (RCUK, Horizon 2020, CRUK and Wellcome) funded projects only.

 

Multiple copies are generally wanted, but in some cases a single copy might be fine (where the data can be reproduced from source via relatively inexpensive computation). Two copies might also be fine for some data stored only on campus. I think with three copies, we would pretty much always be storing the third copy off campus.

 

I’ll post back if I get confirmation/clarification on this.

 

HTH

 

Andy 

From: Research Data Management discussion list [mailto:[log in to unmask]] On Behalf Of Angus Whyte
Sent: 13 January 2015 12:29
To: [log in to unmask]
Subject: Re: Data repository storage volumes and growth

 

Hi Andy,

Thanks for sharing this information. It looks a useful and straightforward approach. I wondered if you could shed any light on other assumptions:
- Is the spread of data volumes based on previous projects, or the relevant PI/group projections about future projects?
- Do the data ranges refer to all data collected in projects, or to some proportion of that expected to be retained at the end?
- Does the 'total Leeds archive data' refer to a single or multiple archival copies, i.e. how many copies will you have in total?

Regards,

Angus



On 13/01/2015 11:02, Andy Turner wrote:

Hi List,

 

Apologies if the attachment does not come through, but there should be an Excel spreadsheet attached to this message. I’ve pasted in the content too, so apologies if that is formatted horribly by email.

 

During the work on the Research Data Leeds business case a simple model for the University-wide volume and projected volume of research data to be archived was constructed. The model is based on the number of awards per year and a simple model for the spread of data volumes across the awards (based on liaison with a Research Data Management Working Group). Our enterprise architect specialising in research data has excerpted the relevant section from our overall business costing model and got permission for me to share this openly.

 

Repository data volume analysis - Business-as-usual projects

Project output data range   % of projects   Number of awards   Total data predicted per year (TB)
                                            per year           Min        Max        Mean
10 TB to 100 TB             3%              10                 97.43      974.25     535.84
1 TB to 10 TB               7%              23                 22.73      227.33     125.03
0.1 TB to 1 TB              18%             58                 5.85       58.46      32.15
0.01 TB to 0.1 TB           32%             104                1.04       10.39      5.72
0 TB to 0.01 TB             40%             130                0.00       1.30       .0 TB
Grand total                                                    127 TB     1,272 TB   699 TB
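The figures above appear to follow a simple pattern: each band's minimum and maximum are the number of awards in that band multiplied by the band's lower and upper bound, and the mean is the midpoint of the two. A minimal sketch of that calculation follows. It is not the Leeds spreadsheet itself; the total of 324.75 awards per year is an assumption inferred by working backwards from the published volumes (the award counts shown in the table look rounded), and all names are illustrative.

```python
# Sketch of the banded volume model. TOTAL_AWARDS is an inferred
# assumption, not a figure stated anywhere in the thread.
TOTAL_AWARDS = 324.75

# (lower bound in TB, upper bound in TB, share of projects)
BANDS = [
    (10.0, 100.0, 0.03),
    (1.0, 10.0, 0.07),
    (0.1, 1.0, 0.18),
    (0.01, 0.1, 0.32),
    (0.0, 0.01, 0.40),
]


def band_volumes(low_tb, high_tb, share, total_awards=TOTAL_AWARDS):
    """Return (min, max, mean) yearly volume in TB for one band."""
    awards = share * total_awards
    minimum = awards * low_tb
    maximum = awards * high_tb
    return minimum, maximum, (minimum + maximum) / 2


rows = [band_volumes(*band) for band in BANDS]
total_min = sum(row[0] for row in rows)
total_max = sum(row[1] for row in rows)
total_mean = sum(row[2] for row in rows)

for (low, high, _), (mn, mx, mean) in zip(BANDS, rows):
    print(f"{low:>5} TB to {high:>5} TB: {mn:8.2f} {mx:8.2f} {mean:8.2f}")
print(f"Grand total: {total_min:.0f} {total_max:.0f} {total_mean:.0f}")
```

Because each band spans a factor of ten, the maximum is always ten times the minimum, which is why the overall range is so wide.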

                                                                  Min (TB)   Max (TB)   Mean (TB)
Output data mastered in an external repository                    -13 TB     -127 TB    -70 TB
Total Leeds archive data predicted per year at full influx rate   114 TB     1,145 TB   629 TB
New archive data from projects predicted in year 1                11 TB      114 TB     63 TB
New archive data from projects predicted in year 2                34 TB      343 TB     189 TB

Assumptions

Percentage of data mastered in an external repository: 10%
(Estimate based on talking to various faculties about their use of external repositories)

Year 1 estimated percentage of full influx rate: 10%
(Because most projects already in flight will not use the service)

Year 2 estimated percentage of full influx rate: 30%
(Because most projects already in flight will not use the service, but there will be more than in year 1)
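To show how these assumptions seem to feed the archive totals, here is a small sketch that starts from the published grand totals (127, 1,272 and 699 TB), removes the share mastered externally, and then applies the year 1 and year 2 influx percentages. The function and variable names are illustrative only, and the rounding to whole terabytes is an assumption about how the published figures were presented.

```python
# Published grand totals of project output data per year (TB).
FULL_OUTPUT_TB = {"min": 127, "max": 1272, "mean": 699}

# Assumptions as stated in the model.
EXTERNAL_SHARE = 0.10                 # data mastered in an external repository
INFLUX_BY_YEAR = {1: 0.10, 2: 0.30}   # fraction of the full influx rate


def archive_volumes(output_tb, external_share, influx_by_year):
    """Return (external TB, full-rate archive TB, {year: new archive TB})."""
    external = output_tb * external_share
    full_rate = output_tb - external
    per_year = {year: full_rate * frac for year, frac in influx_by_year.items()}
    return external, full_rate, per_year


results = {
    label: archive_volumes(tb, EXTERNAL_SHARE, INFLUX_BY_YEAR)
    for label, tb in FULL_OUTPUT_TB.items()
}

for label, (external, full_rate, per_year) in results.items():
    print(
        f"{label:>4}: external -{external:.0f} TB, "
        f"full rate {full_rate:.0f} TB, "
        f"year 1 {per_year[1]:.0f} TB, year 2 {per_year[2]:.0f} TB"
    )
```

Changing any of the three percentages rescales every row, which makes it easy to test how sensitive the year 1 and year 2 estimates are to the assumptions.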

 

 

I’ve already sent this information to Robert, who did some quick calculations after looking at the model and reckons that his figures are commensurate based on a crude scaling.

 

I like the log10-style approach to this estimation, and that a min, max and mean have been calculated to give an idea of the uncertainties and variance.

 

I might be allowed to share some further details of the Research Data Leeds business cases, but as things are I am not to share the full details yet. This is partly because of strategic concerns to do with competition and procurement and because of the uncertainties involved. More details are likely to be shared when the University has progressed further with implementation.

 

HTH

 

Andy
http://www.geog.leeds.ac.uk/people/a.turner/index.html
 

From: Research Data Management discussion list [mailto:[log in to unmask]] On Behalf Of Andy Turner
Sent: 08 January 2015 12:53
To: [log in to unmask]
Subject: Re: Data repository storage volumes and growth

 

Hi Robert, List,

 

I think this has been touched on before on this list (sorry, it may have been another list, or I may be mistaken, but there is some information somewhere about this)… The best I’ve found when searching this list for a specific thread is this one on “Research data quota takeup”:

https://www.jiscmail.ac.uk/cgi-bin/webadmin?A1=ind1410&L=RESEARCH-DATAMAN&D=0#28

 

This relates back in a way to Simon’s question:

https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=RESEARCH-DATAMAN;f3c2c500.1205

 

There is, I think, a power-law type of distribution to this. In your institution there will be some researchers/research groups with a large data production and storage requirement, and a lot of researchers with a relatively small storage requirement, but it can all add up to something significant.

 

There is a big difference between storing sensitive data and storing data that can be made more openly available, so you might want to try to estimate these volumes separately.

 

The University of Leeds developed a business case, probably similar to what you are doing, a bit over a year ago. I could ask about sharing some details of this with you if you want.

 

Best wishes,

 

Andy
http://www.geog.leeds.ac.uk/people/a.turner/index.html
 

From: Research Data Management discussion list [mailto:[log in to unmask]] On Behalf Of Robert Darby
Sent: 08 January 2015 11:01
To: [log in to unmask]
Subject: Data repository storage volumes and growth

 

Hello

 

I am currently working with colleagues at the University of Reading on a business case for a research data repository and we wanted to define some cost parameters for our archive storage requirement over the next five years. I am interested to know if anybody has attempted to model expected archive storage volumes over a 3-5 year period, or, where services have already been established, if anyone can share data about year-on-year growth in storage volumes.

 

To be clear: this is the storage requirement specifically for archiving/publishing data supporting published outputs in compliance with EPSRC and other public funders’ policies, where suitable external data centres cannot be used. Our business case will recommend implementing a service integrating EPrints and Arkivum, and we hope to begin implementation in early 2015. We are expecting to begin with a narrow compliance-focused data collection policy and that during the first year or two we will effectively be in a pilot phase with relatively low usage. It is assumed that as the service becomes more established the collection policy may broaden to include data arising from other research not funded by the big public funders and data from unfunded research.

 

I therefore assumed in years 1 and 2 a requirement for maybe 1-5 TB storage, with a more steeply rising curve in years 3-5 to <100TB by year 5. The general view of my colleagues is that this is far too low. But I’m willing to throw it in as a reference point to get things started…

 

I realise there are so many variables in the mix that any meaningful numbers or comparisons between organisations are probably not possible, but I would be interested at least to have a sense of the scales of actual/projected storage others are working with. Does anybody out there have any relevant information they would be willing to share?

 

I should greatly appreciate any help!

 

Thank you

 

Robert

 

Dr Robert Darby

Research Data Management Project Manager

Research and Enterprise Development

The University of Reading
Whiteknights
Reading RG6 6AH
Tel: 0118 378 6161

[log in to unmask]

 




-- 
Dr Angus Whyte
Senior Institutional Support Officer
Digital Curation Centre
University of Edinburgh
Crichton St, Edinburgh EH8 9LE
+44-131-650-9980
 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

