Hi Andy,

Thanks again for sharing these estimates and for clarifying the points 
below. Anything further you can share on the list in due course would 
be useful, I'm sure. I appreciate it's difficult to share specifics of 
business cases. I notice some storage costs are being shared on 
www.curationexchange.org, by the way, if that's of interest.

On the specific estimates for Leeds, as you say there are a lot of 
uncertainties and the figures seem very sensitive to the assumptions 
made. At the risk of sounding glib, I just wanted to add the 
observation that being selective about what's archived, so that you 
exclude data that can easily be reproduced (where it can be), could 
make a big difference, especially if that selectivity is focused on 
the 10% of projects you expect to produce more than 1 TB/year. That 
would cut costs, or allow you to keep additional copies and reduce 
risks for data of high value (e.g. where the consequences of loss are 
high).
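
To put a rough number on that: taking the mean column from your table 
below, the two bands above 1 TB per project account for roughly 661 of 
the 699 TB predicted per year, i.e. about 94% of the volume. A quick 
Python back-of-envelope, with figures copied from the table (the 
0-0.01 TB band mean is interpolated as (min + max) / 2):

  # Mean annual volumes (TB) per band, from the Leeds model below.
  band_means_tb = {
      "10-100 TB":   535.84,
      "1-10 TB":     125.03,
      "0.1-1 TB":     32.15,
      "0.01-0.1 TB":   5.72,
      "0-0.01 TB":     0.65,  # interpolated as (min + max) / 2
  }
  total = sum(band_means_tb.values())  # ~699 TB
  # The top two bands are the ~10% of projects above 1 TB each.
  top = band_means_tb["10-100 TB"] + band_means_tb["1-10 TB"]
  print(f"{top:.0f} of {total:.0f} TB ({top / total:.0%})")
  # -> 661 of 699 TB (94%)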

Also, since the projects at the top end of that range place 
exceptional demands on your storage, presumably those costs can be 
charged to grants, and the proportion of costs recovered that way will 
increase in future years.

Anyway, thanks again. I've found your and Tim Banks's posts to the 
list on this really informative.

Regards, Angus


On 20/01/2015 12:19, Andy Turner wrote:
>
> Hi Angus,
>
> I’ve asked a colleague for clarification, but I might not get any. The 
> detail is probably in one of the full business case documents, which I 
> am not permitted to share at the moment. I could try to find it to be 
> sure if it were very important (I don’t think it is, but if you really 
> want me to find this out for sure, let me know).
>
> I think, but I’m not sure, that the figures reflect:
>
> The spread of data volumes based on previous projects and research 
> groups’ projections of future projects. (Estimates are based on 
> partial information gleaned from the more engaged research groups.)
>
> Archived data being data stored after a project’s funding period concludes.
>
> Single copy volumes.
>
> Additionally, I think the figures are for RCUK+ (RCUK, Horizon 
> 2020, CRUK and Wellcome) funded projects only.
>
> Multiple copies are generally wanted, but in some cases a single copy 
> might be fine (where the data can be reproduced from source via 
> relatively inexpensive computation), and two copies might also be fine 
> for some data stored only on campus. I think with three copies we 
> would pretty much always be storing the third copy off campus.
>
> I’ll post back if I get confirmation/clarification on this.
>
> HTH
>
> Andy
>
> *From:*Research Data Management discussion list 
> [mailto:[log in to unmask]] *On Behalf Of *Angus Whyte
> *Sent:* 13 January 2015 12:29
> *To:* [log in to unmask]
> *Subject:* Re: Data repository storage volumes and growth
>
> Hi Andy,
>
> Thanks for sharing this information. It looks like a useful and 
> straightforward approach. I wondered if you could shed any light on 
> the other assumptions:
> - is the spread of data volumes based on previous projects, or on the 
> relevant PI/group projections about future projects?
> - do the data ranges refer to all data collected in projects, or to 
> some proportion of that expected to be retained at the end?
> - does the 'total Leeds archive data' refer to a single archival copy 
> or to multiple copies, i.e. how many copies will you have in total?
>
> Regards,
>
> Angus
>
>
>
> On 13/01/2015 11:02, Andy Turner wrote:
>
>     Hi List,
>
>     Apologies if the attachment does not come through, but there
>     should be an Excel spreadsheet attached to this message. I’ve
>     pasted in the content too, so apologies if that is formatted
>     horribly by email.
>
>     During the work on the Research Data Leeds business case, a simple
>     model of the University-wide volume (and projected volume) of
>     research data to be archived was constructed. The model is based
>     on the number of awards per year and a simple model for the spread
>     of data volumes across the awards (based on liaison with a
>     Research Data Management Working Group). Our enterprise architect
>     specialising in research data has excerpted the relevant section
>     from our overall business costing model and obtained permission
>     for me to share it openly.
>
>     *Repository data volume analysis - Business-as-usual projects*
>
>     Project output        | % of     | Awards   | Total data predicted per year (TB)
>     data range            | projects | per year |    Min |      Max |    Mean
>     ----------------------+----------+----------+--------+----------+--------
>     10 TB   to 100 TB     |    3%    |    10    |  97.43 |   974.25 |  535.84
>     1 TB    to 10 TB      |    7%    |    23    |  22.73 |   227.33 |  125.03
>     0.1 TB  to 1 TB       |   18%    |    58    |   5.85 |    58.46 |   32.15
>     0.01 TB to 0.1 TB     |   32%    |   104    |   1.04 |    10.39 |    5.72
>     0 TB    to 0.01 TB    |   40%    |   130    |   0.00 |     1.30 |    0.65
>     *Grand total*         |          |          | 127 TB | 1,272 TB |  699 TB
>
>                                                        Min (TB) | Max (TB) | Mean (TB)
>     Output data mastered in an external repository         -13 |     -127 |       -70
>     Total Leeds archive data predicted per year
>       at full influx rate                                  114 |    1,145 |       629
>     New archive data from projects predicted in year 1      11 |      114 |        63
>     New archive data from projects predicted in year 2      34 |      343 |       189
>
>     */Assumptions/*
>
>     /Percentage of data mastered in an external repository:/ 10%
>     /(Estimate based on talking to various faculties about their use
>     of external repositories)/
>
>     /Year 1 estimated percentage of full influx rate:/ 10%
>     /(Because most projects already in flight will not use the service)/
>
>     /Year 2 estimated percentage of full influx rate:/ 30%
>     /(Because most projects already in flight will not use the
>     service, but there will be more than in year 1)/
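>
>     In case it helps anyone reproduce the figures, below is a rough
>     sketch of the calculation in Python. The total award count of
>     roughly 325 per year is inferred by back-calculating from the
>     table rather than taken from the business case, so treat it as an
>     assumption:
>
>       # Banded model: (fraction of projects, min TB, max TB per project).
>       AWARDS_PER_YEAR = 325  # inferred from the table, not an official figure
>       bands = [
>           (0.03, 10.0, 100.0),
>           (0.07, 1.0, 10.0),
>           (0.18, 0.1, 1.0),
>           (0.32, 0.01, 0.1),
>           (0.40, 0.0, 0.01),
>       ]
>       lo = sum(f * AWARDS_PER_YEAR * mn for f, mn, mx in bands)  # ~127 TB
>       hi = sum(f * AWARDS_PER_YEAR * mx for f, mn, mx in bands)  # ~1,272 TB
>       mean = (lo + hi) / 2                                       # ~699 TB
>
>       # Apply the assumptions above: 10% of data mastered externally,
>       # then 10% / 30% of the full influx rate in years 1 and 2.
>       full_influx = mean * (1 - 0.10)  # ~629 TB per year
>       year1 = full_influx * 0.10       # ~63 TB
>       year2 = full_influx * 0.30       # ~189 TB
>       print(round(lo), round(hi), round(mean), round(year1), round(year2))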
>
>     I’ve already sent this information to Robert, who did some quick
>     calculations after looking at the model and reckons that his
>     figures are commensurate, based on a crude scaling.
>
>     I like the log10-style approach to this estimation, and that min,
>     max and mean values have been calculated to give an idea of the
>     uncertainties and variance.
>
>     I might be allowed to share some further details of the Research
>     Data Leeds business cases, but as things stand I am not to share
>     the full details yet. This is partly because of strategic concerns
>     to do with competition and procurement, and because of the
>     uncertainties involved. More details are likely to be shared when
>     the University has progressed further with implementation.
>
>     HTH
>
>     Andy
>     _http://www.geog.leeds.ac.uk/people/a.turner/index.html_
>
>     *From:*Research Data Management discussion list
>     [mailto:[log in to unmask]] *On Behalf Of *Andy Turner
>     *Sent:* 08 January 2015 12:53
>     *To:* [log in to unmask]
>     <mailto:[log in to unmask]>
>     *Subject:* Re: Data repository storage volumes and growth
>
>     Hi Robert, List,
>
>     I think this has been touched on before on this list (sorry, it
>     may have been another list, or I may be mistaken, but there is
>     some information about this somewhere)… The best I’ve found when
>     searching for a specific thread on this list is this one on
>     “Research data quota takeup”:
>
>     https://www.jiscmail.ac.uk/cgi-bin/webadmin?A1=ind1410&L=RESEARCH-DATAMAN&D=0#28
>
>     This relates back in a way to Simon’s question:
>
>     https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=RESEARCH-DATAMAN;f3c2c500.1205
>
>     There is, I think, a power-law type of distribution to this. In
>     your institution there will be some researchers/research groups
>     that produce and store large volumes of data, and there will also
>     be a lot of researchers with relatively small storage
>     requirements, but it can all add up to something significant.
>
>     There is a big difference between storing sensitive data and
>     storing data that can be made more openly available, so you might
>     want to try to estimate these volumes separately.
>
>     The University of Leeds developed a business case, probably
>     similar to what you are doing, a bit over a year ago. I could ask
>     about sharing some details of it with you if you want.
>
>     Best wishes,
>
>     Andy
>     _http://www.geog.leeds.ac.uk/people/a.turner/index.html_
>
>     *From:*Research Data Management discussion list
>     [mailto:[log in to unmask]] *On Behalf Of *Robert Darby
>     *Sent:* 08 January 2015 11:01
>     *To:* [log in to unmask]
>     <mailto:[log in to unmask]>
>     *Subject:* Data repository storage volumes and growth
>
>     Hello
>
>     I am currently working with colleagues at the University of
>     Reading on a business case for a research data repository and we
>     wanted to define some cost parameters for our archive storage
>     requirement over the next five years. I am interested to know if
>     anybody has attempted to model expected archive storage volumes
>     over a 3-5 year period, or, where services have already been
>     established, whether anyone can share data about year-on-year growth in
>     storage volumes.
>
>     To be clear: this is the storage requirement specifically for
>     archiving/publishing data supporting published outputs in
>     compliance with EPSRC and other public funders’ policies, where
>     suitable external data centres cannot be used. Our business case
>     will recommend implementing a service integrating EPrints and
>     Arkivum, and we hope to begin implementation in early 2015. We are
>     expecting to begin with a narrow compliance-focused data
>     collection policy and that during the first year or two we will
>     effectively be in a pilot phase with relatively low usage. It is
>     assumed that as the service becomes more established the
>     collection policy may broaden to include data arising from other
>     research not funded by the big public funders and data from
>     unfunded research.
>
>     I therefore assumed a requirement of maybe 1-5 TB of storage in
>     years 1 and 2, with a more steeply rising curve in years 3-5 to
>     just under 100 TB by year 5. The general view of my colleagues is
>     that this is far too low. But I’m willing to throw it in as a
>     reference point to get things started…
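>
>     To make that concrete: a ramp from roughly 3 TB in year 1 (the
>     midpoint of the 1-5 TB guess) to about 100 TB by year 5 implies
>     something like a 2.4x year-on-year growth factor. Illustrative
>     numbers only:
>
>       # Growth factor implied by ramping ~3 TB -> ~100 TB over 4 steps.
>       start_tb, end_tb, steps = 3.0, 100.0, 4  # years 1 through 5
>       factor = (end_tb / start_tb) ** (1 / steps)  # ~2.4x per year
>       print([round(start_tb * factor ** n) for n in range(steps + 1)])
>       # -> [3, 7, 17, 42, 100]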
>
>     I realise there are so many variables in the mix that any
>     meaningful numbers or comparisons between organisations are
>     probably not possible, but I would be interested at least to have
>     a sense of the scales of actual/projected storage others are
>     working with. Does anybody out there have any relevant information
>     they would be willing to share?
>
>     I should greatly appreciate any help!
>
>     Thank you
>
>     Robert
>
>     Dr Robert Darby
>
>     Research Data Management Project Manager
>
>     Research and Enterprise Development
>
>     The University of Reading
>     Whiteknights
>     Reading RG6 6AH
>     Tel: 0118 378 6161
>
>     [log in to unmask] <mailto:[log in to unmask]>
>
>
>
>


-- 
Dr Angus Whyte
Senior Institutional Support Officer
Digital Curation Centre
University of Edinburgh
Crichton St, Edinburgh EH8 9LE
+44-131-650-9980

The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.