Hi folks, we’re currently looking at ways to improve how we deliver the largest datasets on https://data.bl.uk/ to users. The largest today are several hundred GB, and it won’t be long before we’re into the TB. Large downloads can be a pain for users because they take a long time and can easily be interrupted. They also potentially present a significant cost to us as a data provider because outbound bandwidth costs can be high for data stored in the cloud. I know this is something that many people in the community will already have grappled with, so I’m hoping there will be some experience to share.
Possibilities we’ve discussed so far include:
· Just let people download over HTTP but advise use of a download manager to handle interruptions to the connection
· Publish via BitTorrent, which has the advantage that if a large number of people are downloading the same thing at once (e.g. during a workshop) our outgoing bandwidth use could be significantly less than filesize × number of people
· Allow people to request a copy on disk via courier, probably charged to cover costs
· Split datasets into smaller chunks to make it easier to get just the bit you need (but makes it more effort if you do want the whole lot)
· Allow users to move their compute to the data, either in the cloud or by renting out space in our machine room (this is essentially what AWS have done for some big public datasets)
· Provide a dedicated API and/or UI to allow users to browse the collection and select a custom subset to download
The last two would probably be my preference in the long run, but require the most resource to set up and maintain.
There was also some discussion a few years ago of using GridFTP, but I don’t know where that went.
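To make the first option concrete, here is a minimal Python sketch of the resume-on-interruption behaviour a download manager provides, using an HTTP Range request. The function names (`range_header`, `resume_download`) and the choice of the standard-library `urllib` are assumptions for illustration only, and it relies on the server supporting range requests.

```python
import os
import urllib.request


def range_header(partial_path):
    """Build a Range header that resumes after the bytes already on disk."""
    have = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    # No partial file (or an empty one): send no Range header, fetch everything.
    return {"Range": f"bytes={have}-"} if have else {}


def resume_download(url, dest, chunk_size=1 << 20):
    """Download url to dest, continuing from any partial file at dest.

    Hypothetical helper: if dest is already complete, the server will
    normally answer 416 and urlopen raises HTTPError.
    """
    request = urllib.request.Request(url, headers=range_header(dest))
    with urllib.request.urlopen(request) as response:
        # 206 means the server honoured the Range request, so append.
        # 200 means it is sending the whole file again, so overwrite.
        mode = "ab" if response.status == 206 else "wb"
        with open(dest, mode) as out:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
```

Re-running `resume_download(url, dest)` after a dropped connection picks up where the previous attempt stopped, which is the main thing we would be asking users to get from a download manager.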
Any advice would be most welcome and I’d be happy to summarise responses for the community. Feel free to reply to me directly; on request, I’ll anonymise your advice in the summary or leave it out.
Many thanks,
Jez
Jez Cope (he/him)
Data Services Lead, Research Services
The British Library, Boston Spa