Hi Al,

 

We have a highly structured archive of about 180 million files in circa 10 million directories. We’ve coped with this using a separate data catalogue of circa 5.5k datasets, each of which uniquely covers part of the archive (i.e. points to a location in the directory tree below which sit all of that dataset’s files and folders).
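
For a rough feel of how the catalogue relates to the directory tree, here’s an illustrative sketch in Python - the dataset IDs, paths and field names are invented for the example, not our actual records:

    # Hypothetical catalogue entry: each dataset record points at the directory
    # below which all of that dataset's files and folders sit.
    catalogue = {
        "dataset-00421": {
            "title": "Example radar rainfall composites",
            "archive_root": "/archive/example/radar/composites",
        },
    }

    def dataset_for_path(path, catalogue):
        """Return the dataset whose archive_root contains the given file, if any."""
        for dataset_id, record in catalogue.items():
            if path.startswith(record["archive_root"].rstrip("/") + "/"):
                return dataset_id
        return None  # file sits outside any dataset (e.g. metadata/doc folders)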

However, we’re now moving over to using ElasticSearch as a highly scalable index of _all_ the files in the archive - including those that are not within datasets (e.g. metadata and doc folders). ElasticSearch is a great tool as it gives you a very flexible way to index everything, and it comes with tooling that you can build interfaces on top of... plus it’s open source!
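
To make that concrete, here’s a minimal sketch of pushing one file record into the index with the official Python client - the index name, document fields and localhost URL are just assumptions for the example, not our actual setup:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # assumed local cluster, v8 client syntax

    # Invented example record for a single file in the archive.
    doc = {
        "path": "/archive/example/radar/composites/2018/08/composite_20180813.nc",
        "size_bytes": 4_321_000,
        "file_format": "NetCDF",
        "dataset_id": "dataset-00421",  # None for files outside any dataset
    }

    # One document per file; the path makes a convenient unique ID.
    es.index(index="archive-files", id=doc["path"], document=doc)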

 

We’re now harvesting content from the files to populate the index – e.g. parameter info, geo-temporal info, file formats, sizes, etc. – and linking these with our detailed catalogue entries, which will allow us to provide faceted searches and metadata harvesting at scale across a highly heterogeneous archive.
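
And to illustrate the faceted search side, a sketch of the kind of aggregation query that drives facet counts - again, the index name and field names ("file_format", "parameters") are assumptions for the example rather than our schema:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # assumed local cluster

    resp = es.search(
        index="archive-files",
        size=0,  # we only want the facet counts, not the hits themselves
        aggs={
            "by_format": {"terms": {"field": "file_format"}},
            "by_parameter": {"terms": {"field": "parameters"}},
        },
    )
    for bucket in resp["aggregations"]["by_format"]["buckets"]:
        print(bucket["key"], bucket["doc_count"])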

 

Let me know if you’d like more info.

 

graham

 

From: Research Data Management discussion list <[log in to unmask]> On Behalf Of Alastair Downie
Sent: 13 August 2018 11:17
To: [log in to unmask]
Subject: Content Management System for research data?

 

Hi all.

 

I had a conversation with one of our PIs recently - she’s becoming anxious about not being able to find data after researchers have left the lab. I talked her through the usual data management advice about using a good naming convention, a well-organised directory structure etc., but she showed me her lab’s filestore folder and it’s clear that she’s actually doing all of the above pretty well. The problem is the sheer volume of data and the rate at which it can be generated, which requires an almost cryptic naming convention to differentiate datasets. It’s starting to look like a traditional files & folders operating system will not cope well with *very* large research directory trees.

 

In most modern website content management systems, users deposit data (images etc.) without having to see what disk it goes onto, and without worrying about a directory structure. I’m wondering if the same might be possible for an entire institution’s research data, using an ELN (or other documentation system) as the management interface - so *all* discovery and access would be provided by verbose, plain-English links in the description of the research, and users would not be permitted to see or operate a directory structure at all. I understand this is the basis of most data repositories - my idea is to extend this method into the labs, for management of ALL data, rather than just the tip of the iceberg that is the published work.

 

This would be a huge cultural change of course, and it’d be a challenge to convince researchers that they no longer need a file browser. A leap too far for many, I expect. It’s also not completely clear that this approach would be any more scalable than a traditional file browser. So I’d be very grateful to hear from anyone who might have considered this idea or experimented with it already - if the idea has even tiny wee buds where legs might one day grow, it’d be good to discuss further.

 

Thanks,

 

Al

 

 


=====================================================
Alastair Downie (Head of IT)
The Gurdon Institute, University of Cambridge,
Tennis Court Road, Cambridge CB2 1QN, United Kingdom
Office: +44(0)1223 762556
Mobile: +44(0)7989 393304

=====================================================

 

 


To unsubscribe from the RESEARCH-DATAMAN list, click the following link:
https://www.jiscmail.ac.uk/cgi-bin/webadmin?SUBED1=RESEARCH-DATAMAN&A=1


