On Tue, 24 Feb 2009, Stu Weibel wrote:
> Hi, Joe,
>
>> I'm probably too close to this subject, as I've been working with the
>> group developing SPASE, a metadata effort for describing the holdings of
>> space physics archives, and there are some issues with what metadata is
>> useful in different sub-communities, as well as some issues with
>> terminology and how the different communities group their data into
>> collections.
> Are your metadata standards available to the public?
Yes. The current SPASE draft is at:
http://www.spase-group.org/data/doc/spase-1_3_4-draft.pdf
We hope to have it stable for version 1.4 in the next couple of months.
For solar physics, we use a much simpler model to describe the data:
http://vso1.nascom.nasa.gov/docs/wiki/DataModel18
(The thing is -- it's what the scientists wrote, and there have been a few
proposed version 2.0s, but none of them stuck. We've gone and implemented a
few extra fields in our search API, plus a few more that are returned ...
see http://sdac.virtualsolar.org/API/VSO_API.html ... and there are more
that we've been asked to add.)
Most of the data is in "self-documenting" formats, such as FITS, HDF, CDF
or NetCDF, but you often have to check the archive's documentation for use
caveats to know what assumptions were made in processing the data.
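To give a flavor of what "self-documenting" means here, a minimal sketch that parses FITS header cards using only the Python standard library (the TELESCOP value below is an invented example, not real archive data):

```python
# Minimal FITS header reader: a FITS file begins with ASCII header "cards",
# 80-character records like "KEYWORD =          value / comment", padded out
# to 2880-byte blocks.  The metadata travels with the data itself.

def parse_fits_header(block: bytes) -> dict:
    """Parse keyword/value cards from a FITS header block."""
    header = {}
    for i in range(0, len(block), 80):
        card = block[i:i + 80].decode("ascii")
        keyword = card[:8].strip()
        if keyword == "END":
            break
        if card[8:10] != "= ":          # comment/blank cards carry no value
            continue
        value = card[10:].split("/", 1)[0].strip()   # drop inline comment
        header[keyword] = value.strip("'").strip()
    return header

# A tiny synthetic header; a real one records telescope, detector,
# observation time, processing keywords, and so on:
cards = [
    "SIMPLE  =                    T / conforms to FITS standard",
    "BITPIX  =                   16 / bits per data value",
    "TELESCOP= 'STEREO_A'           / hypothetical example value",
    "END",
]
block = "".join(c.ljust(80) for c in cards).encode("ascii").ljust(2880, b" ")
header = parse_fits_header(block)
print(header["TELESCOP"])   # -> STEREO_A
```

Of course, the format tells you what each keyword's value is, not what processing assumptions lie behind it -- hence the need to read the archive's caveats anyway.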
> One of the things I'd like to see emerge from these discussions would be a
> lowest common denominator for data set metadata intended for discovery, and
> then specializations for particular domains.
Unfortunately, you use the term "data set".
The problem that we run into with Solar Physics is that you have to define
a "data set" to then be able to describe it. Most other space physics
disciplines organize their data catalogs around the concept of "data
series" (sometimes "data products") which have been processed in the same
way, and then individual "data granules" that are individually resolvable.
We instead have catalogs of each of the individual images taken, and for
each record we track its operating mode and how it was processed.
Sometimes, there might be two similar modes that are cross-calibrated and
merged to get higher cadence. If you look closely at:
http://stereo-ssc.nascom.nasa.gov/cgi-bin/images/?Display=Slideshow;Resolution=512;Start=20070101;Finish=20070102;Detectors=ahead_cor2;Session=1;
you'll see a slight flicker to the top right of the occulting disk. It's
actually a processing artifact in every other image, because the
instrument flips modes between exposures. The problem is much more
obvious when there isn't a feature obscuring it:
http://stereo-ssc.nascom.nasa.gov/cgi-bin/images/?Display=Slideshow;Resolution=512;Start=20090101;Finish=20090102;Detectors=ahead_cor2;Session=1;
Sorry, again, I'm too close to the problem ... I've been venting for a
while because many of the standards assume that there's a one-to-many
relationship between "data sets" and "data granules", and for our images,
it's a many-to-many relationship, as the "data set" is just how you've
chosen to sample the larger collection.
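For what it's worth, the distinction can be sketched in code (the names and selection rules below are mine, not from any standard): in the series-oriented model each granule belongs to exactly one series, whereas in our case a "data set" is just a named selection over the granule catalog, so one granule can fall into many sets:

```python
from dataclasses import dataclass

@dataclass
class Granule:
    """One individually resolvable record, e.g. a single image."""
    granule_id: str
    mode: str          # operating mode at capture time
    processing: str    # how this record was processed

@dataclass
class DataSet:
    """A named sampling of the granule catalog -- not a container."""
    name: str
    predicate: callable   # selection rule over granules

catalog = [
    Granule("img-0001", mode="A", processing="L1"),
    Granule("img-0002", mode="B", processing="L1"),
    Granule("img-0003", mode="A", processing="L2"),
]

# Two overlapping "data sets" defined as samplings of the same catalog:
mode_a  = DataSet("mode-A images",    lambda g: g.mode == "A")
level_1 = DataSet("level-1 products", lambda g: g.processing == "L1")

for ds in (mode_a, level_1):
    members = [g.granule_id for g in catalog if ds.predicate(g)]
    print(ds.name, members)
# img-0001 satisfies both predicates: the set/granule relationship
# is many-to-many, not one-to-many.
```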
> The structural metadata is a separate issue. I wonder if there is likely to
> be a coherent TYPE vocabulary for such data, and whether it can be
> enumerated. Are there tens/hundreds/thousands of different structure types
> for such data? If there are tens or hundreds we have a chance of TYPEing
> them and providing managed schemas for them.
You might need to qualify 'structural'... I mean, there's the 'how it's
written to the file' type structure, there's the 'what each field means'
(more semantic) structure, and then we get into the issue of relationships
between objects. The first one we can get around by using one of the
"self-documenting" formats. The last one, well, that's a big can of worms
... see some of my past presentations, where I discuss the problem, and
propose a reference model for scientific catalogs based on FRBR:
http://vso1.nascom.nasa.gov/vso/misc/_README.txt
http://vso1.nascom.nasa.gov/vso/misc/AGU2008_DataRelationships.ppt
... and then we get to the 'semantic' issue. For the data that I manage,
you can typically describe the dimensions of the data. For the Virtual
Solar Observatory, the scientists defined eight layouts, but we currently
only have data representing five of them:
http://sdac.virtualsolar.org/cgi/show_details.pl?keyword=DATA_LAYOUT
We can then combine that with the 'physical observable', and can come up
with a good idea about what the data is:
http://sdac.virtualsolar.org/cgi/show_details.pl?keyword=PHYSOBS
The problem really comes with time series data, where it's just time
plotted against _anything_. There are a few controlled vocabularies, such
as IVOA's UCD (Unified Content Descriptors):
http://cdsweb.u-strasbg.fr/UCD/ucd1p-words.txt (the list)
http://www.ivoa.net/Documents/latest/UCD.html (documentation)
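To give a flavor of how UCD1+ strings compose, here's a small sketch; the word list is abridged from the real one, and the matching logic is my own illustration, not IVOA code:

```python
# UCD1+ descriptors are semicolon-separated "words", each a dot-separated
# hierarchy, e.g. "phot.flux;em.radio" = a flux measured in the radio band.
KNOWN_WORDS = {"phot.flux", "em.radio", "em.opt", "time", "pos.eq.ra"}  # abridged

def parse_ucd(ucd: str) -> list:
    """Split a UCD string into its component words."""
    return [w.strip().lower() for w in ucd.split(";") if w.strip()]

def is_valid(ucd: str) -> bool:
    """Every word must come from the controlled list."""
    return all(w in KNOWN_WORDS for w in parse_ucd(ucd))

def matches(ucd: str, prefix: str) -> bool:
    """Hierarchical match: 'pos.eq' matches any word beneath it."""
    return any(w == prefix or w.startswith(prefix + ".") for w in parse_ucd(ucd))

print(is_valid("phot.flux;em.radio"))   # -> True
print(matches("pos.eq.ra", "pos.eq"))   # -> True
```

The point is that a few hundred atoms, combined this way, cover a very large space of observable quantities without the list itself exploding.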
In SPASE, the list (in my opinion) is much more complicated, as it's
broken down into categories, and you have to know where to look all over
the document:
  Measured Parameters
    Photons
    Fields
    Particles
    Mixed
  Support Parameters
    Positional
    Temporal
  Other
There are other efforts, such as SESDI, that are trying to model the
different parameters using ontologies so they can do reasoning -- you're
looking for (A), which we don't have, but you can compute it from (B) and
(C).
http://sesdi.hao.ucar.edu/intro.php
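That reasoning pattern ("we don't have A, but it can be computed from B and C") amounts to a search over derivation rules. A toy sketch -- the parameter names and rules here are invented for illustration, not taken from SESDI:

```python
# Each rule says: the target parameter can be derived from the listed inputs.
RULES = {
    "plasma_beta": ["magnetic_field", "ion_pressure"],   # invented rule
    "ion_pressure": ["ion_density", "ion_temperature"],  # invented rule
}

def obtainable(param, available, seen=None):
    """Is `param` held directly, or derivable by chaining rules?"""
    seen = seen or set()
    if param in available:
        return True
    if param in seen or param not in RULES:
        return False
    seen = seen | {param}          # guard against circular rules
    return all(obtainable(dep, available, seen) for dep in RULES[param])

held = {"magnetic_field", "ion_density", "ion_temperature"}
print(obtainable("plasma_beta", held))    # -> True: derivable in two steps
print(obtainable("plasma_beta", set()))   # -> False: nothing to build from
```

A real ontology-backed system also has to reason about units, frames, and validity ranges, which is where most of the hard work is.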
... anyway, just looking at UCD+, they have almost 500 descriptors, but
they're built up from a much smaller list. SPASE uses a similar concept
with 'Qualifiers' to try to keep the lists to a more manageable size. I
haven't looked at what they're doing with geographic and oceanographic
data, but I know they've got a few efforts (MMI for marine data, but I'm
drawing a blank on what the geo group is called):
http://marinemetadata.org/
(okay, and now to go and read up on the standards that other people have
mentioned since I started writing this)
-Joe