On Tue, 24 Feb 2009, Stu Weibel wrote:
> Hi, Joe,
>
>> I'm probably too close to this subject, as I've been working with the
>> group developing SPASE, a metadata effort for describing the holdings of
>> space physics archives, and there are some issues with what metadata is
>> useful in different sub-communities, as well as some issues with
>> terminology and how the different communities group their data into
>> collections.
> Are your metadata standards available to the public?
Yes. The current SPASE draft is at:
http://www.spase-group.org/data/doc/spase-1_3_4-draft.pdf
We hope to have it stable for version 1.4 in the next couple of months.
For solar physics, we use a much simpler model to describe the data:
http://vso1.nascom.nasa.gov/docs/wiki/DataModel18
(The thing is -- it's what the scientists wrote, and there have been a few
proposed version 2.0s, but none of them stuck. We've gone and implemented a
few extra fields in our search API, plus a few more that are returned ...
see http://sdac.virtualsolar.org/API/VSO_API.html ... and there are more
that we've been asked to add.)
Most of the data is in "self-documenting" formats, such as FITS, HDF, CDF
or NetCDF, but you often have to check the archive's documentation for use
caveats to know what assumptions were made in processing the data.
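To give a flavor of what "self-documenting" means here, a minimal sketch that parses FITS header cards using only the Python standard library (the TELESCOP value below is an invented example, not real archive data):

```python
# Minimal FITS header reader: a FITS file begins with ASCII header "cards",
# 80-character records like "KEYWORD =          value / comment", padded out
# to 2880-byte blocks.  The metadata travels with the data itself.

def parse_fits_header(block: bytes) -> dict:
    """Parse keyword/value cards from a FITS header block."""
    header = {}
    for i in range(0, len(block), 80):
        card = block[i:i + 80].decode("ascii")
        keyword = card[:8].strip()
        if keyword == "END":
            break
        if card[8:10] != "= ":          # comment/blank cards carry no value
            continue
        value = card[10:].split("/", 1)[0].strip()   # drop inline comment
        header[keyword] = value.strip("'").strip()
    return header

# A tiny synthetic header; a real one records telescope, detector,
# observation time, processing keywords, and so on:
cards = [
    "SIMPLE  =                    T / conforms to FITS standard",
    "BITPIX  =                   16 / bits per data value",
    "TELESCOP= 'STEREO_A'           / hypothetical example value",
    "END",
]
block = "".join(c.ljust(80) for c in cards).encode("ascii").ljust(2880, b" ")
header = parse_fits_header(block)
print(header["TELESCOP"])   # -> STEREO_A
```

Of course, the format tells you what each keyword's value is, not what processing assumptions lie behind it -- hence the need to read the archive's caveats anyway.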
> One of the things I'd like to see emerge from these discussions would be a
> lowest common denominator for data set metadata intended for discovery, and
> then specializations for particular domains.
Unfortunately, you use the term "data set".
The problem that we run into with Solar Physics is that you have to define
a "data set" to then be able to describe it. Most other space physics
disciplines organize their data catalogs around the concept of "data
series" (sometimes "data products") which have been processed in the same
way, and then individual "data granules" that are individually resolvable.
We instead have catalogs of each of the individual images taken, and for
each record we track its operating mode and how it was processed.
Sometimes, there might be two similar modes that are cross-calibrated and
merged to get higher cadence. If you look closely at:
http://stereo-ssc.nascom.nasa.gov/cgi-bin/images/?Display=Slideshow;Resolution=512;Start=20070101;Finish=20070102;Detectors=ahead_cor2;Session=1;
you'll see a slight flicker to the top right of the occulting disk. It's
actually a processing artifact in every other image, because the
instrument flips modes between exposures. The problem is much more
obvious when there isn't a feature obscuring it:
http://stereo-ssc.nascom.nasa.gov/cgi-bin/images/?Display=Slideshow;Resolution=512;Start=20090101;Finish=20090102;Detectors=ahead_cor2;Session=1;
Sorry, again, I'm too close to the problem ... I've been venting for a
while because many of the standards assume that there's a one-to-many
relationship between "data sets" and "data granules", and for our images,
it's a many-to-many relationship, as the "data set" is just how you've
chosen to sample the larger collection.
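For what it's worth, the distinction can be sketched in code (the names and selection rules below are mine, not from any standard): in the series-oriented model each granule belongs to exactly one series, whereas in our case a "data set" is just a named selection over the granule catalog, so one granule can fall into many sets:

```python
from dataclasses import dataclass

@dataclass
class Granule:
    """One individually resolvable record, e.g. a single image."""
    granule_id: str
    mode: str          # operating mode at capture time
    processing: str    # how this record was processed

@dataclass
class DataSet:
    """A named sampling of the granule catalog -- not a container."""
    name: str
    predicate: callable   # selection rule over granules

catalog = [
    Granule("img-0001", mode="A", processing="L1"),
    Granule("img-0002", mode="B", processing="L1"),
    Granule("img-0003", mode="A", processing="L2"),
]

# Two overlapping "data sets" defined as samplings of the same catalog:
mode_a  = DataSet("mode-A images",    lambda g: g.mode == "A")
level_1 = DataSet("level-1 products", lambda g: g.processing == "L1")

for ds in (mode_a, level_1):
    members = [g.granule_id for g in catalog if ds.predicate(g)]
    print(ds.name, members)
# img-0001 satisfies both predicates: the set/granule relationship
# is many-to-many, not one-to-many.
```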
> The structural metadata is a separate issue. I wonder if there is likely to
> be a coherent TYPE vocabulary for such data, and whether it can be
> enumerated. Are there tens/hundreds/thousands of different structure types
> for such data? If there are tens or hundreds we have a chance of TYPEing
> them and providing managed schemas for them.
You might need to qualify 'structural'... I mean, there's the 'how it's
written to the file' type structure, there's the 'what each field means'
(more semantic) structure, and then we get into the issue of relationships
between objects. The first one we can get around by using one of the
"self-documenting" formats. The last one, well, that's a big can of worms
... see some of my past presentations, where I discuss the problem, and
propose a reference model for scientific catalogs based on FRBR:
http://vso1.nascom.nasa.gov/vso/misc/_README.txt
http://vso1.nascom.nasa.gov/vso/misc/AGU2008_DataRelationships.ppt
... and then we get to the 'semantic' issue. For the data that I manage,
you can typically describe the dimensions of the data. For the Virtual
Solar Observatory, the scientists defined eight layouts, but we currently
only have data representing five of them:
http://sdac.virtualsolar.org/cgi/show_details.pl?keyword=DATA_LAYOUT
We can then combine that with the 'physical observable', and can come up
with a good idea about what the data is:
http://sdac.virtualsolar.org/cgi/show_details.pl?keyword=PHYSOBS
The problem really comes with time series data, where it's just time
plotted against _anything_. There are a few controlled vocabularies, such
as IVOA's UCD (Unified Content Descriptors):
http://cdsweb.u-strasbg.fr/UCD/ucd1p-words.txt (the list)
http://www.ivoa.net/Documents/latest/UCD.html (documentation)
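To give a flavor of how UCD1+ strings compose, here's a small sketch; the word list is abridged from the real one, and the matching logic is my own illustration, not IVOA code:

```python
# UCD1+ descriptors are semicolon-separated "words", each a dot-separated
# hierarchy, e.g. "phot.flux;em.radio" = a flux measured in the radio band.
KNOWN_WORDS = {"phot.flux", "em.radio", "em.opt", "time", "pos.eq.ra"}  # abridged

def parse_ucd(ucd: str) -> list:
    """Split a UCD string into its component words."""
    return [w.strip().lower() for w in ucd.split(";") if w.strip()]

def is_valid(ucd: str) -> bool:
    """Every word must come from the controlled list."""
    return all(w in KNOWN_WORDS for w in parse_ucd(ucd))

def matches(ucd: str, prefix: str) -> bool:
    """Hierarchical match: 'pos.eq' matches any word beneath it."""
    return any(w == prefix or w.startswith(prefix + ".") for w in parse_ucd(ucd))

print(is_valid("phot.flux;em.radio"))   # -> True
print(matches("pos.eq.ra", "pos.eq"))   # -> True
```

The point is that a few hundred atoms, combined this way, cover a very large space of observable quantities without the list itself exploding.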
In SPASE, the list (in my opinion) is much more complicated, as it's
broken down into categories, and you have to know where to look all over
the document:
  Measured Parameters
    Photons
    Fields
    Particles
    Mixed
  Support Parameters
    Positional
    Temporal
  Other
There are other efforts, such as SESDI, that are trying to model the
different parameters using ontologies so they can do reasoning -- you're
looking for (A), which we don't have, but you can compute it from (B) and
(C).
http://sesdi.hao.ucar.edu/intro.php
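That reasoning pattern ("we don't have A, but it can be computed from B and C") amounts to a search over derivation rules. A toy sketch -- the parameter names and rules here are invented for illustration, not taken from SESDI:

```python
# Each rule says: the target parameter can be derived from the listed inputs.
RULES = {
    "plasma_beta": ["magnetic_field", "ion_pressure"],   # invented rule
    "ion_pressure": ["ion_density", "ion_temperature"],  # invented rule
}

def obtainable(param, available, seen=None):
    """Is `param` held directly, or derivable by chaining rules?"""
    seen = seen or set()
    if param in available:
        return True
    if param in seen or param not in RULES:
        return False
    seen = seen | {param}          # guard against circular rules
    return all(obtainable(dep, available, seen) for dep in RULES[param])

held = {"magnetic_field", "ion_density", "ion_temperature"}
print(obtainable("plasma_beta", held))    # -> True: derivable in two steps
print(obtainable("plasma_beta", set()))   # -> False: nothing to build from
```

A real ontology-backed system also has to reason about units, frames, and validity ranges, which is where most of the hard work is.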
... anyway, just looking at UCD+, they have almost 500 descriptors, but
they're built up from a much smaller list. SPASE uses a similar concept
with 'Qualifiers' to try to keep the lists to a more manageable size. I
haven't looked at what they're doing with geographic and oceanographic
data, but I know they've got a few efforts (MMI for marine data, but I'm
drawing a blank on what the geo group is called):
http://marinemetadata.org/
(okay, and now to go and read up on the standards that other people have
mentioned since I started writing this)
-Joe