Mark,
On 2006 Aug 1, at 16.50, Mark Taylor wrote:
>> If you add UTYPEs to your published data, then you have at least
>> documented what you intend that data to mean, via a dereferencable
>> URL, whether or not any of the stuff I talk about below ever actually
>> happens.
>
> I think that's true, and makes utypes worth using, but we're still
> talking about semantic value which is only accessible to humans.
At this stage, yes. Making that semantic information practically
useful to machines is of course the trick.
>>> You've got to decide whether you're looking
>>> at UCD1 or UCD1+, attempt to make sense of what a load of words
>>> separated by semicolons mean, decide whether, say, phot.mag.reddFree
>>> is an acceptable stand-in for phot.mag, think about whether you
>>> need to perform unit conversions for the quantity that you've
>>> identified to mean what you think it means...
>
> and I forgot to add: what do you do if there are multiple columns
> which have the UCD you're looking for?
If UCDs are all you have, then you will be stuck a lot of the time,
because UCDs are, by design, not necessarily specific enough to drive
processing; from this point of view they can only be a fall-back.
What should be possible with the combination of UCDs and UTYPE is to
start with the possibly multiple UTYPEs annotating a column, and:
1. Do you recognise any of these, as strings? If so, you're done.
2. Get a list of things (UTYPEs and UCDs) that are equivalent to, or
more general than, that list of UTYPEs, in increasing order of
generality: do you recognise any of these, as strings (I think
parsing UCDs is probably unlikely to help much)? If so, you're done.
3. Oh well. Start grubbing around in the column names, and applying
all the heuristics you currently apply.
Step 2 is supposed to make things easier. If folk do start
annotating with UTYPEs, and an adequate network of relations can be
built up, then that step will disambiguate columns with identical
UCDs; it'll tell you, without you having to parse anything, that
phot.mag.reddFree has a more specific meaning than phot.mag; and
given that there are relations between UCD1 and the mutating
vocabulary list of UCD1+, you don't have to worry about the
difference there either. So it does at least address the UCD
problems you noted. [`adequate', here, means `enough to make this
work', and I don't have reliable intuition about how much that
actually is.]
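
A minimal sketch of steps 1 to 3, to make the scenario concrete. Everything
named here is an assumption for illustration: the relation table BROADER, the
UTYPE "ex:DereddenedVMag", and the functions generalisations() and identify()
are invented, and a real network of relations would be declared elsewhere
rather than hard-coded.

```python
# Hypothetical relation table: each term maps to the terms that are
# directly equivalent to, or more general than, it.
BROADER = {
    "ex:DereddenedVMag": ["phot.mag.reddFree"],  # made-up UTYPE
    "phot.mag.reddFree": ["phot.mag"],
}


def generalisations(terms, broader):
    """Yield terms reachable from `terms`, least general first (breadth-first)."""
    seen = set(terms)
    frontier = list(terms)
    while frontier:
        next_frontier = []
        for term in frontier:
            for general in broader.get(term, []):
                if general not in seen:
                    seen.add(general)
                    next_frontier.append(general)
                    yield general
        frontier = next_frontier


def identify(utypes, known, broader=BROADER):
    """Return (term, step): the first recognised term and the step that found it.

    Step 1: one of the column's UTYPEs is recognised as a plain string.
    Step 2: a more general UTYPE or UCD is recognised as a plain string.
    Step 3: nothing recognised; the caller falls back to its heuristics.
    """
    for utype in utypes:
        if utype in known:
            return utype, 1
    for general in generalisations(utypes, broader):
        if general in known:
            return general, 2
    return None, 3
```

With this table, an application that knows only the string "phot.mag" would
resolve a column annotated "ex:DereddenedVMag" at step 2, without parsing
either the UTYPE or the UCD.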
Now, that isn't supposed to be magic, and there's a fair amount of
labour involved there in declaring the relations, but a scenario like
that is I think realistic.
And if it doesn't work, you're not any worse off than you were before.
> My feeling is that most of
> the questions to which UCDs/utypes appear to provide an answer
> are ones which actually require a human in the loop. For example,
> there may well be no correct answer to "are any of these utypes
> like phot.mag?", even given a well-defined state of a particular
> data processing system, because it depends on the kind of analysis
> that the scientist using the software has got in mind at the time.
Your second remark is very true. Context sometimes matters, and
while it should be possible to work that into the reasoning I'm
talking about, it'll be harder at least. Your first remark can only
really be answered by trying it, though I think I'm more optimistic
about it than you are.
Also, following Malcolm's remark:
> Is full automation critical? I can envisage where you want to know
> whether a registered data source has relevant fields, but at some
> point you will need to consult detailed information on the semantics of
> columns to discover if a particular catalogue is pertinent to the
> specific research project. Is it a big deal that some natural
> intelligence is brought to bear? The astronomer can judge whether the
> match is adequate. The UCDs can help identify potential matches. [It
> would be handy to be able to see the matches in order of likelihood
> (expert system even) and then to be able to click on each to read
> descriptions of the column, and then choose the appropriate matches.]
Yes: there will probably be plenty of cases where you only need to
know with some degree of approximation what a column is. There will
be other cases where you need to know exactly, and there the
application will be written or configured so that, if it doesn't
recognise the UTYPE, it doesn't try reasoning about it; or it might
let the user veto the deduced match.
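
As a rough illustration of that configuration choice, and no more than that:
the function accept_match() and its parameters strict and confirm are
hypothetical names, not part of any existing application.

```python
def accept_match(term, deduced, strict=False, confirm=None):
    """Decide whether to use a candidate identification of a column.

    term    -- the recognised UTYPE or UCD string, or None if nothing matched
    deduced -- True if the match came from reasoning over relations,
               False if the column's own UTYPE was recognised directly
    strict  -- if True, never accept a deduced match
    confirm -- optional callback letting the user veto a deduced match
    """
    if term is None:
        return False
    if not deduced:
        return True   # exact recognition is always acceptable
    if strict:
        return False  # "need to know exactly": do not reason about it
    if confirm is not None:
        return bool(confirm(term))  # the user may veto
    return True       # approximate knowledge is good enough
```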
But as I say, I should go ahead and try it. Can anyone point me
towards some list of documented column names?
And I should write shorter messages. I'm only thinking aloud, I
suppose.
See you,
Norman
--
----------------------------------------------------------------------------
Norman Gray / http://nxg.me.uk
eurovotech.org / University of Leicester, UK