Tom & Andrew ...
The pizza example is a great illustration. And while ordering would
provide an implicit prioritization, my instincts tell me that an explicit
capability is in order. For example, how do you know where the primary
keywords stop and the secondary keywords begin?
The more I think about it, the more I think there may be a serious need for
some qualified Subject meta data.
Theoretically, resource discovery will be iterative. On the first pass, I
may only want to retrieve resources that are _primarily_ about a given
keyword(s) and whose meta data has an official status (official position,
accepted, authorized, finalized, etc.). If I don't find what I'm looking
for, I might then expand my search net by accepting unvalidated meta data,
drafts, works in progress, unofficial positions and/or resources that are
not primarily about my topics, but reference them. On the final pass, if I
still don't have what I want, I would probably resort to full text search.
To support this ability to fine tune search scope we need a mechanism for
identifying two things about subject meta data:
1) Is this good, _verified_ (knowledge managed) meta data: has anyone
else looked at it? has it been audited? can it be attested to?
2) Is this the primary topic for the resource, or a contributing topic?
There are several things going on here, including:
1) has the meta data been physically validated? the scheme says Dewey
Decimal System - but is this a valid DDC classification? if it's free text,
has the keyword been spell-checked?
2) has it been intellectually validated? (more of an IntrAnet issue) are we
all using the same keywords to describe the same attributes, have we
addressed homographs like "STOCK" which could mean inventory, soup base, a
financial instrument, the butt of a gun ...
3) has it been attested? (more of an IntErnet issue) has an outside
authority audited and approved the meta data?
A peripheral issue is the best method for representing overall resource
status: Draft, Final, Approved etc.
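As a rough illustration of the iterative search described above, here is a
minimal Python sketch. Everything in it is an assumption made for the
example: the Subject and Resource records, the "primary" and "verified"
flags, the "final" status value and the search() helper are hypothetical
names, not part of any Dublin Core proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subject:
    keyword: str
    primary: bool    # hypothetical qualifier: primary vs. contributing topic
    verified: bool   # hypothetical qualifier: meta data validated/attested


@dataclass(frozen=True)
class Resource:
    title: str
    status: str      # assumed values: "final", "draft", ...
    subjects: tuple
    text: str = ""


def search(resources, keyword):
    """Return (pass_number, titles) for the first pass that finds anything."""
    kw = keyword.lower()

    def has_subject(resource, need_primary, need_verified):
        return any(
            s.keyword.lower() == kw
            and (s.primary or not need_primary)
            and (s.verified or not need_verified)
            for s in resource.subjects
        )

    passes = [
        # Pass 1: primary, verified subjects on resources with official status.
        lambda r: r.status == "final" and has_subject(r, True, True),
        # Pass 2: any subject match, whatever the status or verification.
        lambda r: has_subject(r, False, False),
        # Pass 3: last resort, full text search.
        lambda r: kw in r.text.lower(),
    ]
    for number, accept in enumerate(passes, start=1):
        titles = [r.title for r in resources if accept(r)]
        if titles:
            return number, titles
    return None, []
```

The point of the sketch is only that each pass widens the net, and that the
first two passes are impossible without some explicit qualifier on the
subject meta data.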
From: tom_wason on 02/24/98 01:52 PM
To: meta2 @ mrrl.lut.ac.uk
cc:
Subject: Re: Differentiating significance of DC.Subject meta data and
Yellow Pages
It would seem that the use of ORDERED LISTS in metadata fields is a
powerful tool. Search engines can use list order (1st is highest in
importance) to order search results. And users can request that only
the first n terms be searched. This encourages catalogers to provide as
much information about the object as possible.
Ordered list entry probably won't always be popular, but then, is that
the purpose of metadata? A cataloger may wish to increase the "hit"
rate on an object, but the user wants "best fit" results. An ordered
list provides an intermediary solution, allowing catalogers to enter
many terms. Users may elect not to limit searches by list order.
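A minimal Python sketch of the ordered-list idea, under stated assumptions:
the records mapping, the rank_by_list_order() function and its first_n
parameter are hypothetical names invented for this illustration, and the
sample terms are made up.

```python
def rank_by_list_order(records, term, first_n=None):
    """Rank resources by the position of `term` in their ordered subject lists.

    records: mapping of resource name -> ordered list of subject terms,
             most important term first
    first_n: if given, only the first n terms of each list are searched
    """
    term = term.lower()
    hits = []
    for name, subjects in records.items():
        searched = subjects if first_n is None else subjects[:first_n]
        lowered = [s.lower() for s in searched]
        if term in lowered:
            # A lower index means the cataloger ranked the term higher.
            hits.append((lowered.index(term), name))
    hits.sort()
    return [name for _, name in hits]


records = {
    "Timber yard": ["timber", "firewood", "pizza"],
    "Restaurant": ["pizza", "pasta"],
}
print(rank_by_list_order(records, "pizza"))             # ['Restaurant', 'Timber yard']
print(rank_by_list_order(records, "pizza", first_n=2))  # ['Restaurant']
```

With no limit the timber yard is still found, just ranked lower; with
first_n=2 it drops out, which is the "best fit" behaviour a user could ask
for.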
--Tom Wason
Andrew Waugh wrote:
>
> Peter,
>
> You are raising two separate issues:
> * The intentional addition of inaccurate metadata
> * Prioritising metadata to avoid misleading
>
> They are almost the converse of each other.
>
> I saw a good example of confusing metadata about two years ago when I
> was shown a prototype of an electronic 'yellow pages' kiosk. Searching
> this system for the term 'pizza' brought up a list of section headings.
> 'Restaurants' was towards the bottom of the list. Above it were things
> like 'Timber Yards'. These, upon investigation, sold wood for wood fired
> pizza ovens.
>
> Pizza was not a section heading; the system was relying on the term
> occurring in the business name or the descriptive text (really keywords)
> supplied by the business. Unfortunately the answer most people were
> looking for came far down the list (and under a section heading most
> people would not have expected).
>
> An answer to this problem may be to prioritise the keywords (the
> prioritisation is what I would call an annotation to the value).
> However, a major problem with metadata is the cost of its production
> and capturing more metadata simply adds to this cost and potentially
> impedes the spread of metadata.
>
> It is probably more efficient *at the moment* to improve the way the
> response is presented to the user. In the pizza example, it is likely
> that the greatest number of hits were in 'Restaurants'; that category
> should have been presented first. More difficult situations (like your
> beer example) would require clever heuristics... Result summarisation is
> a very interesting research area.
>
> andrew waugh
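
Andrew's suggestion of presenting the section with the most hits first could
be sketched in Python roughly as follows. This is only an illustration: the
(section, description) listing format and the sections_by_hit_count()
function are assumptions, not a description of how the kiosk prototype
worked.

```python
from collections import Counter


def sections_by_hit_count(listings, term):
    """Order section headings by how many of their listings mention `term`.

    listings: iterable of (section_heading, descriptive_text) pairs
    """
    term = term.lower()
    counts = Counter(
        section
        for section, description in listings
        if term in description.lower()
    )
    # most_common() sorts by count, highest first.
    return [section for section, _ in counts.most_common()]


listings = [
    ("Timber Yards", "wood for pizza ovens"),
    ("Restaurants", "pizza and pasta"),
    ("Restaurants", "wood fired pizza"),
    ("Florists", "roses"),
]
print(sections_by_hit_count(listings, "pizza"))  # ['Restaurants', 'Timber Yards']
```

This needs no extra meta data from the cataloger, which is its appeal, but
it only works when the expected answer also happens to be the most common
one.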
--
--------------------------------------------
Thomas D. Wason, Ph.D.
Director of Research and Evaluation
Institute for Academic Technology
University of North Carolina at Chapel Hill
730 Airport Rd., Suite 100
Chapel Hill, North Carolina 27599 USA
919.962.9286
919.962.4321 FAX