JISCMail - DC-GENERAL Archives

From:  Peter Graham, Rutgers University Libraries

The following 2-page excerpt from a dissertation recently submitted at
Rutgers (which happened to catch my eye in the cataloging department) may be
of interest to meta2 people who follow what's happening in parallel streams. 
Comments back to me will be relayed.

Richard D. Holowczak, Extractors for Digital Library Objects, dissertation
submitted to the Graduate School--Newark, of Rutgers, the State University of
New Jersey (Newark:  May, 1997). 

<excerpt from Chapter 1, Introduction, pp. 2-4>

1.1.2  Metadata Based Retrieval

Currently, the query/retrieval process hinges on two distinct paradigms.  In
the first, objects are retrieved based on queries to metadata which are
descriptors of what the object contains.  For example, metadata for a musical
score might include the title, composer, key, tempo, dates composed, date
published and publisher.
Retrieval of musical scores from a collection can then be accomplished by
querying within the metadata.  Metadata-based queries are efficient and
accurate, and almost all computerized library searching is done in this
fashion.
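
To make this first paradigm concrete, here is a minimal Python sketch.  It is
an illustration only, not code from the dissertation: the ScoreRecord class,
the query() helper and the three catalog records are all hypothetical.  The
fields follow the musical score example above, with the date fields reduced
to a single publication year for brevity.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ScoreRecord:
    """Hypothetical fixed metadata schema for one musical score."""
    title: str
    composer: str
    key: str
    tempo: str
    year_published: int
    publisher: str


# Made-up placeholder records, for illustration only.
CATALOG = [
    ScoreRecord("Example Sonata", "Composer A", "C minor", "Allegro", 1901, "Example Press"),
    ScoreRecord("Example Nocturne", "Composer B", "E-flat major", "Andante", 1923, "Sample Music Co."),
    ScoreRecord("Example Etude", "Composer A", "C minor", "Presto", 1910, "Example Press"),
]


def query(catalog, **criteria):
    """Return the records whose metadata equal every given criterion.

    Only attributes defined in the schema can be queried; anything else
    is rejected, because the metadata were fixed before retrieval time.
    """
    known = {f.name for f in fields(ScoreRecord)}
    unknown = set(criteria) - known
    if unknown:
        raise ValueError(f"not a metadata attribute: {sorted(unknown)}")
    return [
        record
        for record in catalog
        if all(getattr(record, name) == value for name, value in criteria.items())
    ]


if __name__ == "__main__":
    for record in query(CATALOG, composer="Composer A", key="C minor"):
        print(record.title)
```

A query on an attribute outside the schema, such as query(CATALOG,
binding="paperback"), raises ValueError in this sketch, which mirrors the
restriction that querying must stay within the attributes provided.
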
        The shortcomings of metadata are that they are static, meaning users
are unable to add new attributes; they are constructed a priori, before the
retrieval process, for efficiency; and querying must be done within the
metadata attributes provided.  For example, in a library, all new books must
be cataloged electronically by book title, author, ISBN, publisher,
publication date, number of pages, etc.  This is currently a manageable task
performed by librarians every day as new books arrive.  However, if some new
metadata attribute were to be added, such as "Hard cover or Paperback", the
task of updating this attribute for all books currently in the library would
be insurmountable.

1.1.3  Content Based Retrieval

In the second paradigm, digital library objects are retrieved based on the
actual content itself.  A content-based query searches for occurrences of a
particular pattern specified in the query within a collection.  Content-based
queries are most often associated with text documents where the presence of
keywords is used to determine the relevance of the document.  In general,
keyword searches and matching are poor at recall and accuracy [9] even when
thesauri are consulted for keyword synonyms [23].  This is because users are
forced to supply search terms that match the exact words used in the
documents as opposed to interacting with more generally accepted concepts. 
Some recent work has also been done for image data where the contents of
images are matched against a pattern indicated by the user.  Such approaches
remain highly domain-specific, however.
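
The recall problem with exact keyword matching can be shown with a small
Python sketch.  This is an illustration under stated assumptions, not the
method evaluated in the cited work: the three documents and the
keyword_search() helper are hypothetical.

```python
import re

# Made-up documents; doc1 and doc2 are about the same concept but use
# different words for it.
DOCUMENTS = {
    "doc1": "The automobile was parked outside the library.",
    "doc2": "She bought a new car last spring.",
    "doc3": "Trains are a common means of transportation.",
}


def tokenize(text):
    """Lower-case the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_search(documents, keyword):
    """Return ids of documents that contain the exact keyword as a word."""
    wanted = keyword.lower()
    return sorted(
        doc_id for doc_id, text in documents.items() if wanted in tokenize(text)
    )


if __name__ == "__main__":
    print(keyword_search(DOCUMENTS, "car"))  # prints ['doc2']
```

The search for "car" misses doc1, which uses "automobile", because the user's
term must match the exact word in the document.
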
        Content based queries are also limited in the types of queries they
can satisfy.  It is not possible to query an object based on an inferred
concept not directly contained in the object.  For example, the combinations
of words in a book may elicit a sorrowful reaction from a reader, yet we are
not able to query a library for "books that make me sad."

1.2 Extractors

The goal of this research is to merge the generality of content-based
searching with the structure and accuracy of metadata based queries by
developing a uniform mechanism for deriving metadata from the digital library
objects.  The broad term given to this mechanism is an extractor.  Extractors
are computer programs that extract concepts from information content and make
these concepts a part of the metadata (what we call a conceptual index) which
can then be queried.  A concept is an abstraction of the underlying data or
information that more closely matches the notions and thoughts of the users. 
A general example of extractors can be seen in Figure 1.1.  Concepts from a
science fiction novel might include "protagonist" and "means of
transportation" while concepts from an autobiography might include "birth
place", "education" and "first job".  Extractors for other data types can
also be considered.  For example, it would be desirable to have an extractor
capable of eliciting concepts about the content of a picture or map, or about
the theme of a movie.
        For this dissertation, we propose to formalize the methodology of
extractors for text documents in a digital library and to develop methods for
managing the extracted conceptual indexes.  Our general approach is to
provide each object in the digital library with a text summary.  Extractors
will operate on each summary to produce conceptual indexes that are then made
available for searching.
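
As a rough illustration of the pipeline just described (summary, extractor,
conceptual index, query), here is a minimal Python sketch.  It is not the
dissertation's method, which the excerpt does not specify: the regular
expression patterns, the function names and the two summaries are all
hypothetical, and real extractors would need something far more robust than
pattern matching.  The concept names come from the autobiography example.

```python
import re
from collections import defaultdict

# Hypothetical extraction rules: one pattern per concept, each capturing
# the text up to the next comma, period or semicolon.
AUTOBIOGRAPHY_PATTERNS = {
    "birth place": re.compile(r"born in ([^,.;]+)", re.IGNORECASE),
    "education": re.compile(r"studied ([^,.;]+)", re.IGNORECASE),
    "first job": re.compile(r"first worked as (?:an? )?([^,.;]+)", re.IGNORECASE),
}

# Made-up text summaries, one per digital library object.
SUMMARIES = {
    "bio-1": "The author was born in Springfield, studied chemistry at a "
             "state college, and first worked as a cataloger.",
    "bio-2": "Born in Riverton, the author first worked as a teacher.",
}


def extract_concepts(summary, patterns):
    """Apply each pattern to a summary and return {concept: value}."""
    concepts = {}
    for concept, pattern in patterns.items():
        match = pattern.search(summary)
        if match:
            concepts[concept] = match.group(1).strip()
    return concepts


def build_conceptual_index(summaries, patterns):
    """Build {concept: {object_id: value}} from all summaries."""
    index = defaultdict(dict)
    for object_id, summary in summaries.items():
        for concept, value in extract_concepts(summary, patterns).items():
            index[concept][object_id] = value
    return dict(index)


def query_index(index, concept, value):
    """Return ids of objects whose extracted concept equals the value."""
    entries = index.get(concept, {})
    return sorted(
        object_id for object_id, found in entries.items()
        if found.lower() == value.lower()
    )


if __name__ == "__main__":
    index = build_conceptual_index(SUMMARIES, AUTOBIOGRAPHY_PATTERNS)
    print(query_index(index, "birth place", "Springfield"))  # prints ['bio-1']
```

Once built, the index is queried like ordinary metadata, while its attributes
were derived from content rather than cataloged by hand.
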

</excerpt from Chapter 1, Introduction, pp. 2-4>
***********************************************************************

******* New area code required in November, is usable now (was 908) ******
Peter Graham     Rutgers University Libraries
169 College Ave., New Brunswick, NJ 08903  (732)445-5908; fax(732)445-5888
               <URL:http://aultnis.rutgers.edu/pghome.html>