From: Peter Graham, Rutgers University Libraries

The following 2-page excerpt from a dissertation recently submitted at Rutgers (which happened to catch my eye in the cataloging department) may be of interest to meta2 people who follow what's happening in parallel streams. Comments back to me will be irrelevant.

Richard D. Holowczak, Extractors for Digital Library Objects, dissertation submitted to the Graduate School--Newark, of Rutgers, the State University of New Jersey (Newark: May, 1997).

<excerpt from Chapter 1, Introduction, pp. 2-4>

1.1.2 Metadata Based Retrieval

Currently, the query/retrieval process hinges on two distinct paradigms. In the first, objects are retrieved based on queries against metadata, which are descriptors of what the object contains. For example, metadata for a musical score might include the title, composer, key, tempo, dates composed, date published and publisher. Retrieval of musical scores from a collection can only be accomplished by querying within the metadata. Metadata based queries are efficient and accurate, and almost all computerized library searching is done in this fashion.

The shortcomings of metadata are that they are static, meaning users are unable to add new attributes; they are constructed prior to the retrieval process for efficiency; and querying must be done within the metadata attributes provided. For example, in a library, all new books must be cataloged electronically by book title, author, ISBN, publisher, publication date, number of pages, etc. This is currently a manageable task performed by librarians every day as new books arrive. However, if some new metadata attribute were to be added, such as "Hard cover or Paperback", the task of updating this attribute for all books currently in the library would be insurmountable.

1.1.3 Content Based Retrieval

In the second paradigm, digital library objects are retrieved based on the actual content itself.
A content-based query searches a collection for occurrences of a particular pattern specified in the query. Content-based queries are most often associated with text documents, where the presence of keywords is used to determine the relevance of the document. In general, keyword searches and matching are poor at recall and accuracy [9], even when thesauri are consulted for keyword synonyms [23]. This is because users are forced to supply search terms that match the exact words used in the documents, as opposed to interacting with more generally accepted concepts. Some recent work has also been done for image data, where the contents of images are matched against a pattern indicated by the user. Such approaches remain highly domain-specific, however.

Content based queries are also limited in the types of queries they can satisfy. It is not possible to query an object based on an inferred concept that is not directly contained in the object. For example, the combinations of words in a book may elicit a sorrowful reaction from a reader, yet we are not able to query a library for "books that make me sad."

1.2 Extractors

The goal of this research is to merge the generality of content-based searching with the structure and accuracy of metadata based queries by developing a uniform mechanism for deriving metadata from the digital library objects. The broad term given to this mechanism is an extractor. Extractors are computer programs that extract concepts from information content and make these concepts a part of the metadata (what we call a conceptual index), which can then be queried. A concept is an abstraction of the underlying data or information that more closely matches the notions and thoughts of the users. A general example of extractors can be seen in Figure 1.1. Concepts from a science fiction novel might include "protagonist" and "means of transportation", while concepts from an autobiography might include "birth place", "education" and "first job".
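
As an illustration only (this is not taken from the dissertation), the idea of an extractor feeding a conceptual index might be sketched roughly as follows. The regular-expression rules, the function names `extract_concepts`, `build_conceptual_index` and `query_index`, and the sample summaries are all hypothetical; the excerpt does not describe the actual extraction method, so simple pattern matching is swapped in here purely to show the flow from text to queryable metadata.

```python
import re
from collections import defaultdict

# Hypothetical extraction rules: concept name -> regex with one capture group.
# These patterns are illustrative assumptions, not the dissertation's method.
RULES = {
    "birth place": re.compile(r"born in ([A-Z][a-z]+(?: [A-Z][a-z]+)*)"),
    "first job": re.compile(r"first job was as an? ([a-z]+)"),
}


def extract_concepts(summary):
    """Return {concept: value} for every rule that matches the summary."""
    concepts = {}
    for concept, pattern in RULES.items():
        match = pattern.search(summary)
        if match:
            concepts[concept] = match.group(1)
    return concepts


def build_conceptual_index(summaries):
    """Map (concept, value) pairs to the set of object ids that carry them."""
    index = defaultdict(set)
    for object_id, summary in summaries.items():
        for concept, value in extract_concepts(summary).items():
            index[(concept, value.lower())].add(object_id)
    return index


def query_index(index, concept, value):
    """Metadata-style lookup against the derived conceptual index."""
    return sorted(index.get((concept, value.lower()), set()))


# Made-up text summaries standing in for digital library objects.
summaries = {
    "bio-001": "The author was born in New Brunswick. Her first job was as a librarian.",
    "bio-002": "He was born in Newark and his first job was as a clerk.",
}

index = build_conceptual_index(summaries)
print(query_index(index, "birth place", "Newark"))
```

The point of the sketch is that the query in the last line never mentions a keyword from the text directly; it asks about a concept ("birth place") that was derived from the content and stored as metadata.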
Extractors for other data types can also be considered. For example, it would be desirable to have an extractor capable of eliciting concepts about the content of a picture or map, or about the theme of a movie. For this dissertation, we propose to formalize the methodology of extractors for text documents in a digital library and to develop methods for managing the extracted conceptual indexes. Our general approach is to provide each object in the digital library with a text summary. Extractors will operate on each summary to produce conceptual indexes that are then made available for searching.

</excerpt from Chapter 1, Introduction, pp. 2-4>

***********************************************************************
******* New area code required in November, is usable now (was 908) ******
Peter Graham
Rutgers University Libraries
169 College Ave., New Brunswick, NJ 08903
(732)445-5908; fax(732)445-5888
<URL:http://aultnis.rutgers.edu/pghome.html>