This first bit isn't quite on topic (check for the on-topic bit later in
the message), but here are a few steps that I've found you need to
follow in order to get a Mail repository working effectively:
1) The first step is choosing the best means for archiving the mail.
For example, systems such as hypermail
(http://www.landfield.com/hypermail/) and MHonArc
(http://www.oac.uci.edu/indiv/ehood/mhonarc.doc.html) provide an
interface that is okay for very low volume mail archives - once you get
a couple of hundred messages, the index page becomes really large.
However, they convert mail into "threaded" discussions, and save the
files as HTML format, making it easy to present through a Web interface.
2) The second step is choosing a means to extract the "metadata" for the
document. Stuff like DC.Creator and DC.Date is easy enough - they come
from fields in the mail message. You could roll your own Perl script, or
try products such as BlueAngel (http://www.blueangeltech.com/) which are
designed around metadata harvesting. Or perhaps you'd like a full-text
indexing engine? Try something like the "Open Source" Perlfect Search
(http://www.perlfect.com/freescripts/search/), or a commercial product
like Verity Search 97 (http://www.verity.com) or Fulcrum
(http://www.pcdocs.com/Products/index.htm).
3) Finally, you need to figure out how to write an interface to your
mail archive.
At my place, the mail becomes part of a huge repository combining
project notes, product documentation, quotes and reports. We're about to
implement a "Show me stuff about..." feature, which will use subject codes
on all electronic documents and digital surrogates to return a list of
documents that are, for example, about "current projects", or "customer
X".
Of course, we also have the obligatory text-box-with-search-button :)
*** ***
*** Now for the closer-to-topic bit: ***
*** ***
As far as extracting metadata from email goes, how's this for starters?
DC.Title = SUBJECT:
DC.Description = Summary or snippet of body of message
DC.Subject = automatically extracted keywords or subject codes
DC.Creator = (directory entry for) user specified in the FROM: field
DC.Contributor = empty, or fill it in with the author that wrote the
message that this message is a response to.
DC.Publisher = Company
DC.Date = Date that mail was sent
DC.Identifier = Message-ID from the mail headers
DC.Relation = link to message that this message is a response to, and
links to documents referred to in the body
DC.Coverage = empty
DC.Rights = (c) Company
Now here I get mixed up again :)
DC.Type = electronic document
DC.Format = text/plain
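
As a rough sketch of the mapping above, here is how it might look with
Python's standard "email" module. The function name, the "Example Co"
default and the 200-character snippet length are my own assumptions, not
part of any existing tool. DC.Creator is left as the bare address that a
directory lookup would start from, and DC.Subject, DC.Contributor and
DC.Coverage are left out because they need more than the headers.

```python
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime


def dublin_core_from_mail(raw, company="Example Co"):
    """Map one plain-text mail message onto the DC fields listed above."""
    msg = message_from_string(raw)
    body = msg.get_payload()
    if not isinstance(body, str):
        body = ""  # multipart mail is out of scope for this sketch
    _name, address = parseaddr(msg.get("From", ""))
    date_header = msg.get("Date")
    sent = parsedate_to_datetime(date_header).isoformat() if date_header else ""
    return {
        "DC.Title": msg.get("Subject", ""),
        # Snippet: collapse whitespace, keep the first 200 characters.
        "DC.Description": " ".join(body.split())[:200],
        # Address only; resolving it to a directory entry is left out.
        "DC.Creator": address,
        "DC.Publisher": company,
        "DC.Date": sent,
        "DC.Identifier": msg.get("Message-ID", ""),
        # Only the replied-to message; links found in the body are left out.
        "DC.Relation": msg.get("In-Reply-To", ""),
        "DC.Rights": "(c) " + company,
        "DC.Type": "electronic document",
        "DC.Format": msg.get_content_type(),
    }
```

Feed it the raw text of one message and store the resulting dictionary
alongside the archived HTML.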
One way to "automatically extract keywords" is to use a technique
similar to the Fujitsu "Heart" project. Basically, have a database of
words that are important to you/your organisation. Scan the incoming
messages for these words. If the words appear in the body, add them in
to the DC.Subject field. You may also decide that certain words in the
body should result in other words or codes being inserted into the
DC.Subject field.
For example, if you have a project called "Widgets" which involves a
partner company "ACME" and a few people like "John Doe" and "Brigit
Smyth", then when you find "ACME", "John Doe" or "Brigit" in your
messages, you might just stick "Widgets Project" in the DC.Subject
field.
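
The "Widgets" example might be sketched like this in Python. It is only
an illustration of the word-list idea, not how the "Heart" project
actually does it; the table contents and the function name are
hypothetical.

```python
import re

# Hypothetical vocabulary: watched term -> subject code to record.
SUBJECT_TRIGGERS = {
    "widgets": "Widgets Project",
    "acme": "Widgets Project",
    "john doe": "Widgets Project",
    "brigit": "Widgets Project",
}


def extract_subjects(body, triggers=SUBJECT_TRIGGERS):
    """Return the sorted subject codes whose trigger terms occur in body."""
    found = set()
    for term, code in triggers.items():
        # \b keeps "acme" from matching inside a longer word.
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, body, re.IGNORECASE):
            found.add(code)
    return sorted(found)
```

The returned list is what would go into DC.Subject. Keeping the terms in
a table rather than in the code means the people who know the projects
can maintain it.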
DC.Coverage could also be automagically populated in this way. For
example, if the message mentions "Pharaohs", you might add in coverage
elements for the region of Egypt and the date span of the Ancient
Egyptian civilisation. Of course, you'd probably want to be more careful
about that kind of thing, lest you end up assigning DC.Coverage the
appropriate "geographic region" value for the Sahara desert every time
someone mentions Camel cigarettes.
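
One possible way of being "more careful" is to assign a coverage value
only when at least two different hint words turn up in the same message.
That rule, the word lists and the names below are all my own assumptions
for the sake of a sketch, and the plural handling is deliberately crude.

```python
import re

# Hypothetical rules: coverage value -> words that hint at it.
COVERAGE_RULES = {
    "Egypt; Ancient Egyptian civilisation": {
        "pharaoh", "pyramid", "nile", "hieroglyph",
    },
    "Sahara desert": {"sahara", "dune", "camel", "oasis"},
}


def extract_coverage(body, rules=COVERAGE_RULES, min_hits=2):
    """Return coverage values backed by at least min_hits distinct hint words."""
    words = set(re.findall(r"[a-z]+", body.lower()))
    # Crude plural folding so that "pharaohs" also counts as "pharaoh".
    words |= {w[:-1] for w in words if w.endswith("s")}
    return sorted(
        value
        for value, hints in rules.items()
        if len(hints & words) >= min_hits
    )
```

With this rule, a lone mention of Camel cigarettes scores one hit for
the Sahara and is ignored, while a message about pharaohs and pyramids
gets the Egyptian coverage.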
Is that enough food for thought?
Alex
Simon Pockley wrote:
>
> Help and wisdom needed ...
>
> We are currently trying to evolve a system for archiving
> our organisation's important email on a server.