Hi Gregg,
maybe I'm a little late, but since our own project has to deal with
similar problems, I'd like to outline our workflow. I once created
a poster which shows a generalised approach to the matter, but it's
still only available in German. (If someone wants to have a look:
http://pdr.bbaw.de/workshop/poster/datenarchaeologie).
The main problem is the implicit nature of "typographical markup" in
most Word files, and their proprietary format. Our first step is to
export strict XHTML (another XML format would be usable too), which of
course contains all the data (text) with all typographical information
in a machine- (and human-)readable format. Nothing new so far.
The second step is to create a hopefully small program in Perl or Java
(or another programming language with very good handling of regular
expressions). Such a script serves three purposes.
The first is to read and interpret the XHTML files and to write the
explicit data to the resulting XML files (EpiDoc, a custom XML format,
whatever). This is much the same approach other replies to your question
suggested, but a custom-made program has some advantages. :)
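To make the first purpose concrete, here is a minimal sketch in Python (one of the regex-friendly languages meant above). The tag mapping is invented for illustration; a real project would derive the rules from the editors' actual typographic conventions:

```python
import re

# Illustrative sketch: turn implicit typographic markup from an XHTML
# export into explicit XML. The mapping below (italics/bold -> TEI-style
# <hi rend="...">) is an assumed example, not a fixed standard.
RULES = [
    (re.compile(r"<i>(.*?)</i>"), r'<hi rend="italic">\1</hi>'),
    (re.compile(r"<b>(.*?)</b>"), r'<hi rend="bold">\1</hi>'),
]

def make_explicit(xhtml_fragment):
    # Apply every rewrite rule in order to one fragment of text.
    for pattern, replacement in RULES:
        xhtml_fragment = pattern.sub(replacement, xhtml_fragment)
    return xhtml_fragment

print(make_explicit("<p>The term <i>stela</i> appears twice.</p>"))
# <p>The term <hi rend="italic">stela</hi> appears twice.</p>
```

In practice the rule table grows as you discover what each typographic feature actually meant to the editors.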
The second purpose of such a program is to handle special situations,
which are unavoidable (they result from data collected manually by
various editors over years or decades) and can be solved by small
modifications and additions to the script.
If available, the creators and editors of the Word files can give
valuable hints here.
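Such special cases often cost only a line or two. A hypothetical example: suppose one editor marked uncertain readings with a trailing "(?)" rather than any formatting. A small additional rule makes that convention explicit:

```python
import re

# Hypothetical special case: an editor flagged uncertain readings with
# a trailing "(?)". Wrap the affected word in TEI's <unclear> element.
uncertain = re.compile(r"(\w+)\s*\(\?\)")

def mark_uncertain(text):
    return uncertain.sub(r"<unclear>\1</unclear>", text)

print(mark_uncertain("the name Ptolemaios (?) is damaged"))
# the name <unclear>Ptolemaios</unclear> is damaged
```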
The third purpose of the script depends on the availability of
appropriate resources. When you transform data from one format to
another (which is in most cases a quite complex task), a little more
effort can be spent on enriching the resulting documents. It's simply
the right point in time to do so. :)
For instance, we developed some web services which are able to
recognize place names and dates in our texts. Now we use them to
include appropriate markup (e.g. dates in ISO 8601 format) as part of
every transformation process.
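As a toy version of that enrichment step, a single regex can recognise German-style dates (DD.MM.YYYY) and emit TEI-style markup with an ISO 8601 value; our real pipeline calls web services instead, but the principle is the same:

```python
import re

# Sketch of the enrichment step: recognise DD.MM.YYYY dates and wrap
# them in a TEI-style <date> element whose when attribute carries the
# ISO 8601 form (YYYY-MM-DD). No validation of day/month ranges here.
date_pattern = re.compile(r"\b(\d{2})\.(\d{2})\.(\d{4})\b")

def tag_dates(text):
    return date_pattern.sub(
        r'<date when="\3-\2-\1">\1.\2.\3</date>', text)

print(tag_dates("written on 16.07.2011 in Berlin"))
# written on <date when="2011-07-16">16.07.2011</date> in Berlin
```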
I hope this doesn't contribute to any confusion. Of course, resources
like web services may be queried from Oxygen, too. Two questions have
to be answered before going our way:
How many documents have to be transformed? -- The more the better. OK,
you've got only one file, but that doesn't matter; it's probably quite
large. ;)
Are there good reasons and accessible resources for the mentioned third
purpose? -- I'd be surprised if not.
Best Regards,
Fabian.
On 16.07.2011 19:01, Schwendner, Gregg wrote:
> I want to convert some data I have compiled in an MS Word file into a database that can ultimately be turned into a file compatible with EpiDoc. I presume this means something easily written in / easily converted to XML.
> Does this mean FileMaker Pro exclusively? Would I be better off putting it in EpiDoc directly (apart from the problem of lack of access to EpiDoc training seminars here)?
>
> Thanks,
> G W Schwendner
--
Berlin-Brandenburgische Akademie der Wissenschaften
DFG-Projekt "Personendatenrepositorium"
Fabian Körner
Jägerstrasse 22/23
10117 Berlin
http://www.bbaw.de
eMail [log in to unmask]
phone +49 (0)30 20370 285
http://wiki.digitalclassicist.org/User:FabianKoerner