Some folks on this list know that I’m involved in a project to metrically tag all of Latin and Greek verse. I’m some way into this, but I’ve also been working on producing POS-tagged texts for use in my teaching. I’m at a point where I’d like to make sure the two endeavours can work together. And so a lazy decision (surprise) has come back to haunt me: in my verse texts there is a simple hierarchy: line > speaker > word > syllable. This means that when a spoken syllable contains parts of two words (as often with elision in Greek), I tag it as only one word. This isn’t good enough for POS-tagging.

I’m not aware of previous work on this, and I have a solution in mind, but I wonder if someone has in fact dealt with this before. The proposed solution is below, but please bear in mind a couple of things before commenting:

- the main challenge is the conflict between the logic of the spoken unit and the lexical unit. Syllables are spoken things, but words are dictionary things. 
- I’m not aiming for TEI compliance (though I have checked to see if there is a TEI-based solution - did I miss one?). I do, however, want to be sure the results can be easily reformatted as TEI-compliant markup by anyone who cares to do so (especially anyone who might want to use the data in a TEI-compliant database).
- I am aiming for maximal structural/semantic clarity. Broadly speaking, tagged items should rely as little as possible on information to be found in other tagged items. When they do so, those other items should be parents/grandparents etc. (e.g. inheriting is OK; “before” and “after” type tags are not).
- I work with HTML because that’s the medium of publication, but the sensibility is XML.

So here’s where I’m at right now: comments/critique welcome.

<div data-type="line" data-number="23" data-metre="justanexample">
<span data-type="speaker" data-name="Somebody">
<span data-type="locution" data-wordlength="2" data-lemma-1="δέ" data-pos-1="c--------" data-lemma-2="ἔπος" data-pos-2="n-s---nn-">
<span data-type="syllable" data-length="short">
<span data-type="subsyllable" data-word="1">δ’</span>
<span data-type="subsyllable" data-word="2">ἔ</span>
</span>
<span data-type="syllable" data-length="long" data-modification="position">
<span data-type="subsyllable" data-word="2">πος</span>
</span>
</span>
<span data-type="locution" data-wordlength="1" data-lemma-1="τοι" data-pos-1="g--------">
<span data-type="syllable" data-length="short" data-modification="synizesis">
<span data-type="subsyllable" data-word="1">τοι</span>
</span>
</span>
</span>
</div>
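
A minimal sketch, in Python, of how both the lexical view and the spoken view could be read back out of this markup. It assumes the fragment is well-formed XML with straight quotes around every attribute value, and it parses only the first locution from the example. The function names `words` and `syllables` are hypothetical, chosen for illustration; nothing here is part of the proposal itself.

```python
import xml.etree.ElementTree as ET

# First locution from the example above, with straight quotes throughout.
SAMPLE = """
<span data-type="locution" data-wordlength="2" data-lemma-1="δέ" data-pos-1="c--------" data-lemma-2="ἔπος" data-pos-2="n-s---nn-">
<span data-type="syllable" data-length="short">
<span data-type="subsyllable" data-word="1">δ’</span>
<span data-type="subsyllable" data-word="2">ἔ</span>
</span>
<span data-type="syllable" data-length="long" data-modification="position">
<span data-type="subsyllable" data-word="2">πος</span>
</span>
</span>
"""


def words(locution):
    """Lexical view: one (form, lemma, pos) tuple per word in the locution."""
    count = int(locution.get("data-wordlength"))
    forms = {n: "" for n in range(1, count + 1)}
    # Each subsyllable names the word it belongs to, so the surface form
    # of a word is the concatenation of its subsyllables in document order.
    for sub in locution.iter("span"):
        if sub.get("data-type") == "subsyllable":
            forms[int(sub.get("data-word"))] += sub.text or ""
    return [
        (forms[n], locution.get(f"data-lemma-{n}"), locution.get(f"data-pos-{n}"))
        for n in range(1, count + 1)
    ]


def syllables(locution):
    """Spoken view: one (text, length) tuple per syllable, ignoring word boundaries."""
    result = []
    for syl in locution.iter("span"):
        if syl.get("data-type") == "syllable":
            text = "".join(
                child.text or ""
                for child in syl
                if child.get("data-type") == "subsyllable"
            )
            result.append((text, syl.get("data-length")))
    return result


locution = ET.fromstring(SAMPLE.strip())
print(words(locution))
print(syllables(locution))
```

Both functions read only the locution element and its descendants, which fits the aim that tagged items depend on ancestors rather than on siblings.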

p.s. I’ve thought about trying to adapt this to/from text-to-speech xml (e.g. https://console.bluemix.net/docs/services/text-to-speech/SSML.html#ssml), but am trying to learn how to stop taking on unreasonably large projects.

David Chamberlain
Department of Classics
University of Oregon
https://hypotactic.com

