Dear list members,
At the beginning of this year we released Trismegistos Words (https://www.trismegistos.org/words), an online interface allowing you to browse a fully automatically annotated version of the Greek documentary papyrus corpus (ca. 4.5 million words) for morphology and lemmata. I have included the original announcement (by Mark Depauw on PAPY) below. The data used for this project (i.e. the XML files of the annotated texts) have now also been made publicly available at https://github.com/alekkeersmaekers/duke-nlp.
Dear list members,
Trismegistos is pleased to announce a new database: TM Words (www.trismegistos.org/words). It covers the just under 5 million words in the Duke Database of Documentary Papyri. The new database is the result of work by Alek Keersmaekers, who started from the XML version available on GitHub on 19 September 2016. He used a stochastic machine-learning approach for tokenisation, part-of-speech tagging and lemmatisation [I had to look all of these up too ;-)]. The accuracy is about 95%, which seems high, but also means that there are still about 250,000 errors of morphological interpretation in the database, some of which are very obvious to humans. We would be very grateful if you would report any errors you notice by giving us a ‘thumbs-down’, i.e. by clicking the 👎 icon that appears after each attestation. On the basis of that feedback we can improve the database further.
We have made the online version as user-friendly as possible, with many possibilities for filtering and automated weighed-dates charts. This is obviously very demanding on our server, and we hope that the system won’t crash as a result. In any case, for some large datasets (very common words), you may need to wait half a minute or more.
A special feature is the ability to search for attestations of words in specific genres of texts. This is only possible through cooperation with Joanne Stolk, who has undertaken a rough classification of the texts alongside her work on TM Text Irregularities.
Finally: all of this is only possible thanks to the existence of the DDbDP and papyri.info. In the future we hope to work together with them to share all information and make the lemmatisation available there as well. This will be a non-trivial matter, because of the dynamic nature of the text in the DDbDP. Nevertheless it is an urgently needed effort to prevent the creation of multiple versions of the same text. For that reason we will share all corrections as much as possible, and new readings should of course continue to be entered through the Papyrological Editor.