Thank you for this message, Aurélien.
For me the bright focus is here:
"It convinced me that I couldn't only rely on previous, hardly documented lists, but had to introduce a statistical approach."
That is, even an apparently obscure and low-level activity like defining stopwords becomes a genuine piece of innovative study of the texts.
This approach, I think, implies somehow that the stopwords are (must be) dynamically related to the text(s) you study. Am I right, or do I go too far?
best
maurizio

Il 24/01/18 07:00, Aurélien Berra ha scritto:
Since my first message about Greek and Latin stopwords, I have redesigned my lists, rebasing them on corpus analyses and series of tests. The new versions have recently been implemented in Voyant Tools, thanks to Stéfan Sinclair. My goal has been to propose rather extensive, and therefore aggressive, lists, making it possible to reveal salient lexical items quickly. I have documented the process (motivation, corpus and code, allographs, Unicode Greek issues):

* Rationale and history: https://github.com/aurelberra/stopwords/blob/master/rationale.md
* Revision notes: https://github.com/aurelberra/stopwords/blob/master/revision_notes.md
* Voyant Tools GitHub issue: https://github.com/sgsinclair/Voyant/issues/382

I take this occasion to thank Peter Heslin for his reply here (not to mention the invaluable resources offered in his Diogenes software). It convinced me that I couldn't only rely on previous, hardly documented lists, but had to introduce a statistical approach. However, in the context of Voyant Tools, I needed static lists and had no access to lemmatisation, so my needs were quite different from those of the search engine in a digital library or the custom, on-the-fly stoplists that a toolkit like the CLTK will make available.
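The corpus-driven approach described above can be sketched in a few lines. This is a minimal illustration, not Aurélien's actual method or data: it ranks tokens by document frequency and flags the most ubiquitous ones as stopword candidates, from which a static list could then be curated by hand. The tiny corpus and the threshold are invented for the example.

```python
# Sketch of a statistical stoplist: tokens that occur in a large share
# of the documents in a corpus are candidates for a stopword list.
# Corpus and threshold are illustrative only.
from collections import Counter

def stopword_candidates(documents, min_doc_ratio=0.8):
    """Return tokens appearing in at least min_doc_ratio of documents."""
    doc_freq = Counter()
    for doc in documents:
        # Count each token once per document (document frequency).
        doc_freq.update(set(doc.lower().split()))
    threshold = min_doc_ratio * len(documents)
    return sorted(t for t, n in doc_freq.items() if n >= threshold)

corpus = [
    "et in arcadia ego",
    "et tu brute",
    "in vino veritas et in aqua sanitas",
]
print(stopword_candidates(corpus, min_doc_ratio=0.66))  # → ['et', 'in']
```

A real pipeline would of course work on a much larger corpus and combine such frequency evidence with manual review, as the revision notes linked above describe.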

I will go on testing the lists on various corpora – I see, for example, that some dialectal forms from Herodotus should be added.

What is the best way to update the Digital Classicist wiki page now? The short lists recorded there are only slightly expanded versions of the stoplists used in Perseus. A few quirks should be corrected. Maybe I could just clean up the page and add some links?

I would be very grateful if anyone had feedback!

Best wishes,

Aurélien


On 16 Oct 2017, at 11:22, Peter Heslin <[log in to unmask]> wrote:

Dear Aurélien,

The resources you have pointed to are a good starting point.  But these stop-word lists presume that you are processing unlemmatized Latin, which I personally find to be an approach of limited interest. If you are generating usage statistics on lemmatized Latin, you obviously need to add common words that appear in many inflected forms.  The lemmata I have found necessary to add to the public lists you mention are these:

sum, possum, facio, do, dico, video, fero, meus, tuus, suus, res, ille, hic, ipse, qui, quis, venio, habeo, omnis, voco, inquam

I generated that list when looking at frequencies in a small subset of Latin epic, so YMMV.

Best,

Peter

On 14 October 2017 at 15:31, Aurélien Berra <[log in to unmask]> wrote:
Dear all,

When I became interested in stopwords a few years ago, I used and updated the lists on the Digital Classicist wiki page. I am now trying to suggest reasonable lists to be implemented in Voyant Tools. About a week ago, I opened an issue to "Add default stopwords for Greek and Latin". In the process I compared available lists (Perseus, CLTK and others) and tried to grasp on what principles such a non-specialised list should be based, although I am aware this is part of a broader discussion about the flexible, iterative use of stopwords in research.

The discussion can be found here:
https://github.com/sgsinclair/Voyant/issues/382
https://github.com/aurelberra/stopwords/blob/master/elements_for_discussion.md

I would be grateful for comments and advice.

Best wishes,

Aurélien



-- 

Christmas 2017 - Jesus, a son of refugees in flight