Dear Aurélien,
I, for one, much appreciate your stopwords initiative; in my opinion, it
opens new possibilities for research and exploration in Greek and Latin.
Regarding the Digital Classicist wiki, I would suggest publishing a link
to the GitHub repo there. You're probably familiar with Zenodo and OSF -- a
project in one of these scholarly frameworks would be a good way to make
your work more citable and preservable (and GitHub repositories can be
integrated into both: <http://help.osf.io/#github>,
<https://guides.github.com/activities/citable-code/>).
Best,
Neven
Neven Jovanovic
Department of Classical Philology
Faculty of Humanities and Social Sciences
University of Zagreb
Croatia
>
> Dear all,
>
> Since my first message about Greek and Latin stopwords, I have redesigned
> my lists, rebasing them on corpus analyses and series of tests. The new
> versions have recently been implemented in Voyant Tools, thanks to Stéfan
> Sinclair. My goal has been to propose rather extensive, and therefore
> aggressive lists, making it possible to quickly reveal salient lexical
> items. I have documented the process (motivation, corpus and code,
> allographs, Unicode Greek issues):
>
> * Rationale and history:
> https://github.com/aurelberra/stopwords/blob/master/rationale.md
> * Revision notes:
> https://github.com/aurelberra/stopwords/blob/master/revision_notes.md
> * Voyant Tools GitHub issue:
> https://github.com/sgsinclair/Voyant/issues/382
>
> I take this occasion to thank Peter Heslin for his reply here (not to
> mention the invaluable resources offered in his Diogenes software). It
> convinced me that I could not rely only on previous, poorly documented
> lists, but had to introduce a statistical approach. However, in the
> context of Voyant Tools, I needed static lists and had no access to
> lemmatisation, so my needs were quite different from those of the search
> engine in a digital library or the custom, on-the-fly stoplists that a
> toolkit like the CLTK will make available.
>
> I will go on testing the lists on various corpora: I see, for example,
> that some dialectal forms from Herodotus should be added.
>
> What is the best way to update the Digital Classicist wiki page now? The
> short lists recorded there are only slightly expanded versions of the
> stoplists used in Perseus. A few quirks should be corrected. Maybe I could
> just clean up the page and add some links?
>
> I would be very grateful if anyone had feedback!
>
> Best wishes,
>
> Aurélien
>
>
>> On 16 Oct 2017, at 11:22, Peter Heslin <[log in to unmask]> wrote:
>>
>> Dear Aurélien,
>>
>> The resources you have pointed to are a good starting point. But these
>> stop-word lists presume that you are processing unlemmatized Latin,
>> which I personally find to be an approach of limited interest. If you
>> are generating usage statistics on lemmatized Latin, you obviously need
>> to add common words that appear in many inflected forms. The lemmata I
>> have found necessary to add to the public lists you mention are these:
>>
>> sum, possum, facio, do, dico, video, fero, meus, tuus, suus, res,
>> ille, hic, ipse, qui, quis, venio, habeo, omnis, voco, inquam
>>
>> I generated that list when looking at frequencies in a small subset of
>> Latin epic, so YMMV.
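The workflow Peter describes, extending a base stoplist with common lemmata and then filtering lemmatised tokens, could be sketched roughly as follows. The base stoplist and the sample tokens here are purely illustrative, not drawn from any published list; only the added lemmata come from the message above.

```python
# Lemmata Peter found necessary to add when working with lemmatised Latin
EXTRA_LEMMATA = {
    "sum", "possum", "facio", "do", "dico", "video", "fero",
    "meus", "tuus", "suus", "res", "ille", "hic", "ipse",
    "qui", "quis", "venio", "habeo", "omnis", "voco", "inquam",
}

# A hypothetical base stoplist of common function words (illustrative only)
BASE_STOPLIST = {"et", "in", "non", "ad", "cum", "ut"}

# Combined stoplist for lemmatised text
STOPLIST = BASE_STOPLIST | EXTRA_LEMMATA

def content_lemmata(lemmata):
    """Keep only lemmata that do not appear in the combined stoplist."""
    return [lem for lem in lemmata if lem.lower() not in STOPLIST]

# Example: lemmatised tokens, mixing content words and stopword lemmata
tokens = ["arma", "vir", "cano", "et", "sum", "qui", "Troia"]
print(content_lemmata(tokens))  # ['arma', 'vir', 'cano', 'Troia']
```

The point of the sketch is simply that, once the text is lemmatised, a single lemma such as "sum" removes all of its inflected forms at once, which is why the lemma list above is so much shorter than an unlemmatised stoplist would need to be.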
>>
>> Best,
>>
>> Peter
>>
>> On 14 October 2017 at 15:31, Aurélien Berra <[log in to unmask]> wrote:
>> Dear all,
>>
>> When I became interested in stopwords a few years ago, I used and
>> updated the lists on the Digital Classicist wiki page. I am now trying
>> to suggest reasonable lists to be implemented in Voyant Tools. About a
>> week ago, I opened an issue to "Add default stopwords for Greek and
>> Latin". In the process I compared available lists (Perseus, CLTK and
>> others) and tried to grasp the principles on which such a non-specialised
>> list should be based, although I am aware this is part of a broader
>> discussion about the flexible, iterative use of stopwords in research.
>>
>> The discussion can be found here:
>> https://github.com/sgsinclair/Voyant/issues/382
>> https://github.com/aurelberra/stopwords/blob/master/elements_for_discussion.md
>>
>> I would be grateful for comments and advice.
>>
>> Best wishes,
>>
>> Aurélien
>>
>
>