Dear Neven (and dear all),
I followed your advice, cleaned up the repository and made it citable by publishing it as a dataset on Zenodo, which gives it a DOI.
https://github.com/aurelberra/stopwords
Thanks also to Matteo (Romanello), who suggested I should talk to Marco Passarotti.
Although the initial goal of this tiny project has been reached (there are now open and documented stoplists for Ancient Greek and Latin in Voyant Tools), I'm still interested in feedback and in any kind of discussion on the topic.
Best wishes,
Aurélien
> On 24 Jan 2018, at 14:51, Neven Jovanovic wrote:
>
> Dear Aurélien,
>
> I, for one, much appreciate your stopwords initiative; in my opinion, it
> opens new possibilities of research and exploration for Greek and Latin.
>
> Regarding the Digital Classicist wiki, I would suggest you publish there a
> link to the GitHub repo. You're probably familiar with Zenodo and OSF -- a
> project in one of these scholarly frameworks would be a good way to make
> your work more citable and preservable (and GitHub repositories can be
> integrated into both: <http://help.osf.io/#github>,
> <https://guides.github.com/activities/citable-code/>).
>
> Best,
>
> Neven
>
> Neven Jovanovic
> Department of Classical Philology
> Faculty of Humanities and Social Sciences
> University of Zagreb
> Croatia
>
>
>>
>> Dear all,
>>
>> Since my first message about Greek and Latin stopwords, I have redesigned
>> my lists, rebasing them on corpus analyses and series of tests. The new
>> versions have recently been implemented in Voyant Tools, thanks to Stéfan
>> Sinclair. My goal has been to propose rather extensive, and therefore
>> aggressive, lists, making it possible to quickly reveal salient lexical
>> items. I have documented the process (motivation, corpus and code,
>> allographs, Unicode Greek issues):
>>
>> * Rationale and history:
>> https://github.com/aurelberra/stopwords/blob/master/rationale.md
>> * Revision notes:
>> https://github.com/aurelberra/stopwords/blob/master/revision_notes.md
>> * Voyant Tools GitHub issue:
>> https://github.com/sgsinclair/Voyant/issues/382
>>
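A note on the "Unicode Greek issues" mentioned above: one common problem with static Greek stoplists, assumed here to be among those meant, is that the same accented letter can be encoded with either a tonos or an oxia code point, so a token may fail to match a visually identical list entry. The following is only a minimal sketch of one way to guard against this; the function names are hypothetical and are not taken from the repository or from Voyant Tools.

```python
import unicodedata


def normalize_greek(token: str) -> str:
    """Return an NFC-normalized, lowercased form of a token.

    NFC maps the Greek Extended "oxia" letters (e.g. U+1F71) to their
    "tonos" equivalents (e.g. U+03AC), so both encodings compare equal.
    """
    return unicodedata.normalize("NFC", token).lower()


def filter_tokens(tokens, stoplist):
    """Drop tokens whose normalized form appears in the stoplist.

    Hypothetical helper: the stoplist is normalized the same way as the
    tokens, so the encoding used in the list file does not matter.
    """
    stop = {normalize_greek(word) for word in stoplist}
    return [t for t in tokens if normalize_greek(t) not in stop]


# "γάρ" typed with an oxia alpha (U+1F71) vs. a tonos alpha (U+03AC)
oxia_form = "\u03b3\u1f71\u03c1"
tonos_form = "\u03b3\u03ac\u03c1"
remaining = filter_tokens([oxia_form, "\u03bb\u03cc\u03b3\u03bf\u03c2"], [tonos_form])
```

Normalizing both the list and the text at load time keeps the stoplist itself static, which matters in a setting where no lemmatisation is available.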
>> I take this occasion to thank Peter Heslin for his reply here (not to
>> mention the invaluable resources offered in his Diogenes software). It
>> convinced me that I couldn't rely only on previous, poorly documented
>> lists, but had to introduce a statistical approach. However, in the
>> context of Voyant Tools, I needed static lists and had no access to
>> lemmatisation, so my needs were quite different from those of the search
>> engine in a digital library or the custom, on-the-fly stoplists that a
>> toolkit like the CLTK will make available.
>>
>> I will go on testing the lists on various corpora; I see, for example,
>> that some dialectal forms from Herodotus should be added.
>>
>> What is the best way to update the Digital Classicist wiki page now? The
>> short lists recorded there are only slightly expanded versions of the
>> stoplists used in Perseus. A few quirks should be corrected. Maybe I could
>> just clean up the page and add some links?
>>
>> I would be very grateful if anyone had feedback!
>>
>> Best wishes,
>>
>> Aurélien
>>
>>
>>> On 16 Oct 2017, at 11:22, Peter Heslin wrote:
>>>
>>> Dear Aurélien,
>>>
>>> The resources you have pointed to are a good starting point. But these
>>> stop-word lists presume that you are processing unlemmatized Latin,
>>> which I personally find to be an approach of limited interest. If you
>>> are generating usage statistics on lemmatized Latin, you obviously need
>>> to add common words that appear in many inflected forms. The lemmata I
>>> have found necessary to add to the public lists you mention are these:
>>>
>>> sum, possum, facio, do, dico, video, fero, meus, tuus, suus, res,
>>> ille, hic, ipse, qui, quis, venio, habeo, omnis, voco, inquam
>>>
>>> I generated that list when looking at frequencies in a small subset of
>>> Latin epic, so YMMV.
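As a rough illustration of the point above about lemmatized Latin, the lemmata listed can be treated as an extra stop set applied after lemmatisation, on top of whatever form-based list is already in use. This is only a sketch under stated assumptions: the input is assumed to be an already lemmatized sequence of strings, and the names `EXTRA_LEMMATA` and `lemma_frequencies` are hypothetical, not part of Diogenes, the CLTK, or Voyant Tools.

```python
from collections import Counter

# The lemmata quoted in the message above (duplicates collapsed by the set).
EXTRA_LEMMATA = {
    "sum", "possum", "facio", "do", "dico", "video", "fero",
    "meus", "tuus", "suus", "res", "ille", "hic", "ipse",
    "qui", "quis", "venio", "habeo", "omnis", "voco", "inquam",
}


def lemma_frequencies(lemmata, base_stoplist=frozenset()):
    """Count lemmata, skipping the base stoplist and the extra lemmata.

    `lemmata` is assumed to be an iterable of lemma strings produced by
    some lemmatizer; `base_stoplist` stands for any existing public list.
    """
    stop = {w.lower() for w in base_stoplist} | EXTRA_LEMMATA
    return Counter(l.lower() for l in lemmata if l.lower() not in stop)


# Usage with a toy lemmatized sequence (illustrative data only)
counts = lemma_frequencies(
    ["arma", "vir", "et", "cano", "qui", "sum", "arma"],
    base_stoplist={"et"},
)
```

Because the list was derived from a small subset of Latin epic, the extra set is best regarded as a starting point to be adjusted per corpus.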
>>>
>>> Best,
>>>
>>> Peter
>>>
>>> On 14 October 2017 at 15:31, Aurélien Berra wrote:
>>> Dear all,
>>>
>>> When I became interested in stopwords a few years ago, I used and
>>> updated the lists on the Digital Classicist wiki page. I am now trying
>>> to suggest reasonable lists to be implemented in Voyant Tools. About a
>>> week ago, I opened an issue to "Add default stopwords for Greek and
>>> Latin". In the process I compared available lists (Perseus, CLTK and
>>> others) and tried to grasp on what principles such a non-specialised
>>> list should be based, although I am aware this is part of a broader
>>> discussion about the flexible, iterative use of stopwords in research.
>>>
>>> The discussion can be found there:
>>> https://github.com/sgsinclair/Voyant/issues/382
>>> https://github.com/aurelberra/stopwords/blob/master/elements_for_discussion.md
>>>
>>> I would be grateful for comments and advice.
>>>
>>> Best wishes,
>>>
>>> Aurélien
>>>
>>
>>