Harry, Aurélien et al.,
I see stopwords as a research tool, not necessarily a resource-saving fix.
For one of my papers involving corpus exploration, the reviewers wanted
to know whether applying certain procedures would affect the final
findings. The ability to plug in (or pull out) a well-documented (and
critically evaluated) stopword list makes for an interesting experiment,
for students and researchers alike.
Not to mention that a line of research can focus precisely on the words
in the stopword list.
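
To make the plug-in/plug-out experiment concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration only: the toy Latin snippets, the function names, the 0.8 document-share threshold and the one-word static core are all hypothetical, and a real study would need a proper tokenizer and a documented list. It derives a stoplist from the corpus itself (words occurring in most documents, plus a small static core) and compares the top of the frequency list with and without it.

```python
import re
from collections import Counter


def tokenize(text):
    """Naive tokenizer: lowercase the text and split on non-word characters."""
    return re.findall(r"\w+", text.lower())


def top_words(texts, stoplist=frozenset(), n=5):
    """Return the n most frequent tokens, skipping anything in the stoplist."""
    counts = Counter(
        tok for text in texts for tok in tokenize(text) if tok not in stoplist
    )
    return counts.most_common(n)


def corpus_stoplist(texts, min_doc_share=0.8, core=frozenset()):
    """Words found in at least min_doc_share of the documents, plus a static core.

    The threshold and the core are hypothetical knobs, not a recommended setting.
    """
    doc_freq = Counter()
    for text in texts:
        doc_freq.update(set(tokenize(text)))  # count each word once per document
    threshold = min_doc_share * len(texts)
    derived = {word for word, df in doc_freq.items() if df >= threshold}
    return frozenset(core) | derived


# Toy corpus, invented for this example.
texts = [
    "et arma et virum cano",
    "et in arcadia ego",
    "adhuc et in silvis",
]

stoplist = corpus_stoplist(texts, min_doc_share=0.8, core={"in"})

print("stoplist:        ", sorted(stoplist))
print("without stoplist:", top_words(texts, n=3))
print("with stoplist:   ", top_words(texts, stoplist, n=3))
```

Running both calls side by side is the whole experiment: if the findings change once the list is switched on, that is worth reporting along with the list itself.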
Best,
Neven
> You are right about speed and (online) databases, Harry, though I don't
> think that efficiency was the focus here.
>
> We fully agree that the ability to enable, update and disable stoplists is
> essential in a research environment. However, the discussion with Maurizio
> was not about reaching consensus, but rather, to rephrase my suggestion,
> about the implications of making documented stoplists, or sets of possible
> stopwords, openly available. As the wiki page mentions, I guess much
> wheel-reinventing is going on in that domain. Collecting more examples on
> the wiki could be worthwhile to document different aims, methods and
> results.
>
> One interesting case is the Perseus stoplists, which are so valuable and
> often used. How many users notice they contain the erroneous "adhic" for
> "adhuc" and the very infrequent ???? and ???? (I'm still puzzled and
> amused by the fact these words have found their way into other tools)?
> And who knows how they were designed?
>
> I started using Voyant Tools as a teaching tool, since it makes corpus
> exploration so easy, before students are able to make their own frequency
> lists. One of the prominent features of the platform is that it offers a
> wordcloud as soon as the corpus is loaded. Understanding and iteratively
> shaping such visualisations, thanks to the frequency lists and the
> stoplists, is a good introductory exercise.
>
> These are only Saturday thoughts I am sharing. I'd be happy to know what
> digital classicists do when they need stoplists in their research, and not
> only in their day-to-day queries in databases over which they have no
> control at all.
>
> Best wishes,
>
> Aurélien
>
>
> On 26 Jan 2018, at 18:34, harry diakoff <[log in to unmask]> wrote:
>
> Stopword lists are really only justifiable by specific research interests.
> The default should always be no stopwords, with the ability of the user to
> implement any stopword list they wish, as long as it is readily
> discoverable. I'm not sure that there is any point in trying to develop a
> reasonable consensus about stopword lists since research interests will
> vary so greatly and unpredictably. With any modern inverted-index
> full-text database, speed should not be a consideration across all of
> digitized classical literature.
>
>
>
> On Fri, Jan 26, 2018 at 11:57 AM, Aurélien Berra <[log in to unmask]> wrote:
>
>>
>> I'm not sure I see your point here, Maurizio. We probably agree that
>> there is no ideal stoplist. The lists should be corpus-based,
>> implementing a
>> statistical threshold (with or without a shared static core), and
>> iterative, in relation to successive interests. Obviously, in an
>> environment where the user cannot choose or update the stoplist, the
>> default list can be designed in various ways. And techniques like phrase
>> search introduce other approaches.
>>
>> Cari saluti,
>>
>> Aurélien
>>
>>
>> On 26 Jan 2018, at 14:21, maurizio lana <[log in to unmask]> wrote:
>>
>> So my next question arises: can one practically define/identify the
>> set of stopwords for one's own text(s)?
>>
>>
>>
>