Kevin,
I did this the dumb way, randomly selecting different sets of 3,000 words from the
entire Faerie Queene. My general attitude towards statistically driven stylometry is
that it is a very crude tool. It can pick up simple phenomena with great precision,
but you should always be aware of the simplicity and crudeness of the phenomena.
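
In case it's useful, here is a back-of-the-envelope sketch of the dumb way in
Python. I'm taking "lexical density" to mean the type-token ratio (distinct word
forms over total tokens) and reading from a hypothetical plain-text file;
WordHoard itself works from lemmatized and tagged texts, so this approximates the
procedure, not the tool.

    import random
    import re

    def tokens(text):
        # crude tokenizer: lowercased alphabetic runs (an approximation;
        # WordHoard counts lemmata, not raw word forms)
        return re.findall(r"[a-z']+", text.lower())

    def lexical_density(words):
        # taken here as the type-token ratio: distinct forms / total tokens
        return len(set(words)) / len(words)

    words = tokens(open("faerie_queene.txt").read())  # hypothetical file name
    for _ in range(3):
        sample = random.sample(words, 3000)  # one 3,000-word random set
        print(round(lexical_density(sample), 3))
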
A subtler way of doing it would be to use a sliding window of contiguous
1,000-word stretches and measure the lexical density at 250-word intervals.
Represent the
results as a line graph with peaks and valleys. If it's a random walk, you've
learned something. One would of course be happier to find that the peaks and
valleys "make sense" to an expert Spenserian.
The most useful statistical tool in WordHoard, by the way, is the "Compare many
words" feature, which uses a statistical test known as the G- test or log likelihood
ratio. It compares two word lists and determines which words in one set are overused
or underused with regard to the rest of the corpus.
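
The statistic is easy enough to reproduce outside WordHoard. Here is a sketch of
the usual log-likelihood calculation for a single word, in the form given by
Rayson and Garside; whether WordHoard's implementation matches it in every detail
I cannot say, and the counts at the end are invented:

    import math

    def g_statistic(count_a, total_a, count_b, total_b):
        # G for one word: its count and the total word count in each of
        # two corpora, tested against a common underlying rate
        rate = (count_a + count_b) / (total_a + total_b)
        g = 0.0
        for observed, expected in ((count_a, total_a * rate),
                                   (count_b, total_b * rate)):
            if observed > 0:
                g += observed * math.log(observed / expected)
        return 2 * g

    # overuse or underuse is read off by comparing count_a / total_a with
    # the combined rate; the size of G says how surprising the gap is
    print(g_statistic(190, 36000, 2400, 240000))
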
If you play this game with the Faerie Queene, comparing each book separately with
the entire work, by far the most striking result is the relative underuse of 'which'
in the first book of the Faerie Queene. The figures per 10,000 words in the seven
books (the seventh being the Mutabilitie Cantoes) are, respectively, 53, 83, 80,
108, 103, 113, 92.
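
(The normalization, for the record, is just 10,000 * count / total words; with an
invented count for Book I:)

    count, total = 190, 36000            # hypothetical 'which' count, Book I
    print(round(10000 * count / total))  # -> 53 per 10,000 words
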
I have no good explanation except perhaps that Spenser went "which" hunting in the
first book but gave it up. Or he fell into a habit.
Or this may be a fact without any interest whatever. Though I doubt this.
> Martin,
>
> I have a question about the experimental method you employed. Did the
> sets of 3000 words each consist of 3000 randomly selected words from the
> Faerie Queene and the Shakespeare canon; or were the 3000 word sets
> contiguous segments of words selected from random starting points within
> the Faerie Queene and the Shakespeare canon?
>
> That is, does your lexical density measure capture 3 separate points in
> time within the authors' literary career (this would be the case if the
> 3000 words were contiguous)? Or are you measuring lexical density over
> their life output (this would be the case if you analyzed 3 sets of 3000
> randomly selected words within the canons you're comparing)?
>
> My guess is that you randomly selected 3 sets of 3000 contiguous words.
> I'd think it would be difficult to interpret the alternative.
>
> Kevin
>
> --------------------------
>
> Martin Mueller wrote:
>> Using WordHoard data, I selected three 3,000 word sets drawn at random from
>> respectively the Faerie Queene and the Shakespeare canon.
>>
>