The Faerie Queene has 8,659 distinct lemmata. Spenser as a whole has 11,091. So
about 2,400 lemmata occur somewhere in the ~100,000 words of Spenser outside the
Faerie Queene but never in the Faerie Queene itself.
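
For anyone who wants to check the subtraction, here is a minimal sketch using the WordHoard counts quoted further down in this message. The variable names are mine. It assumes that every Faerie Queene lemma is also counted among Spenser's lemmata, which should hold since the poem is part of the corpus.

```python
# Counts as reported by WordHoard in the message quoted below.
fq_lemmata, fq_words = 8_659, 277_046             # Faerie Queene
spenser_lemmata, spenser_words = 11_091, 375_829  # Spenser (prose not included)

# Lemmata that appear in Spenser but never in the Faerie Queene.
extra_lemmata = spenser_lemmata - fq_lemmata
# Running words of Spenser that lie outside the Faerie Queene.
extra_words = spenser_words - fq_words

print(extra_lemmata)  # 2432
print(extra_words)    # 98783
```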
A word about lemmatization in WordHoard. Lemmatization is an art rather than a
science, and there are choices to be made. When in doubt, WordHoard is a 'lumper'
rather than a 'splitter'. The word forms in Chaucer, Spenser, and Shakespeare are
mapped, as much as possible, to forms that cut across time and genre, and make
cross-author comparisons possible.
The most unconventional of such lumping is the treatment of 'un-' words, which
have been defined as negative forms of positive lemmata. What is the common lemma
shared by 'unforgiven' and 'unforgiving'? WordHoard assigns these forms to 'forgive'
and flags them with '-u' in the morphological analysis. This reduces the lemma
count. It also lets you get at 'un-words' as a distinct category. Shakespeare is
uncommonly fond of these forms. If you treat 'unaccommodated' in King Lear as a
form of 'accommodate' and trace the lemma across its positive and negative forms,
you see interesting connections.
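
To make the lumping concrete, here is a toy sketch of the idea. It is not WordHoard's code or data: the little lemma table and the function name are hypothetical, and only the '-u' flag comes from the description above.

```python
# Toy illustration only: this table is a hypothetical stand-in for a real
# lemmatizer's lexicon, not WordHoard's actual data.
KNOWN_LEMMATA = {
    "forgiven": "forgive",
    "forgiving": "forgive",
    "accommodated": "accommodate",
}

def lump(word_form):
    """Return (lemma, flags), assigning 'un-' forms to the positive lemma."""
    form = word_form.lower()
    # Strip 'un' only when the remainder is a form we already know,
    # so that words like 'under' are left alone.
    if form.startswith("un") and form[2:] in KNOWN_LEMMATA:
        return KNOWN_LEMMATA[form[2:]], ["-u"]
    return KNOWN_LEMMATA.get(form, form), []

print(lump("unforgiven"))   # ('forgive', ['-u'])
print(lump("unforgiving"))  # ('forgive', ['-u'])
print(lump("forgiving"))    # ('forgive', [])
```

Both 'unforgiven' and 'unforgiving' land on one lemma, so the lemma count drops, while the flag still lets you pull out the 'un-' forms as a group.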
Harald Baayen, a Dutch linguist now at the University of Alberta, is the authority
on lemma counting. It's a mathematically tricky business (way beyond my pay grade),
but it's similar to predicting the number of species of, say, butterflies on the
basis of the butterflies you have caught. He has written a whole and very technical book
about it. A shorter discussion occurs in his excellent handbook, Analyzing
Linguistic Data.
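
This is not Baayen's method, just a rough sketch of the point Craig makes in the message quoted below: the ratio of distinct lemmata to total words falls as a text gets longer. The "text" here is synthetic, drawn at random from a made-up vocabulary with Zipf-like frequencies. The vocabulary size, the weights, and the function name are all my assumptions, not Spenser or Shakespeare data.

```python
import random

def lemma_ratio(tokens):
    """Distinct items divided by total items (the ratio discussed in this thread)."""
    return len(set(tokens)) / len(tokens)

# Synthetic "text": 20,000 draws from a 5,000-item vocabulary in which the
# item of rank r has weight 1/r. An assumption for illustration only.
rng = random.Random(0)  # fixed seed so the run is repeatable
vocab = list(range(1, 5001))
weights = [1 / rank for rank in vocab]
text = rng.choices(vocab, weights=weights, k=20_000)

half = len(text) // 2
whole_ratio = lemma_ratio(text)
first_half_ratio = lemma_ratio(text[:half])
second_half_ratio = lemma_ratio(text[half:])

print(whole_ratio, first_half_ratio, second_half_ratio)
```

Each half comes out "denser" than the whole, because the common items turn up in both halves and are counted only once in the full text. That is why comparing the raw ratio across texts of different length is misleading.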
> On Fri, Aug 20, 2010 at 3:03 AM, Craig A. Berry <[log in to unmask]> wrote:
>> I think it's just that these are longish texts, and as I mentioned in an
>> earlier post, the longer the text, the lower this particular ratio. If you
>> take any of these texts, chop it in half, and calculate the density for each
>> piece, I think you'll find that each half is much "denser" than the whole.
>>
>> Which makes me less comfortable with this whole approach to measuring
>> density the more I think about it. If I were any good at statistics I might
>> know how to normalize these numbers so texts of different length could be
>> compared, but as it is I think comparing texts (or chunks thereof) having
>> equal length is the only way to go.
>
> I see this in the numbers, but I'm still trying to make sense of it.
> Right now on my other monitor I have a pair of Wordhoard windows up:
>
> 17,605 distinct lemmata (865,184 total occurrences) in "Shakespeare"
> 11,091 distinct lemmata (375,829 total occurrences) in "Spenser."
>
> The first number is the total lexicon of each author; the second
> number is the total word count (not including Spenser's prose). The
> ratio here of distinct lemmata to total words is .02 for Shakespeare,
> .03 for Spenser. Martin Mueller's analysis of 3000-word segments,
> described above, shows the same thing.
>
> Is it Shakespeare's vocabulary that enables him to write 865 thousand
> words, or would Spenser have gotten up to 17,605 distinct lemmata if
> he'd just kept writing? Both of those answers seem wrong.
>
> Incidentally, the numbers that are quoted for Shakespeare's vocabulary
> vary widely and are not to be trusted. This morning I saw, in respectable
> venues, 24,000 (The Independent) and 60,000 (Answers.com). The real
> total, as calculated by Alfred Hart and Wordhoard, respectively, is
> between 17,480 and 17,605.
>
> Back to making sense: does Spenser really have as rich a vocabulary as
> Shakespeare? If not, is there a different number that captures this?
> Again, as I said at the beginning of this thread, I'm not equating
> "rich vocabulary" with "better poet"; see examples from Yeats.
>
>> A lemmatized FQ has about 12,000 lemmata.
>
> What am I doing wrong, Craig? When I tell Wordhoard to do a "Word Form
> Analysis" of FQ, I get "8,659 distinct lemmata (277,046 total
> occurrences) in 'Faerie Queene.'"
>
> And in case anyone thinks that Wordhoard is boring, check out what the
> top five words are in Shakespeare and Spenser:
>
>        Shakespeare   Spenser
>    1.  be            and  [also the top word in Chaucer]
>    2.  I             the
>    3.  the           to
>    4.  and           be
>    5.  to            of
>
> Notice the prominence of "I" in Shakespeare. (It's #20 in Spenser.)
> And it's not just because Shakespeare is writing drama either. In
> "Venus and Adonis," "I" is still the tenth most common word; in
> "Lucrece," it's #14. Even in narrative, Shakespeare seems to think in
> the first person. That's hardly surprising, and there are lots of
> authors who do the same. What's interesting to me is that Spenser
> doesn't.
>
> --
> Dr. David Wilson-Okamura http://virgil.org [log in to unmask]
> English Department Virgil reception, discussion, documents, &c
> East Carolina University Sparsa et neglecta coegi. -- Claude Fauchet
>