On Fri, Aug 20, 2010 at 3:03 AM, Craig A. Berry wrote:
> I think it's just that these are longish texts, and as I mentioned in an
> earlier post, the longer the text, the lower this particular ratio. If you
> take any of these texts, chop it in half, and calculate the density for each
> piece, I think you'll find that each half is much "denser" than the whole.
>
> Which makes me less comfortable with this whole approach to measuring
> density the more I think about it. If I were any good at statistics I might
> know how to normalize these numbers so texts of different length could be
> compared, but as it is I think comparing texts (or chunks thereof) having
> equal length is the only way to go.
I see this in the numbers, but I'm still trying to make sense of it.
Right now on my other monitor I have a pair of Wordhoard windows up:
17,605 distinct lemmata (865,184 total occurrences) in "Shakespeare"
11,091 distinct lemmata (375,829 total occurrences) in "Spenser."
The first number is the total lexicon of each author; the second
number is the total word count (not including Spenser's prose). The
ratio here of distinct lemmata to total words is .02 for Shakespeare,
.03 for Spenser. Martin Mueller's analysis of 3000-word segments,
described above, shows the same thing.
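For anyone who wants to check the arithmetic, here is a minimal Python
sketch that recomputes the two ratios from the Wordhoard counts quoted
above. The names `counts` and `lemma_ratio` are mine, for illustration
only; they have nothing to do with Wordhoard itself.

```python
# Counts copied from the two Wordhoard windows quoted above:
# (distinct lemmata, total occurrences).
counts = {
    "Shakespeare": (17_605, 865_184),
    "Spenser": (11_091, 375_829),
}


def lemma_ratio(distinct, total):
    """Distinct lemmata divided by total word occurrences."""
    return distinct / total


ratios = {author: lemma_ratio(d, t) for author, (d, t) in counts.items()}

for author, ratio in ratios.items():
    # Four decimal places show how close the two figures really are.
    print(f"{author}: {ratio:.4f}")
```

Rounded to two places, these come out to the .02 and .03 given above.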
Is it Shakespeare's vocabulary that enables him to write 865 thousand
words, or would Spenser have gotten up to 17,605 distinct lemmata if
he'd just kept writing? Both of those answers seem wrong.
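Craig's point about length is easy to see on a toy example. The sketch
below is only an illustration under loud assumptions: the "text" is a
made-up sentence, it counts raw word forms rather than lemmata, and the
helper names (`type_token_ratio`, `chunk_ratios`) are hypothetical. It
shows each half of a text scoring "denser" than the whole, and how one
might compare equal-length chunks instead.

```python
def type_token_ratio(tokens):
    """Distinct word forms divided by total words in a list of tokens."""
    return len(set(tokens)) / len(tokens)


def chunk_ratios(tokens, size):
    """Ratio for each consecutive full chunk of `size` tokens.

    A short leftover tail is dropped so every chunk has equal length.
    """
    return [
        type_token_ratio(tokens[i:i + size])
        for i in range(0, len(tokens) - size + 1, size)
    ]


# Made-up sample text, not drawn from Shakespeare or Spenser.
toy = "the cat sat on the mat and the dog sat on the log".split()

whole = type_token_ratio(toy)
half = len(toy) // 2
first = type_token_ratio(toy[:half])
second = type_token_ratio(toy[half:])

print(f"whole: {whole:.2f}  first half: {first:.2f}  second half: {second:.2f}")
print("equal-length chunks of 6:", chunk_ratios(toy, 6))
```

The repeated words ("the," "sat," "on") count once in the whole but once
again in each half, which is why the halves look richer. Comparing
chunks of a fixed size, as in the 3000-word segments, sidesteps that.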
Incidentally, the numbers that are quoted for Shakespeare's vocabulary
vary widely and are not to be trusted. This morning I saw, in respectable
venues, 24,000 (The Independent) and 60,000 (Answers.com). The real
total, as calculated by Alfred Hart and Wordhoard, respectively, is
between 17,480 and 17,605.
Back to making sense: does Spenser really have as rich a vocabulary as
Shakespeare? If not, is there a different number that captures this?
Again, as I said at the beginning of this thread, I'm not equating
"rich vocabulary" with "better poet"; see examples from Yeats.
> A lemmatized FQ has about 12,000 lemmata.
What am I doing wrong, Craig? When I tell Wordhoard to do a "Word Form
Analysis" of FQ, I get "8,659 distinct lemmata (277,046 total
occurrences) in 'Faerie Queene.'"
And in case anyone thinks that Wordhoard is boring, check out what the
top five words are in Shakespeare and Spenser:
    Shakespeare   Spenser
1.  be            and [also the top word in Chaucer]
2.  I             the
3.  the           to
4.  and           be
5.  to            of
Notice the prominence of "I" in Shakespeare. (It's #20 in Spenser.)
And it's not just because Shakespeare is writing drama either. In
"Venus and Adonis," "I" is still the tenth most common word; in
"Lucrece," it's #14. Even in narrative, Shakespeare seems to think in
the first person. That's hardly surprising, and there are lots of
authors who do the same. What's interesting to me is that Spenser
doesn't.
--
Dr. David Wilson-Okamura http://virgil.org
English Department Virgil reception, discussion, documents, &c
East Carolina University Sparsa et neglecta coegi. -- Claude Fauchet