On Aug 4, 2010, at 2:29 PM, David Wilson-Okamura wrote:
> I'm not aware, though, of anything comparable with the statistics on
> vocabulary that Hart compiled for Shakespeare. For example: how many
> different words are there in his whole corpus?
11,091. This is available in WordHoard by selecting Find --> Find
Lemmata, and adding "Corpus is Spenser" as the sole search criterion.
Or you can get the same list with more detail about frequencies by
choosing Analysis --> Create Word Form List, then selecting Lemma as the
word form type and Spenser as the text. Martin mentioned this in the
post that Anne forwarded, but I thought I'd explain how to do it and
put it back in the context of David's original question.
> How many words does FQ I-III share with IV-VI, or with Mut.?
I don't think WordHoard can do that exactly. You can use Analysis -->
Compare Texts to get some statistical estimates of similarity between
two texts, and you can use Analysis --> Compare Many Word Forms to see
which words are most likely and least likely to occur in both texts.
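If you export a lemma list for each text (for instance from the word
form lists described above), the overlap itself is easy to compute
outside WordHoard. What follows is only a minimal sketch: the function
name and the toy word lists are my own inventions for illustration,
not WordHoard output or real Spenser data.

```python
def shared_vocabulary(lemmata_a, lemmata_b):
    """Return (shared, only_in_a, only_in_b) as sets of distinct lemmata."""
    set_a, set_b = set(lemmata_a), set(lemmata_b)
    return set_a & set_b, set_a - set_b, set_b - set_a


# Toy stand-ins for two exported lemma lists (hypothetical data).
first_text = ["knight", "lady", "dragon", "knight", "forest"]
second_text = ["knight", "shepherd", "lady", "river"]

shared, only_first, only_second = shared_vocabulary(first_text, second_text)
print(sorted(shared))
print(len(shared), len(only_first), len(only_second))
```

The same function would work for FQ I-III against IV-VI or against
Mut., provided each list has been lemmatized the same way.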
> The FQ is long, but is it also dense, lexically?
>
> What's "lexically dense"? I specifically don't mean "richly ambiguous"
> or "pleasantly polysemous." I'm talking about something more basic (or
> just stupid): how many different words does a poet use in a given
> number of lines? A few minutes ago, I typed one of Hart's tables into
> a spreadsheet and added a "density" calculation: number of distinct
> words in a play or poem divided by the number of lines. (How do you
> count lines of prose? Hart covered that in an earlier article.) Turns
> out that "Venus and Adonis" (1.76) and "Lucrece" (1.52) have the
> highest vocab density of anything Shakespeare wrote, including Hamlet
> (1.03), Lear (1.04), Macbeth (1.27), and Othello (.95). To put the
> numbers for "Venus" and "Lucrece" in perspective, the mean vocab
> density for the whole corpus is 1.06. Dude!
Now that I've actually gone back and read the message that started
this thread, I see that the ratio you are interested in is lemmata per
line, whereas what I calculated was the word count / lemma ratio. For
your method, a higher number implies greater density (which has a nice
intuitive feel to it), whereas for mine, a lower number implies
greater density, with 1.0 meaning a word is never used twice. It
should amount to much the same thing, though given the varying lengths
of lines of verse I think word counts might be a more reliable basis
than line counts.
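To make the two measures concrete, here is a small sketch that
computes both from the same list of lemmata. The function names, the
toy data, and the line count are assumptions made up for the example;
nothing here comes from Hart's tables or from WordHoard.

```python
def lemmata_per_line(lemmata, line_count):
    """David's measure: distinct lemmata divided by number of lines.
    Higher means denser."""
    return len(set(lemmata)) / line_count


def words_per_lemma(lemmata):
    """My measure: total word count divided by distinct lemmata.
    Lower means denser; 1.0 means no word is ever repeated."""
    return len(lemmata) / len(set(lemmata))


# Toy passage (hypothetical): 7 words, 4 distinct lemmata, 2 lines.
passage = ["the", "knight", "ride", "the", "plain", "the", "knight"]

print(lemmata_per_line(passage, line_count=2))  # 4 / 2 = 2.0
print(words_per_lemma(passage))                 # 7 / 4 = 1.75
```

Note that the first function needs a line count supplied from
outside, which is where the varying lengths of verse lines come in;
the second depends only on the words themselves.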
> I'd like to see if there's a similar difference in Spenser's corpus,
> between narrative, hymn, complaint, and lyric.
On reflection, I don't think it's going to be very easy to do this in
a meaningful way because texts of differing length really aren't
comparable. It might seem, intuitively, that since we're normalizing
by the number of lines or the number of words, the densities between
texts of disparate length should be comparable, but it doesn't
actually work that way. In general, the longer the text, the lower
the density; dilation trumps lexical innovation. You can see this in
a scatter plot of the data I posted yesterday:
<https://spreadsheets.google.com/oimg?key=0AsoJsnHCshCydC1zTVFKSG9DYU9fYmdzMDFMZTRBV1E&oid=1&zx=nk12zqieapgz>
The longer the text or segment of text, the lower the density (bigger
number on my chart). The differences between texts of similar length
are very small compared to the differences between texts of differing
length. This makes (common) sense. The longer you talk, the higher
the probability that each additional word you utter will be a word
you've used before. The lexically innovative powers of a Spenser or a
Shakespeare might stave off this inevitability just a bit longer than
the rest of us could do, but the constraints of memory and
comprehensibility win out as a text grows in length.
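The length effect can be illustrated by tracking the word count /
lemma ratio as a text grows. This is a rough sketch with an invented
function name and a toy token list, not the calculation behind my
chart.

```python
def running_words_per_lemma(lemmata, step):
    """Return (word_count, words_per_lemma) pairs, sampled every
    `step` words, as the text is read from the beginning."""
    seen = set()
    points = []
    for count, lemma in enumerate(lemmata, start=1):
        seen.add(lemma)
        if count % step == 0:
            points.append((count, count / len(seen)))
    return points


# Toy text (hypothetical): repeats pile up as it gets longer.
tokens = ["a", "b", "c", "a", "b", "a", "d", "a"]

for word_count, ratio in running_words_per_lemma(tokens, step=2):
    print(word_count, round(ratio, 2))
```

On the toy list the ratio climbs from 1.0 at two words to 2.0 at
eight. One possible way to compare texts of different length on this
basis would be to read off each text's ratio at the same word count,
though I have not tried that here.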
Now, not wanting to take away all of David's fun, I should point out an
accident of my analysis, which is that I reported data for the FQ by
canto, not book. So the density of a 3,000- to 4,000-word canto would
actually be comparable to that of Epithalamion, Muiopotmos, and some
others.
________________________________________
Craig A. Berry
"... getting out of a sonnet is much more
difficult than getting in."
Brad Leithauser