One of the techniques I experimented with was topic modelling.
Basically, we assume that documents are about things and that these
things can be identified by looking at the words that comprise the
documents.
We identify groups of words which appear in close proximity to each
other and then look at how these are distributed throughout the corpus.
But you should probably read David Blei on this, since he co-developed
one of the most popular mathematical approaches to topic modelling
(latent Dirichlet allocation). What we end up with is a range of
topics we reckon a collection covers and an understanding of which
documents are strong in which topics. For example, a text such as
‘Biggles Takes It Rough’ might, among a collection of other books, rate
comparatively highly for topics like ‘aircraft’ and not very highly for
topics like ‘quantum physics’ or ‘toxic masculinity’.
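
To make the idea a bit more concrete, here is a minimal sketch of latent Dirichlet allocation fitted by collapsed Gibbs sampling, using only the Python standard library. It is not the code I used, just an illustration under stated assumptions: the function name `fit_topics`, its parameters and default values, and the four tiny "documents" are all made up for the example. A real corpus would need proper tokenisation, stopword removal and far more text, and in practice you would probably reach for an established library instead.

```python
import random
from collections import Counter


def fit_topics(docs, num_topics=2, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (hypothetical helper, not a library API).

    docs is a list of token lists. Returns (doc_mix, top_words):
    doc_mix[d][k] is the estimated share of topic k in document d,
    top_words[k] lists up to three of the most frequent words in topic k.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Count tables that the sampler keeps up to date.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start by giving every word occurrence a random topic.
    assignments = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove this occurrence's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1

                # Weight each topic by how much this document already uses it
                # and how much the topic already uses this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]

                # Record the newly sampled assignment.
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed per-document topic proportions; each row sums to 1.
    doc_mix = [
        [(c + alpha) / (len(doc) + alpha * num_topics) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    top_words = [
        [w for w, c in tw.most_common(3) if c > 0] for tw in topic_word
    ]
    return doc_mix, top_words


# Made-up toy corpus: two "aircraft" documents and two "physics" documents.
docs = [
    "pilot aircraft engine runway aircraft pilot".split(),
    "aircraft squadron pilot hangar engine".split(),
    "quantum particle wave energy quantum".split(),
    "particle energy wave quantum physics".split(),
]

doc_mix, top_words = fit_topics(docs)
for k, words in enumerate(top_words):
    print("topic", k, words)
for d, mix in enumerate(doc_mix):
    print("doc", d, [round(p, 2) for p in mix])
```

The topics come out as unlabelled lists of words. Deciding that one of them means ‘aircraft’ is a judgement the reader makes afterwards, and on a corpus this small the groupings can easily shift with a different seed or number of topics.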
http://bit.ly/2BCZvHI
--
Peterk
Dallas, Tx