
One of the techniques I experimented with was topic modelling. Basically, we assume that documents are about things and that these things can be identified by looking at the words that make up the documents. We identify groups of words which tend to appear close to each other and then look at how these groups are distributed throughout the corpus. But you should probably read David Blei on this, since he invented one of the most popular mathematical approaches to topic modelling. What we end up with is a range of topics we reckon a collection covers, and an understanding of which documents are strong in which topics. For example, a text such as ‘Biggles Takes It Rough’ might, among a collection of other books, rate comparatively highly for topics like ‘aircraft’ and not very highly for topics like ‘quantum physics’ or ‘toxic masculinity’.
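To make the idea concrete, here is a minimal sketch in Python, not the method used in the quoted piece. It assumes an LDA-style model (latent Dirichlet allocation, the approach associated with Blei) fitted with a toy collapsed Gibbs sampler. The function name fit_topics, the four-document corpus and the parameter values (n_topics, alpha, beta, seed) are all made up for illustration. Real work would use an established library and far more text.

```python
import random
from collections import Counter


def fit_topics(docs, n_topics=2, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for an LDA-style topic model (illustrative only)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Count tables: topic use per document, word use per topic, tokens per topic.
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [Counter() for _ in range(n_topics)]
    topic_total = [0] * n_topics

    # Start by giving every word token a random topic.
    assign = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take this token out of the counts before resampling it.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1

                # A topic is likely if the document already uses it
                # and if the topic already uses this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]

                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed share of each topic in each document (each row sums to 1).
    doc_mix = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    # The three most frequent words currently assigned to each topic.
    top_words = [
        [w for w, c in topic_word[k].most_common() if c > 0][:3]
        for k in range(n_topics)
    ]
    return doc_mix, top_words


# Hypothetical corpus: two 'aircraft' documents and two 'physics' documents.
docs = [
    "aircraft pilot engine runway aircraft".split(),
    "pilot aircraft squadron engine".split(),
    "quantum particle physics wave quantum".split(),
    "physics particle quantum energy".split(),
]

doc_mix, top_words = fit_topics(docs)
for k, words in enumerate(top_words):
    print("topic", k, words)
for d, mix in enumerate(doc_mix):
    print("doc", d, [round(p, 2) for p in mix])
```

Each row of doc_mix is the estimated share of each topic in one document, which is the "strong in which topics" measure described above. The topics come back as numbered word lists, so a label like ‘aircraft’ is something a person reads into the words. Results depend on the random seed and on the number of topics chosen.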


http://bit.ly/2BCZvHI

--
Peterk
Dallas, Tx
Save our in-boxes! http://emailcharter.org
“If only there were a massive entity that I were forced to fund to tell me how I should live my life, since I’m so obviously incapable of deciding for myself.” M. Hashimoto