One of the techniques I experimented with was topic modelling. Basically,
we assume that documents are about things and that these things can be
identified by looking at the words that comprise the documents. [1]
<http://blog.nationalarchives.gov.uk/blog/read-43000-cabinet-papers/#note-36403-1>
We identify groups of words which appear in close proximity to each other
and then look at how these are distributed throughout the corpus. But
you should probably read David Blei on this, since he invented one of the
most popular mathematical approaches for topic modelling. What we end up
with is a range of topics we reckon a collection covers and an
understanding of which documents are strong in which topics. For example, a
text such as ‘Biggles Takes It Rough’ might, among a collection of other
books, rate comparatively highly for topics like ‘aircraft’ and not very
highly for topics like ‘quantum physics’ or ‘toxic masculinity’.
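
To make the idea above concrete, here is a minimal sketch of one way such a model can be fitted. It assumes the approach alluded to is latent Dirichlet allocation (LDA), fitted here with a simple collapsed Gibbs sampler; the excerpt does not say which method or software was actually used. The four toy documents, the function name lda_gibbs and every parameter value are made up for illustration only.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not optimised).

    docs is a list of token lists. Returns (doc_mix, top_words):
    doc_mix[d][k] is the estimated share of topic k in document d,
    top_words[k] lists up to three of the most frequent words in topic k.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    doc_topic = [[0] * num_topics for _ in docs]        # topic counts per document
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                       # total words per topic
    assignments = []                                     # topic of each token

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Weight each topic by how common it is in this document
                # and how strongly it is associated with this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed topic proportions for each document.
    doc_mix = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    top_words = [
        [w for w, c in counts.most_common(3) if c > 0] for counts in topic_word
    ]
    return doc_mix, top_words


# Made-up toy corpus: two documents about flying, two about physics.
docs = [
    "pilot aircraft engine pilot runway aircraft".split(),
    "aircraft squadron pilot engine flight".split(),
    "quantum particle wave quantum energy".split(),
    "particle energy wave physics quantum".split(),
]
doc_mix, top_words = lda_gibbs(docs, num_topics=2)
for k, words in enumerate(top_words):
    print("topic", k, words)
for d, mix in enumerate(doc_mix):
    print("doc", d, [round(p, 2) for p in mix])
```

Each row of doc_mix adds up to 1, so it can be read as "how much of this document is about each topic". The topics themselves come out as unlabelled groups of words; names like ‘aircraft’ are something a human reader attaches afterwards.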


http://bit.ly/2BCZvHI
http://bit.ly/2BCZvHI+

-- 
Peterk
Dallas, Tx
Save our in-boxes! http://emailcharter.org
“If only there were a massive entity that I were forced to fund to tell me
how I should live my life, since I’m so obviously incapable of deciding for
myself.” M. Hashimoto

Contact the list owner for assistance at [log in to unmask]

For information about joining, leaving and suspending mail (eg during a holiday) see the list website at
https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=archives-nra