It all started as a little data science project, possibly a job interview question for applicants: How would you compute the number of entries on Wikipedia?
The idea was to take a large keyword list (say 5,000,000 keywords) and check how many of those keywords have a Wikipedia entry, using a web crawler to run 5,000,000 searches on Wikipedia. Based on the number of hits found in your list (say 400,000), you would then estimate the total size of Wikipedia (say 6,000,000 articles).
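The post leaves the statistical model open, so here is one possible sketch, under two assumptions of mine: that article existence is checked through the standard MediaWiki query API (pages reported as "missing" have no entry), and that two independent keyword lists are used so the total can be estimated with the classical Lincoln-Petersen capture-recapture formula. The function and parameter names are illustrative, not from the original post.

```python
import json
import urllib.parse
import urllib.request

WIKI_API = "https://en.wikipedia.org/w/api.php"  # MediaWiki API endpoint

def titles_with_entry(titles):
    """Return the subset of `titles` that have a Wikipedia article.

    Uses the MediaWiki query API; pages returned with a 'missing'
    key have no entry. (Makes a network call, so it is defined here
    but not executed in this sketch.)
    """
    params = urllib.parse.urlencode({
        "action": "query",
        "titles": "|".join(titles),  # the API accepts up to 50 titles per request
        "format": "json",
    })
    with urllib.request.urlopen(f"{WIKI_API}?{params}") as resp:
        pages = json.load(resp)["query"]["pages"]
    return {p["title"] for p in pages.values() if "missing" not in p}

def lincoln_petersen(hits_a, hits_b, hits_both):
    """Capture-recapture estimate of the total number of articles.

    hits_a:    keywords from list A that have a Wikipedia entry
    hits_b:    keywords from list B that have a Wikipedia entry
    hits_both: keywords with an entry that appear on both lists
    """
    return hits_a * hits_b / hits_both

# Toy numbers: 400 hits from each list, 20 hits shared by both lists.
print(lincoln_petersen(400, 400, 20))  # -> 8000.0 estimated articles
```

With the figures from the post, 400,000 hits from each of two lists and roughly 26,700 overlapping hits would yield an estimate near 6,000,000 articles.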
Wikipedia actually publishes very precise statistics and historical data about its size and growth, which makes this project even more interesting: you can start with your own statistical model and keyword lists, and then check the result against real data!
http://bit.ly/18HKOwH