Hi everyone,
At the BFI we're experimenting with assigning a gender to the moving image makers in our database, using a common approach: comparing given names against names datasets that carry a gender property. (I know, it's a binary model, and it doesn't account for trans, intersex, gender-neutral and the other wonderful complexities, but we're focusing on 20th-century film history, so it's defensible for most of the corpus.)
We're building some Python tools to iterate through our person entities and compare each name (actually a set of names, including Used For entities and multiple given names) with a databank, passing one of four results to the output: m, f, ? or n/a, with ? indicating ambiguity or neutrality, and n/a indicating names not in our databank, etc.
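As a rough illustration of that four-way result (a sketch under assumptions, not the actual BFI tooling): the function name classify_person, the toy DATABANK and the rule for combining several names are all made up for this example. Here any disagreement between names, or any name marked ?, gives ? for the person.

```python
# Toy databank for illustration only; these entries are NOT taken from the
# ONS data. Keys are lower-cased given names, values are 'm', 'f' or '?'.
DATABANK = {"john": "m", "mary": "f", "leslie": "?"}


def classify_person(names, databank):
    """Return 'm', 'f', '?' or 'n/a' for one person's set of given names.

    Assumed combining rule: names missing from the databank are ignored;
    if every matched name agrees the person gets that code; any
    disagreement or any '?' yields '?'; no matches at all yields 'n/a'.
    """
    codes = set()
    for name in names:
        code = databank.get(name.strip().lower())
        if code is not None:
            codes.add(code)
    if not codes:
        return "n/a"
    if codes == {"m"}:
        return "m"
    if codes == {"f"}:
        return "f"
    return "?"


if __name__ == "__main__":
    for names in (["John"], ["Mary", "Zebedee"], ["John", "Mary"], ["Zebedee"]):
        print(names, classify_person(names, DATABANK))
```

Other combining rules are equally plausible, for example letting a definite m or f from one name outweigh a ? from another.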
There's lots of stuff out in the wild achieving similar results - eg https://genderize.io/ - but we're integrating with our Adlib API, so we're building it ourselves (although obviously learning from genderize.io etc).
We're getting very good results, using primarily the official boys / girls names dataset from the Office for National Statistics:
http://www.ons.gov.uk/ons/datasets-and-tables/index.html?pageSize=50&sortBy=none&sortDirection=none&newquery=baby+names
By the way, we treat names that appear in both the male and female sets as neutral / ambiguous, and we return a ? in those cases.
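The shared-names rule can be sketched with plain set operations. This is only an illustration: the function name build_databank and the lower-casing of names are assumptions, and the inputs are taken to be simple lists of given names already extracted from whatever format the source datasets use.

```python
def build_databank(male_names, female_names):
    """Merge two name lists into a dict of name -> 'm', 'f' or '?'.

    Names found in both lists are marked '?' (neutral / ambiguous).
    Names are normalised by stripping whitespace and lower-casing,
    which is an assumption of this sketch.
    """
    male = {n.strip().lower() for n in male_names if n.strip()}
    female = {n.strip().lower() for n in female_names if n.strip()}
    shared = male & female  # names present in both sets

    databank = {}
    for name in male | female:
        if name in shared:
            databank[name] = "?"
        elif name in male:
            databank[name] = "m"
        else:
            databank[name] = "f"
    return databank


if __name__ == "__main__":
    # Toy input, not real dataset contents.
    print(build_databank(["John", "Leslie"], ["Mary", "leslie"]))
```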
We're also starting to make use of other names datasets - for example the Carnegie Mellon University names corpus: http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/corpora/names/
And we're hoping to augment with datasets representing key diasporan contributors to British filmmaking - for example: https://archive.org/details/india-names-dataset
Googling finds lots of possible resources, eg this Stack Overflow discussion:
http://stackoverflow.com/questions/818203/does-anyone-know-of-a-good-library-for-mapping-a-persons-name-to-his-or-her-gen
So my question to the group is: can you direct us to good open-licensed gender-qualified given-names datasets? Do you have some you have developed for this purpose, or can you point me at URLs?
We can share the toolset with the community once we've perfected it, if that's of interest. It will probably be most interesting to Adlib users with Python skills and names data, as we're building call / parse methods for the Adlib API (asynchronous calls, multi-threaded, getting good speeds). Sounds like a lonely hearts ad for informatics geeks.
Thanks in advance,
Stephen
****************************************************************
website: http://museumscomputergroup.org.uk/
Twitter: http://www.twitter.com/ukmcg
Facebook: http://www.facebook.com/museumscomputergroup
[un]subscribe: http://museumscomputergroup.org.uk/email-list/
****************************************************************