Fascinating work!
Cameron Blevins and Lincoln Mullen have written an R package that can infer gender from personal names, based on large historical datasets: http://www.digitalhumanities.org/dhq/vol/9/3/000223/000223.html
They might be able to publish a dataset of gendered names from different periods based on it - it's probably worth asking them if one isn't already obviously available.
Cheers, Mia
> On 5 Feb 2016, at 15:32, Stephen McConnachie wrote:
>
> Hi everyone,
>
> At the BFI we're experimenting with genderising moving image makers in our database, using a common approach: comparing given names with names datasets that have gender properties. (I know, it's a binary model, and it doesn't face up to trans, intersex, gender-neutral, and the other wonderful complexities - but we're focusing on 20th-century film history, so it's defensible for most of the corpus.)
>
> We're building some Python tools to iterate through our person entities, compare each name (actually a set of names, including Used For entities and multiple given names) with a databank, and pass one of four results into the output: m, f, ? or n/a - with ? indicating ambiguity or neutrality, and n/a indicating names not in our databank, etc.
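
The four-way outcome described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the databank is taken to be a plain dict from lowercased given name to 'm', 'f' or '?', the names `DATABANK` and `classify_person` are hypothetical, and the rule for combining several names (a single definite gender wins, disagreement gives '?') is a guess rather than the BFI's actual logic.

```python
# Hypothetical toy databank; a real one would be loaded from a names dataset.
DATABANK = {"alfred": "m", "muriel": "f", "leslie": "?"}


def classify_person(names, databank=DATABANK):
    """Return 'm', 'f', '?' or 'n/a' for the set of given names of one person."""
    keys = [n.strip().lower() for n in names]
    found = {databank[k] for k in keys if k in databank}
    if not found:
        return "n/a"  # none of the names are in the databank
    # Assumption: an ambiguous name does not override a definite one.
    definite = found - {"?"}
    if len(definite) == 1:
        return next(iter(definite))
    return "?"  # only ambiguous names, or names that disagree
```

For example, `classify_person(["Leslie", "Alfred"])` falls back on the definite name, while a person whose names are all missing from the databank comes out as n/a.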
>
> There's lots of stuff out in the wild achieving similar results - e.g. https://genderize.io/ - but we're integrating with our Adlib API, so we're building it ourselves (although obviously learning from genderize.io etc.).
>
> We're getting very good results, using primarily the official boys / girls names dataset from the Office for National Statistics:
> http://www.ons.gov.uk/ons/datasets-and-tables/index.html?pageSize=50&sortBy=none&sortDirection=none&newquery=baby+names
>
> By the way, we identified the set of names shared between the male and female sets as neutral / ambiguous, and we return a ? in those cases.
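
One way to derive that shared set is a simple set intersection over the two name lists. The sketch below assumes the boys' and girls' names have already been read into two Python iterables; `build_databank` is a hypothetical name, and treating any overlap at all as ambiguous (ignoring how often each name occurs) is a simplification.

```python
def build_databank(male_names, female_names):
    """Map each lowercased name to 'm', 'f', or '?' if it appears in both lists."""
    male = {n.strip().lower() for n in male_names}
    female = {n.strip().lower() for n in female_names}
    shared = male & female  # names found in both sets are treated as ambiguous

    databank = {name: "m" for name in male - shared}
    databank.update({name: "f" for name in female - shared})
    databank.update({name: "?" for name in shared})
    return databank
```

Because this ignores counts, a name given overwhelmingly to one gender would still be marked ? if it appears even once in the other list; a frequency threshold would be a possible refinement.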
>
> We're also starting to make use of other names datasets - for example the Carnegie Mellon University names corpus: http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/corpora/names/
>
> And we're hoping to augment with datasets representing key diasporan contributors to British filmmaking - for example: https://archive.org/details/india-names-dataset
>
> Googling finds lots of possible resources, e.g. this Stack Overflow discussion:
> http://stackoverflow.com/questions/818203/does-anyone-know-of-a-good-library-for-mapping-a-persons-name-to-his-or-her-gen
>
> So my question to the group is: can you direct us to good open-licensed gender-qualified given-names datasets? Do you have some you have developed for this purpose, or can you point me at URLs?
>
> We can share the toolset with the community once we've perfected it, if that's of interest. It will probably be most interesting to Adlib users with Python skills and names data, as we're building call / parse methods for the Adlib API (asynchronous calls, multi-threaded, getting good speeds). Sounds like a lonely hearts ad for informatics geeks.
>
> Thanks in advance,
> Stephen
>
> ****************************************************************
> website: http://museumscomputergroup.org.uk/
> Twitter: http://www.twitter.com/ukmcg
> Facebook: http://www.facebook.com/museumscomputergroup
> [un]subscribe: http://museumscomputergroup.org.uk/email-list/
> ****************************************************************