Sometimes contact information is incomplete but can be inferred from existing data. Gender is often missing from data but easy to determine based on first name.
One solution is to check names against existing data: run a query against known valid name/gender pairs, and the gender with the most occurrences for that name wins.
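The lookup approach can be sketched in a few lines. This is a minimal illustration, not the code from the project, and the name/gender pairs here are made up:

```python
from collections import Counter, defaultdict

def build_lookup(name_gender_pairs):
    """Count how often each name appears with each gender."""
    counts = defaultdict(Counter)
    for name, gender in name_gender_pairs:
        counts[name.upper()][gender] += 1
    return counts

def lookup_gender(counts, name):
    """Return the gender seen most often for this name, or None if unknown."""
    c = counts.get(name.upper())
    return c.most_common(1)[0][0] if c else None

# Made-up example data
pairs = [("Anna", "F"), ("Anna", "F"), ("Anna", "M"), ("John", "M")]
table = build_lookup(pairs)
print(lookup_gender(table, "anna"))   # F
print(lookup_gender(table, "Maria"))  # None
```

The weakness is visible right away: any name not in the table returns None.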
But what about new names and alternate spellings?
It turns out that there are features that are indicative of one gender or the other. For example, a name ending in ‘a’ is more likely to be female than male. There are other patterns too, such as the last two letters of a name.
We could write a series of heuristics to make a determination but that does not seem like a scalable idea. I’d like to be able to apply this approach to other languages and not have to learn the ins and outs of each.
What we need to do is figure out which features indicate which gender and how strongly they do so.
I think ML tends to scare a lot of people, so when I recommend an ML solution to someone, I tend to call it a statistical approach to the problem. That’s what I’ll call this one too: a statistical approach.
What we are doing is classifying the data into one of two categories, male or female. For this I chose one of my favourite classifiers, Naive Bayes. I’m a fan of Naive Bayes because its basis is simple to understand and it performs decently well (in my experience).
I’m a big fan of the Natural Language Toolkit’s (NLTK) easy interface to classifiers such as Naive Bayes, and it’s what I used for this project.
First, we’re going to need some data to train the classifier on to see which features indicate which gender and how much we can trust the feature. I grabbed training data from the US Census website and wrote an importer module for it in Python.
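An importer along these lines is easy to sketch. This assumes the whitespace-delimited format of the 1990 US Census name-frequency files (one NAME FREQ CUMFREQ RANK record per line); the filenames in the comment are illustrative and the actual importer in the project may differ:

```python
def load_census_names(path, gender):
    """Parse a US Census frequency file (NAME FREQ CUMFREQ RANK per line,
    whitespace-delimited) into (name, gender) pairs."""
    pairs = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if fields:
                pairs.append((fields[0].capitalize(), gender))
    return pairs

# Hypothetical usage with the Census files:
# names = load_census_names("dist.male.first", "M") + \
#         load_census_names("dist.female.first", "F")
```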
Second, we need a feature extractor that takes a name and spits out features we think may indicate the gender well. I wrote a simple extractor that emits the last letter, the last two letters, and whether the last letter is a vowel:
Third, we need to test the classifier, making sure we keep the training data set separate from the test data set. We’re interested in the classifier’s ability to determine gender from names it has not encountered before; if we just wanted a lookup, a hash table would be much more efficient. So we randomly shuffle the data and split it. I chose an 80/20 train/test split, but that’s something you can play with.
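The shuffle/split/train/score cycle can be sketched with NLTK like so. The stand-in feature function and the sample data are mine, not the project’s:

```python
import random

import nltk

def name_features(name):
    """Minimal stand-in feature extractor (last letter only)."""
    return {"last_letter": name[-1].lower()}

def train_and_test(labeled_names, train_fraction=0.8):
    """Shuffle the labeled data, split it, train Naive Bayes on the
    first chunk, and measure accuracy on the held-out remainder."""
    data = list(labeled_names)
    random.shuffle(data)
    featurized = [(name_features(n), g) for n, g in data]
    cut = int(len(featurized) * train_fraction)
    train_set, test_set = featurized[:cut], featurized[cut:]
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    return classifier, nltk.classify.accuracy(classifier, test_set)
```

Because the split is random, accuracy will vary a little from run to run; averaging over several shuffles gives a steadier estimate.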
Fourth, we need to learn which features matter. The NLTK provides a nice method which will tell us which features were most useful in determining the gender. This way we can concentrate on features that really matter.
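The NLTK method in question is show_most_informative_features on a trained NaiveBayesClassifier. Here is a self-contained toy example; with real data you would pass in the featurized census names instead of this made-up training set:

```python
import nltk

# Made-up toy training set so the example runs on its own.
train_set = [
    ({"last_letter": "a"}, "female"),
    ({"last_letter": "a"}, "female"),
    ({"last_letter": "e"}, "female"),
    ({"last_letter": "n"}, "male"),
    ({"last_letter": "n"}, "male"),
    ({"last_letter": "k"}, "male"),
]
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Prints the feature/value pairs ranked by how strongly they
# separate the two classes.
classifier.show_most_informative_features(5)
```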
I’ve done a lot of the wrapper work for you and put it up on GitHub. Check out the gender prediction code here.
If you run genderPredictor.py, it will automatically train and test the classifier. You can also import genderPredictor into your own code and run the methods manually. The most useful method to use within your own code is classify(name), which takes a name and spits out the gender. You can modify _nameFeatures to play around and test other feature ideas.
If you find something that works better, please let me know and I’ll incorporate your idea and give you credit.
Hope this is useful and interesting; let me know what you think.