I created a machine learning model to classify names as masculine or feminine based on letters, length, and pronunciation, training it on a set of the most masculine and feminine names in United States name data. Using exploratory data analysis, statistical testing, and supervised learning, I found that the most accurate model was logistic regression trained with all features, which had an accuracy of 86.8%.
Though gender-neutral names have always existed, most names are associated with one gender or the other. For English names, there are consistent differences between masculine and feminine names. I read several papers by Herbert Barry and Aylene Harper to learn the exact differences. Feminine names often end in vowel sounds, especially A; more specifically, they end in sonorants, which are smooth sounds such as vowels, L, N, R, or W. Masculine names tend to end in harsher sounds, and ending in N is the most gender-neutral ending. Additionally, feminine names are slightly longer on average, though length varies depending on which names are popular. I wanted to create a machine learning model to classify names as masculine or feminine using the initial, ending letter, and soundex code, a code that approximates the phonetic pronunciation.
My research question was:
How do length, initial, last letter, and phonetic pronunciation differ between masculine and feminine names?
This led to my null hypothesis:
Masculine and feminine names do not differ in letters, length, or pronunciation, so a machine learning model would have 50% accuracy.
and my alternative hypothesis:
Masculine and feminine names do differ in letters, length, or pronunciation, so a machine learning model would have greater than 50% accuracy.
Prior to beginning this project, I wanted to find the most feminine, most masculine, and gender-neutral names in the dataset. I did this by first making a set of all the names that have ever appeared as a female name, then a set of all the names that have ever appeared as a male name. The names only in the female set are the most feminine names, the names only in the male set are the most masculine, and the intersection is the gender-neutral names. When creating text files of these names, I summed the number of babies given each name over the 145 years of data. I used the 1,000 most common feminine and 1,000 most common masculine names as the data for the train/test split.
A visualization of the masculine and feminine name sets
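The set logic above can be sketched with Python's built-in set operations. The four-name lists here are hypothetical stand-ins for the full Social Security name data:

```python
# Hypothetical stand-ins for the sets of names that ever appeared
# as female names or as male names in the data.
female_names = {"Mary", "Linda", "Riley", "Jordan"}
male_names = {"James", "Robert", "Riley", "Jordan"}

feminine_only = female_names - male_names   # only ever used for girls
masculine_only = male_names - female_names  # only ever used for boys
gender_neutral = female_names & male_names  # appeared in both sets

print(sorted(feminine_only))   # ['Linda', 'Mary']
print(sorted(masculine_only))  # ['James', 'Robert']
print(sorted(gender_neutral))  # ['Jordan', 'Riley']
```

Set difference and intersection run in roughly linear time, so the same code scales to the full name lists.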
Machine learning models need numeric data, so the initial and last letter columns were one-hot encoded. Columns were also created for the number of letters and the number of vowels in each name. A six-letter soundex code was generated using a method from geeksforgeeks.org. Because the classification is binary, the target variable was the is_male column, which contained a 1 if the name was in the masculine set. This project was done in Python, and the data was stored in a Pandas DataFrame with 55 columns. Since I am still learning about machine learning, I wanted to experiment with using all 55 columns and with using fewer columns. To account for the curse of dimensionality, where high-dimensional data appears sparse, principal component analysis was performed using the Sklearn library and five principal components were kept. Finally, the data was standardized and split into stratified training and testing sets to keep the ratio of male to female names the same.
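A minimal sketch of this preprocessing, using a hypothetical eight-name table in place of the real 2,000-name DataFrame (and only a few engineered columns instead of the full 55; three principal components instead of five, since the toy table is tiny):

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical miniature feature table.
df = pd.DataFrame({
    "name": ["Olivia", "Emma", "Ava", "Sophia", "Liam", "Noah", "James", "Lucas"],
    "is_male": [0, 0, 0, 0, 1, 1, 1, 1],
})
df["length"] = df["name"].str.len()
df["n_vowels"] = df["name"].str.lower().str.count(r"[aeiou]")
df["initial"] = df["name"].str[0]
df["last"] = df["name"].str[-1]

# One-hot encode initial and last letter (one 0/1 column per letter).
X = pd.get_dummies(df[["length", "n_vowels", "initial", "last"]],
                   columns=["initial", "last"])
y = df["is_male"]

# Stratified split keeps the male/female ratio equal in both halves.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Standardize, then project onto principal components.
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=3).fit(scaler.transform(X_train))
X_train_pca = pca.transform(scaler.transform(X_train))
```

Fitting the scaler and PCA on the training split only, then applying them to the test split, avoids leaking test-set statistics into the model.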
To explore the distributions of variables and the relationships between them, I created many graphs; just a few of them are displayed here. I made a few important observations from the graphs. First, among the ending letters, there are no names ending in Q or X. There are also just under 500 names ending in A. Only seven of these are male, and, when I looked up their origins, I found that none of them are English names. Finally, while the Social Security Administration allows names between two and 15 letters, none of the names were longer than 11 letters.
Data visualizations showing the relationships between is_male and other variables
Next, I performed statistical tests to determine which variables had a significant relationship with being male. Chi-squared tests of independence were performed for the categorical variables and t-tests for the numerical variables. The only variables with a significant relationship with being male were the number of vowels; starting with L, M, O, or W; and ending with A, C, D, E, H, I, K, L, M, N, O, R, S, or T. Many of the sonorants appear among the significant ending letters.
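The two kinds of tests can be sketched with SciPy. The is_male, ends_in_a, and n_vowels arrays here are simulated stand-ins for the real columns, with effect sizes chosen only for illustration:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, ttest_ind

rng = np.random.default_rng(0)
is_male = rng.integers(0, 2, size=500)  # 1 = male

# Simulated binary feature: feminine names end in A far more often.
ends_in_a = np.where(is_male == 0,
                     rng.random(500) < 0.4,
                     rng.random(500) < 0.02).astype(int)
# Simulated numeric feature: feminine names get one extra vowel on average.
n_vowels = rng.poisson(2 + (1 - is_male))

# Chi-squared test of independence for a categorical feature.
table = pd.crosstab(ends_in_a, is_male)  # 2x2 contingency table
chi2, p_cat, dof, _ = chi2_contingency(table)

# Two-sample t-test for a numeric feature.
t, p_num = ttest_ind(n_vowels[is_male == 1], n_vowels[is_male == 0])

print(f"ending in A:  p = {p_cat:.3g}")
print(f"vowel count:  p = {p_num:.3g}")
```

A p-value below the chosen significance level (commonly 0.05) rejects independence between the feature and the is_male label.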
I began by training a few supervised learning algorithms. Because I am still learning, I wanted to explore how using different features might affect the accuracy. For each algorithm, I created a model with all the features, one with only the features chosen by greedy feature selection for that algorithm, and one with the five principal components. The algorithms I used were KNN, Gaussian process, naive Bayes, logistic regression, and random forest from the Sklearn library.
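The all-features versus greedy-selection comparison can be sketched with Sklearn's SequentialFeatureSelector, one implementation of greedy forward selection (this may differ in detail from the selection procedure actually used). The synthetic data below stands in for the name features:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the one-hot/length/soundex feature table.
X, y = make_classification(n_samples=400, n_features=20,
                           n_informative=5, random_state=0)

for clf in [KNeighborsClassifier(), GaussianNB(),
            LogisticRegression(max_iter=1000)]:
    # Cross-validated accuracy with all features.
    acc_all = cross_val_score(clf, X, y, cv=5).mean()
    # Greedily add features one at a time, keeping the 5 that help
    # this particular algorithm most.
    sfs = SequentialFeatureSelector(clf, n_features_to_select=5).fit(X, y)
    acc_sel = cross_val_score(clf, sfs.transform(X), y, cv=5).mean()
    print(f"{type(clf).__name__}: all={acc_all:.3f} selected={acc_sel:.3f}")
```

Because the selector is refit per algorithm, each model gets its own feature subset, mirroring the per-algorithm selection described above.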
With PCA, the best model was KNN (82.6% accuracy); with all features, it was logistic regression (84.6% accuracy); and with greedy feature selection, it was also logistic regression (84.2% accuracy). The accuracies were close, but I tuned the hyperparameters of logistic regression with all features to increase the accuracy. Using GridSearchCV from Sklearn, the best estimator was a model with C = 0.0886. After training this new model, the accuracy was 86.8% and the F1 score was 0.867, so false positives and false negatives were balanced. To determine the significance of this result, I ran a 10-fold cross-validation and used a t-test to compare the fold accuracies to 50%. 86.8% is significantly higher than the 50% that would come from guessing.
Confusion matrix of the logistic regression model after hyperparameter tuning
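The tuning and significance test can be sketched as follows, again with synthetic data in place of the name features; the grid of C values is illustrative (the project's best value was C = 0.0886):

```python
import numpy as np
from scipy.stats import ttest_1samp
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in for the name feature table.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Search over the inverse regularization strength C.
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    {"C": np.logspace(-3, 1, 20)}, cv=5).fit(X, y)
best = grid.best_estimator_

# 10-fold cross-validation, then a one-sample t-test comparing the
# fold accuracies against the 50% expected from random guessing.
scores = cross_val_score(best, X, y, cv=10)
t, p = ttest_1samp(scores, 0.5)
print(f"mean accuracy = {scores.mean():.3f}, p vs 50% = {p:.3g}")
```

Smaller C means stronger regularization, so a best value well below 1 suggests the 55-column model benefits from shrinking its coefficients.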
Confusion matrix of the logistic regression model after testing on select unisex names