I created a machine learning model to classify names as masculine or feminine based on letters, length, and pronunciation, training it on a set of the most masculine and feminine names in United States name data. Using exploratory data analysis, statistical testing, and supervised learning, I found that the most accurate model was logistic regression trained with all features, which had an accuracy of 86.8%.
Though gender-neutral names have always existed, most names are associated with one gender or the other. For English names, there are consistent differences between masculine and feminine names. I read several papers by Herbert Barry and Aylene Harper to learn the exact differences. Feminine names often end in vowel sounds, especially A; more specifically, they end in sonorants, which are smooth sounds such as vowels, L, N, R, or W. Masculine names tend to end in harsher sounds, and ending in N is the most gender-neutral ending. Additionally, feminine names are slightly longer on average, though length varies depending on which names are popular. I wanted to create a machine learning model to classify names as masculine or feminine using the initial, ending letter, and soundex code, a code that approximates the phonetic pronunciation.
My research question was:
How do length, initial, last letter, and phonetic pronunciation differ between masculine and feminine names?
This led to my null hypothesis:
Masculine and feminine names do not differ in letters, length, or pronunciation, so a machine learning model would have 50% accuracy.
and my alternative hypothesis:
Masculine and feminine names do differ in letters, length, or pronunciation, so a machine learning model would have greater than 50% accuracy.
Prior to beginning this project, I wanted to find the most feminine, most masculine, and gender-neutral names in the dataset. I did this by first making a set of all the names that have ever appeared as a female name, then a set of all the names that have ever appeared as a male name. The names only in the female set are the most feminine names, the names only in the male set are the most masculine, and the intersection is the gender-neutral names. When creating text files of these names, I summed the number of babies given each name over the 145 years of data. I used the 1,000 most common feminine and 1,000 most common masculine names as the data for the train/test split.
A visualization of the masculine and feminine name sets
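The set logic above can be sketched with Python's built-in set operations. The four-name lists here are hypothetical stand-ins for the full Social Security name data:

```python
# Hypothetical stand-ins for the sets of names that ever appeared
# as female names or as male names in the data.
female_names = {"Mary", "Linda", "Riley", "Jordan"}
male_names = {"James", "Robert", "Riley", "Jordan"}

feminine_only = female_names - male_names   # only ever used for girls
masculine_only = male_names - female_names  # only ever used for boys
gender_neutral = female_names & male_names  # appeared in both sets

print(sorted(feminine_only))   # ['Linda', 'Mary']
print(sorted(masculine_only))  # ['James', 'Robert']
print(sorted(gender_neutral))  # ['Jordan', 'Riley']
```

Set difference and intersection run in roughly linear time, so the same code scales to the full name lists.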
Machine learning models need numeric data, so the initial and last letter columns were one-hot encoded. Columns were also created for the number of letters and the number of vowels in each name. A six-letter soundex code was generated using a method from geeksforgeeks.org. Because the classification is binary, the target variable was the is_male column, which contained a 1 if the name was in the masculine set. This project was done in Python, and the data was stored in a Pandas DataFrame with 55 columns. Since I am still learning about machine learning, I wanted to experiment with using all 55 columns and with using fewer columns. To account for the curse of dimensionality, where high-dimensional data appears sparse, principal component analysis was performed using the Sklearn library and five principal components were kept. Finally, the data was standardized and split into stratified training and testing sets to keep the ratio of male to female names the same.
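A minimal sketch of this preprocessing, using a hypothetical eight-name table in place of the real 2,000-name DataFrame (and only a few engineered columns instead of the full 55; three principal components instead of five, since the toy table is tiny):

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical miniature feature table.
df = pd.DataFrame({
    "name": ["Olivia", "Emma", "Ava", "Sophia", "Liam", "Noah", "James", "Lucas"],
    "is_male": [0, 0, 0, 0, 1, 1, 1, 1],
})
df["length"] = df["name"].str.len()
df["n_vowels"] = df["name"].str.lower().str.count(r"[aeiou]")
df["initial"] = df["name"].str[0]
df["last"] = df["name"].str[-1]

# One-hot encode initial and last letter (one 0/1 column per letter).
X = pd.get_dummies(df[["length", "n_vowels", "initial", "last"]],
                   columns=["initial", "last"])
y = df["is_male"]

# Stratified split keeps the male/female ratio equal in both halves.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Standardize, then project onto principal components.
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=3).fit(scaler.transform(X_train))
X_train_pca = pca.transform(scaler.transform(X_train))
```

Fitting the scaler and PCA on the training split only, then applying them to the test split, avoids leaking test-set statistics into the model.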
To explore the distributions of variables and the relationships between them, I created many graphs; just a few of them are displayed here. I made a few important observations from the graphs. First, among the ending letters, there are no names ending in Q or X. There are also just under 500 names ending in A. Only seven of these are male, and, when I looked up their origins, I found that none of them are English names. Finally, while the Social Security Administration allows names between two and 15 letters, none of the names were longer than 11 letters.
Data visualizations showing the relationships between is_male and other variables
Next, I performed statistical tests to determine which variables had a significant relationship with being male. Chi-squared tests of independence were performed for the categorical variables and t-tests for the numerical variables. The only variables with a significant relationship with being male were the number of vowels; starting with L, M, O, or W; and ending with A, C, D, E, H, I, K, L, M, N, O, R, S, or T. Many of the sonorants appear among the significant ending letters.
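The two kinds of tests can be sketched with SciPy. The is_male, ends_in_a, and n_vowels arrays here are simulated stand-ins for the real columns, with effect sizes chosen only for illustration:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, ttest_ind

rng = np.random.default_rng(0)
is_male = rng.integers(0, 2, size=500)  # 1 = male

# Simulated binary feature: feminine names end in A far more often.
ends_in_a = np.where(is_male == 0,
                     rng.random(500) < 0.4,
                     rng.random(500) < 0.02).astype(int)
# Simulated numeric feature: feminine names get one extra vowel on average.
n_vowels = rng.poisson(2 + (1 - is_male))

# Chi-squared test of independence for a categorical feature.
table = pd.crosstab(ends_in_a, is_male)  # 2x2 contingency table
chi2, p_cat, dof, _ = chi2_contingency(table)

# Two-sample t-test for a numeric feature.
t, p_num = ttest_ind(n_vowels[is_male == 1], n_vowels[is_male == 0])

print(f"ending in A:  p = {p_cat:.3g}")
print(f"vowel count:  p = {p_num:.3g}")
```

A p-value below the chosen significance level (commonly 0.05) rejects independence between the feature and the is_male label.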
I began by training a few supervised learning algorithms. Because I am still learning, I wanted to explore how using different features might affect the accuracy. For each algorithm, I created a model with all the features, one with only the features chosen by greedy feature selection for that algorithm, and one with the five principal components. The algorithms I used were KNN, Gaussian process, naive Bayes, logistic regression, and random forest from the Sklearn library.
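The all-features versus greedy-selection comparison can be sketched with Sklearn's SequentialFeatureSelector, one implementation of greedy forward selection (this may differ in detail from the selection procedure actually used). The synthetic data below stands in for the name features:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the one-hot/length/soundex feature table.
X, y = make_classification(n_samples=400, n_features=20,
                           n_informative=5, random_state=0)

for clf in [KNeighborsClassifier(), GaussianNB(),
            LogisticRegression(max_iter=1000)]:
    # Cross-validated accuracy with all features.
    acc_all = cross_val_score(clf, X, y, cv=5).mean()
    # Greedily add features one at a time, keeping the 5 that help
    # this particular algorithm most.
    sfs = SequentialFeatureSelector(clf, n_features_to_select=5).fit(X, y)
    acc_sel = cross_val_score(clf, sfs.transform(X), y, cv=5).mean()
    print(f"{type(clf).__name__}: all={acc_all:.3f} selected={acc_sel:.3f}")
```

Because the selector is refit per algorithm, each model gets its own feature subset, mirroring the per-algorithm selection described above.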
With PCA, the best model was KNN (82.6% accuracy); with all features, it was logistic regression (84.6% accuracy); and with greedy feature selection, it was also logistic regression (84.2% accuracy). The accuracies were close, but I tuned the hyperparameters of logistic regression with all features to increase the accuracy. Using GridSearchCV from Sklearn, the best estimator was a model with C = 0.0886. After training this new model, the accuracy was 86.8% and the F1 score was 0.867, so false positives and false negatives were balanced. To determine the significance of this result, I ran a 10-fold cross-validation and used a t-test to compare the fold accuracies to 50%. 86.8% is significantly higher than the 50% that would come from guessing.
Confusion matrix of the logistic regression model after hyperparameter tuning
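The tuning and significance test can be sketched as follows, again with synthetic data in place of the name features; the grid of C values is illustrative (the project's best value was C = 0.0886):

```python
import numpy as np
from scipy.stats import ttest_1samp
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in for the name feature table.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Search over the inverse regularization strength C.
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    {"C": np.logspace(-3, 1, 20)}, cv=5).fit(X, y)
best = grid.best_estimator_

# 10-fold cross-validation, then a one-sample t-test comparing the
# fold accuracies against the 50% expected from random guessing.
scores = cross_val_score(best, X, y, cv=10)
t, p = ttest_1samp(scores, 0.5)
print(f"mean accuracy = {scores.mean():.3f}, p vs 50% = {p:.3g}")
```

Smaller C means stronger regularization, so a best value well below 1 suggests the 55-column model benefits from shrinking its coefficients.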
Confusion matrix of the logistic regression model after testing on select unisex names