Machine Learning Newsletter

Name Changes Across Years in USA

In [1]:

The Social Security Administration has records of baby names going back to 1880 and publishes the names along with their locations. In this post, my aim is to first look at the name distribution (the most popular names) over the years and then look at the trends of the names.

Why names?

Some may ask. Names are important to individuals because they become an integral part of their identities, but more importantly, they can also reveal social characteristics of the parents. They may convey the parents' historical background, socioeconomic class, political views and much more, and they may also reveal interesting information about society: cultural trends, language, and the popular movies (!) around the time of birth.

Consider two societies, one that values the individual over the community and another that values the community over the individual. You would naturally expect names to be more distinctive in the first society, where people want to stand out and emphasize their individuality, whereas the second society would have no problem giving the same name to many different people.

If this is too abstract, the military is a good concrete example. It is not enough to make clothing and looks the same (similar haircuts, always shaved, accepting only a particular height/weight range); everybody also gets called soldier. No names. There is no individuality: everybody is treated the same, and everybody is expected to behave and obey the rules under the same conditions. Individuality and personal preferences are no longer very important.

So, names are important. But why do people choose one name over others? As I mentioned earlier, names depend not only on parents but also on society, trends and culture. They are influenced by society, and partly for this reason they may also create biases against different segments of society. If you are still not convinced that names are important and play a significant part in our lives, take a look at the following paper: some names (not people) are even subject to racial discrimination.

Let's look at what the data has to say about name trends over the years in the USA.

In [5]:

The number of births generally increases over the years. The bump around the 1960s is due to the Baby Boomers; since that jump is something of an outlier, the numbers return to normal around the 1970s. The increase around 1910 is much more spectacular, as the number of babies more than quadruples in 10 years.

One thing that is quite interesting to me is that the number of girls is consistently higher than (around the 1900s) or very close to (around the 1940s) the number of boys, but then something happens around the 1950s: boys take the lead and never lose it again. I first wondered whether some families did not register their baby boys, but the correlation between the boy and girl counts is quite high. If under-registration were the cause, we would expect inconsistencies between the two series over the years rather than such a close correlation; in other words, if something unnatural were happening, the correlation would be lower. It could be an interesting research topic (even the small bump in 1900 is similar for both genders).
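The comparison of yearly boy and girl totals can be sketched roughly as follows. This is a minimal sketch on made-up toy numbers, not the original notebook code: the column names (name, gender, n_births, year) follow the tables shown in this post, while the `names` DataFrame and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data; the real data has one row per (name, gender, year).
names = pd.DataFrame({
    "name":     ["Mary", "Anna", "John", "Mary", "Anna", "John", "Mary", "John"],
    "gender":   ["F", "F", "M", "F", "F", "M", "F", "M"],
    "n_births": [70, 30, 90, 80, 40, 110, 150, 150],
    "year":     [1880, 1880, 1880, 1881, 1881, 1881, 1882, 1882],
})

# Total births per year, with one column per gender.
births = names.pivot_table(index="year", columns="gender",
                           values="n_births", aggfunc="sum")

# Pearson correlation between the yearly girl and boy totals.
corr = births["F"].corr(births["M"])

print(births)
print(corr)
```

A correlation close to 1 only says that the two series move together; it does not by itself say which gender has more births in a given year, so the two columns of `births` still need to be compared directly.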

In [6]:
F         165280729
M         168137041
Name: n_births, dtype: int64

At least 165,280,729 girls and 168,137,041 boys were born over the 134 years from 1880 to 2014. Since the data does not include names given to fewer than 5 babies, these numbers should be considered a lower bound.
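Totals like the ones above are what a group-by sum over the gender column would produce. Here is a minimal sketch, assuming a `names` DataFrame with the columns used in this post; the rows below are invented, and this is not necessarily how the original cell was written.

```python
import pandas as pd

# Hypothetical toy rows. In the real data, names given to fewer than
# 5 babies are absent, so any such total is a lower bound.
names = pd.DataFrame({
    "name":     ["Mary", "Anna", "John", "Mary", "John"],
    "gender":   ["F", "F", "M", "F", "M"],
    "n_births": [70, 30, 90, 80, 110],
    "year":     [1880, 1880, 1880, 1881, 1881],
})

# Sum the births over all names and years, separately for each gender.
totals = names.groupby("gender")["n_births"].sum()
print(totals)
```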

In [7]:
In [8]:
In [9]:

The percentages of both the two most popular names and the top 10 names decrease over time. Diversification of names plays a significant role in these plots. Even so, most newly introduced names stay niche: as the entropy distribution will make clearer, names are most evenly distributed around the 1960s, and after that, even if the diversification stays the same, the distribution looks more like a power law than a uniform distribution.
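One way to compute the share of the top k names per year and gender is sketched below. The helper `top_k_share` is a hypothetical name introduced here, and the data is made up; only the column names come from the tables in this post.

```python
import pandas as pd

# Hypothetical toy data: a concentrated year (1900) and an even one (2000).
names = pd.DataFrame({
    "name":     ["Mary", "Anna", "Emma",
                 "Emily", "Hannah", "Madison", "Ashley", "Sarah"],
    "gender":   ["F"] * 8,
    "n_births": [50, 30, 20, 20, 20, 20, 20, 20],
    "year":     [1900] * 3 + [2000] * 5,
})

def top_k_share(df, k):
    """Percent of each (year, gender) group's births held by its k largest names."""
    keys = ["year", "gender"]
    total = df.groupby(keys)["n_births"].sum()
    top = (df.sort_values("n_births", ascending=False)
             .groupby(keys)
             .head(k)                       # k most popular rows per group
             .groupby(keys)["n_births"]
             .sum())
    return 100 * top / total

top1 = top_k_share(names, 1)
top2 = top_k_share(names, 2)
print(top1)
print(top2)
```

In the toy data, the 1900 group is dominated by one name while the 2000 group is spread evenly, so the top-k share drops between the two years even though no single name disappears.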

In [10]:

Around the Depression years, the number of distinct names even decreases, but starting from the 1960s, the number of names consistently increases over time. The diversification of names could be attributed to a number of reasons: better communication tools let people learn about different names; entertainment (TV and the Internet) makes people aware of different names; some people want to give more distinct names out of individuality; and immigration may diversify the pool of names (e.g. Spanish, Chinese and Indian names).
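Counting distinct names per year is a short group-by. The sketch below uses invented rows and assumes the column names from the tables in this post. Whether the original plot counted a name shared by both genders once or twice is not stated, so both variants are shown.

```python
import pandas as pd

# Hypothetical toy data; "Jordan" is given to both genders in 1960.
names = pd.DataFrame({
    "name":     ["Mary", "John", "Mary", "John", "Linda", "Jordan", "Jordan"],
    "gender":   ["F", "M", "F", "M", "F", "F", "M"],
    "n_births": [70, 90, 60, 80, 40, 10, 12],
    "year":     [1930, 1930, 1960, 1960, 1960, 1960, 1960],
})

# Distinct names per year; a name used for both genders counts once.
n_distinct = names.groupby("year")["name"].nunique()

# Distinct names per year and gender; a shared name counts once per gender.
n_distinct_by_gender = names.groupby(["year", "gender"])["name"].nunique()

print(n_distinct)
print(n_distinct_by_gender)
```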

In [13]:
<ggplot: (287663801)>

The first thing to take away from this graph is that the percentage of the most popular name decreases over time. This implicitly suggests that names diversify over time, which the graph of distinct names per year also shows. Second, there appears to be a trend in the names except in very recent years. This could be explained in two ways. First, since the percentages of names decrease over time, the difference between the first and second most popular names gets smaller and smaller, so the most popular name cannot keep the lead for long. Second, since name diversification has already taken place, people may not really care about the "trend" anymore. The first explanation seems more likely to be correct, as individual names do show a trend over time when we look at the faceted plots in a bit.

Individual Names

In the early years there were a lot of Marys, and although Mary loses first place to many names over the years, no name's percentage ever reaches that of Mary's "golden age" (until 1930). Then Linda comes along and takes first place, but after Linda (around 1950) Mary takes first place again, only to lose it to Lisa this time. Each name's share consistently decreases over time after its initial popularity. Therefore, one may argue that trends play a role. The diversification of names also plays a role in the percentage decrease, as more and more names are given to babies.

In [14]:
<ggplot: (287531909)>

Unsurprisingly, the male names show a percentage distribution over time similar to the female names, and they also diversify over the years. Naming conventions seem to show an interesting pattern independent of gender.

One thing can be concluded quickly from these two graphs. If you look at the number of babies over time, you see the Baby Boomer effect around 1960. But you cannot see a corresponding increase in the share of the popular names, which suggests that Baby Boomer parents did not favor a particular name over others. Otherwise, given the number of babies, they would most probably have changed the percentage distribution of the popular names.

In [15]:
          name  gender  n_births  year     rank  pct_of_names
1452580  Bugra       M         5  2003  11819.5      0.000253

Yes! Bugra does exist! Not very popular, though.

In [16]:
             name  gender  n_births  year    rank  pct_of_names
1696479  Khaleesi       F        28  2011  4974.0      0.001598
1726874  Khaleesi       F       146  2012  1514.5      0.008334
1760039  Khaleesi       F       241  2013  1022.0      0.013877

Khaleesi is definitely more popular than my name, and it seems to be getting more popular.
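The rank and pct_of_names columns in the tables above could be derived along the following lines. This is a sketch under assumptions, not the original code: the half ranks (such as 11819.5) suggest that ties receive their average rank, and pct_of_names is assumed to be a percentage of the births for the same year and gender. The rows are invented.

```python
import pandas as pd

# Hypothetical toy data for a single year.
names = pd.DataFrame({
    "name":     ["Jacob", "Michael", "Bugra", "Zed", "Emily", "Emma"],
    "gender":   ["M", "M", "M", "M", "F", "F"],
    "n_births": [60, 30, 5, 5, 75, 25],
    "year":     [2003] * 6,
})

grouped = names.groupby(["year", "gender"])["n_births"]

# Rank 1 is the most popular name; tied names share the average of their ranks.
rank = grouped.rank(method="average", ascending=False)

# Share of the group's births, in percent (assumed unit).
pct = 100 * names["n_births"] / grouped.transform("sum")

names = names.assign(rank=rank, pct_of_names=pct)

# Look up a single name, as in the tables above.
print(names[names["name"] == "Bugra"])
```

In the toy data, Bugra and Zed tie for third and fourth place among the boys, so both get rank 3.5.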

When we look at the common names for both genders (names whose percentage reached at least 1 percent of the total population at any point), we get 64 names for boys and 101 names for girls. Over time, girl names seem to be more diverse.
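The "at least 1 percent at any point" filter can be sketched as a peak-value test per name. The snippet assumes that pct_of_names is expressed in percent, so the threshold is 1.0; the rows are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data with a precomputed pct_of_names column (in percent).
names = pd.DataFrame({
    "name":         ["Mary", "Mary", "Linda", "Linda", "Khaleesi", "John", "John"],
    "gender":       ["F", "F", "F", "F", "F", "M", "M"],
    "year":         [1900, 2000, 1900, 1950, 2013, 1900, 2000],
    "pct_of_names": [5.2, 0.3, 0.1, 4.6, 0.014, 6.0, 1.1],
})

# Highest share each (gender, name) pair ever reached.
peak = names.groupby(["gender", "name"])["pct_of_names"].max()

# Keep names that reached at least 1 percent in some year.
common = peak[peak >= 1.0]

# Number of such names per gender.
n_common = common.groupby(level="gender").size()
print(n_common)
```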

In [18]:

The y scale is not the same across plots (some names may be more common even if the plots look similar). I will replicate the same graphs properly at the end of the post, with x-ticks and y-ticks.

In [19]:

Most of the male names show patterns similar to the female names: they become popular and then their popularity decreases over time. A few, like Thomas, have two peaks: their popularity rises and then, years later, reaches a similar level again, but this is rare.

In [24]:

Entropy measures how closely the distribution of names for a given year and gender resembles the uniform distribution. Although it partly follows the number of distinct names, after the 1960s it starts to decrease, which suggests that the niche names and the popular names (as a whole) increase their share.
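For reference, here is a minimal sketch of the entropy calculation, assuming Shannon entropy in bits, H = -sum_i p_i log2(p_i), where p_i is the share of name i within one year and gender. The original post does not state which base or normalization was used, and the data below is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: an even distribution and a skewed one.
names = pd.DataFrame({
    "name":     ["Mary", "Anna", "Emma", "Ruth"] * 2,
    "gender":   ["F"] * 8,
    "n_births": [25, 25, 25, 25, 85, 5, 5, 5],
    "year":     [1960] * 4 + [2000] * 4,
})

def shannon_entropy(counts):
    """Shannon entropy in bits of a series of positive counts."""
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

entropy = (names.groupby(["year", "gender"])["n_births"]
                .apply(shannon_entropy))
print(entropy)
```

With n names, the entropy reaches its maximum log2(n) when every name has the same share, and it falls as a few names take a larger share. Dividing by log2(n) would give a value between 0 and 1 that is easier to compare across years with different numbers of names.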

What is next?

In the second post, I will look at whether there are distinct patterns per state for the names. Social Security publishes the names by state as well. Maybe an interactive map with a slider for years.

Plots with Y and X ticks

I reproduce the individual name plots from above here for the sake of correctness. Here, the y labels are the percentages of names, which vary across different names, and the x labels correspond to years.

In [27]: