Data Science | Joe Creager

An interesting question was asked the other day about the apparent decline in the frequency of the word I in the English-language books included in the Google Books Ngram Viewer. From the 1900s to the 1960s, the use of the word I appeared to be in steady decline. By 1965, its frequency had seemingly fallen 49.5% from its previous peak in 1901.
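The percentage quoted above is a peak-to-trough change in relative frequency. A minimal sketch of that arithmetic follows. The function name and the sample values are hypothetical placeholders chosen for illustration, not figures exported from the Ngram Viewer.

```python
def percent_decline(peak: float, trough: float) -> float:
    """Return how far `trough` sits below `peak`, as a percentage of `peak`."""
    if peak <= 0:
        raise ValueError("peak must be positive")
    return (peak - trough) / peak * 100.0


# Hypothetical relative frequencies (share of all words printed in a given year).
# These are placeholders, not values taken from the Ngram Viewer.
peak_frequency = 0.0100
trough_frequency = 0.00505

decline = percent_decline(peak_frequency, trough_frequency)
print(f"{decline:.1f}% below the peak")
```

Note that the quantity being compared is a share of all words in the corpus for each year, not a raw count, which matters for the explanation developed later in this post.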

The plot above is generated by the Google Books Ngram Viewer, which charts how often a word or short phrase appears in Google Books by year. Three things stand out in this data. First, the use of the word I in English was roughly constant for 100 years. Second, its use then declined steadily for 60 years. Third, almost as soon as the use of I reached its lowest point, the trend reversed, and the recovery happened about twice as fast as the decline.

What happened? Why did I fall out of favor and then come back? Perhaps Google’s optical character recognition algorithms are simply unable to differentiate between I and 1 or lowercase l due to the typography that was popular during that time? Perhaps the English grammar in published works simply declined from 1901 to 1960 and people began using me instead (e.g. me and my friends). Of course, if that were true it would mean we have entered a golden era of English grammar in published works, which seems unlikely.

If either of the scenarios presented above is true, we should see the use of the word me either stay the same or increase considerably. However, as you can see in the ngram plot above for the word me, the trend is nearly identical to that of the word I. Perhaps the public education system simply began to discourage writing in the first person from 1900 until 1960, and then reversed that policy from the 1970s onwards? If this were true, we should see a complementary increase in third-person pronouns such as he and she.

Oddly enough, the ngram plot for he and she shows a similar trend to I and me. The word she is less common in general, but it saw a decline starting in 1940 and a reversal of that trend in the 1970s. The frequency of the word he declined sharply starting in 1900, just like the word I, and did not begin to recover until the late 1990s. It is safe to say that the decline in the frequency of I and me was not caused by public school policies. If it were, we should have observed an increase in the pronouns associated with third-person writing. Instead we observed the exact opposite.

It might feel like we are at a loss for explanation at this point. However, there is still something to take away from the ngrams that we have examined so far. For example, since I, me, she, and he all declined in frequency and then became more frequently used around the same time, it is reasonable to assume that those words were simply displaced. What if these plots do not tell us about the fall and rise of I, but rather the rise and fall of something else altogether?

The ngram plot above shows the frequency of the phrases technical manual, technical publication, and repair manual. These phrases are far less common than the words I, me, he, she, and her. However, the increase in their use is a reminder of why it is important to understand the dataset that you are working with.

Beyond being library-like, the evolution of the corpus throughout the 1900s is increasingly dominated by scientific publications rather than popular works.[ref]http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0137041[/ref]

The Google Books Ngram corpus does not represent the true popularity of a word or phrase. Rather, the Ngram corpus is composed of books mostly provided by libraries. In addition, the corpus contains one copy of each book, giving equal weight to popular and obscure publications. Rather than representing the frequency of words or phrases in the popular vernacular, Google Books Ngrams represents what university libraries choose to include in their collections.

Unfortunately, the Ngram corpus does not contain any information about the popularity of the publication such as published copies or copies sold for the books that the ngrams were extracted from. There is also no obvious way to relate the ngrams back to their original publications using the published corpus.
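To see why counting each book once matters, consider a toy corpus of two books. The sketch below compares the frequency of a word when every title counts once against the frequency when titles are weighted by copies sold. Everything here is hypothetical: the `Book` class, the titles, the word counts, and especially the `copies_sold` field, which is exactly the information the real corpus lacks.

```python
from dataclasses import dataclass


@dataclass
class Book:
    title: str
    word_count: int     # total words in the book
    target_count: int   # occurrences of the word of interest
    copies_sold: int    # NOT available in the Ngram corpus; assumed here


def frequency(books, weight_by_copies=False):
    """Relative frequency of the target word across `books`."""
    total_words = 0
    total_hits = 0
    for book in books:
        # With weight 1, a bestseller and an obscure manual count equally.
        weight = book.copies_sold if weight_by_copies else 1
        total_words += weight * book.word_count
        total_hits += weight * book.target_count
    return total_hits / total_words


# Hypothetical titles and numbers, chosen only to make the contrast visible.
books = [
    Book("hypothetical popular novel", 100_000, 2_000, copies_sold=1_000_000),
    Book("hypothetical repair manual", 100_000, 0, copies_sold=1_000),
]

one_copy_each = frequency(books)
weighted = frequency(books, weight_by_copies=True)
print(f"one copy of each book: {one_copy_each:.4f}")
print(f"weighted by copies:    {weighted:.4f}")
```

In this toy setup, the one-copy-each figure is about half the weighted figure, because the manual counts as much as the novel even though far fewer people read it.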

Although the Ngram corpus is not useful for analyzing the popularity of words and phrases, it does provide interesting insight into trends in what libraries choose to collect. As for the fall and rise of I, the likely cause is displacement: as librarians added more scientific and technical publications to their collections, the relative frequency of the word fell. These publications rarely use the word I, or he, she, him, her, you, or me for that matter.
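The displacement argument can be checked with a little arithmetic. The sketch below is a toy model, not a reconstruction of the real corpus: the function name, the book counts, and the per-book rates are all assumptions. It holds the narrative holdings and their pronoun rate fixed, then adds technical books that never use the pronoun.

```python
def pronoun_share(narrative_books, technical_books,
                  words_per_book=100_000,
                  narrative_rate=0.02,
                  technical_rate=0.0):
    """Share of all words in the collection that are the pronoun of interest."""
    pronoun_words = words_per_book * (
        narrative_books * narrative_rate + technical_books * technical_rate
    )
    total_words = words_per_book * (narrative_books + technical_books)
    return pronoun_words / total_words


# Hypothetical collection: narrative holdings stay flat, technical holdings grow.
for technical in (0, 500, 1000):
    share = pronoun_share(narrative_books=1000, technical_books=technical)
    print(f"{technical:>5} technical books -> pronoun share {share:.4f}")
```

Under these assumptions, the pronoun share is halved once technical books make up half the collection, even though no narrative author changed how they write.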