Google Ngram Viewer provides searchable dataset of books



Want to know when ‘hep cat’ entered the popular lexicon? Or when ‘dying of consumption’ fell out of literary use? Or which of three former presidents -- Abraham Lincoln, George Washington or Thomas Jefferson -- made the most appearances in print in a given decade? (Turns out Washington surpassed Lincoln sometime around 1928 and has remained in the lead ever since.)

Google’s latest data-visualization tool, Ngram Viewer, allows the curious to search through datasets of 500 billion words from 5.2 million books in Chinese, English, French, German, Russian and Spanish to determine the approximate frequency with which sets of up to three words or phrases have appeared from year to year. Users can search the data using the viewer tool or freely download the datasets for their own use.
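
For readers inclined to tinker, the year-by-year frequency the viewer charts can be sketched roughly as follows. This is a minimal Python illustration run on a made-up three-book corpus, not Google's actual pipeline or file format; the function names and the sample texts are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_frequency_by_year(books, phrase):
    """Share of all same-length n-grams in each year that match `phrase`.

    `books` is an iterable of (year, text) pairs. This toy format is an
    assumption for illustration, not the layout of the downloadable datasets.
    """
    target = tuple(phrase.lower().split())
    n = len(target)
    hits = Counter()    # matches of the phrase, per year
    totals = Counter()  # all n-grams of the same length, per year
    for year, text in books:
        grams = ngrams(text.lower().split(), n)
        totals[year] += len(grams)
        hits[year] += sum(1 for g in grams if g == target)
    # Skip years with no n-grams to avoid dividing by zero.
    return {y: hits[y] / totals[y] for y in sorted(totals) if totals[y]}


# Hypothetical sample data: (publication year, text of the book).
books = [
    (1940, "that hep cat can really play"),
    (1940, "the cat sat by the door"),
    (1950, "nobody says hep cat anymore"),
]
freq = ngram_frequency_by_year(books, "hep cat")
for year, share in freq.items():
    print(year, round(share, 3))
```

Dividing by the total number of same-length phrases in each year, rather than reporting raw counts, keeps years with more published text from dominating the chart.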


The datasets backing the Ngram Viewer are a subset of the more than 15 million books Google has digitized since 2004.

‘We know nothing can replace the balance of art and science that is the qualitative cornerstone of research in the humanities,’ wrote Google Books engineering manager Jon Orwant on the company’s blog. ‘But we hope the Google Books Ngram Viewer will spark some new hypotheses ripe for in-depth investigation, and invite casual exploration at the same time.’

-- Abby Sewell