Books, hard at work.
Perhaps you've heard of Google's attempt to digitize books and make them available to the public. There has been controversy about this project for a number of years (mainly surrounding copyright issues), but so far, Google Books has managed to scan 12% of all books published (approximately 15 million books). On the surface that may not sound like much, but it is an amazing feat. The majority of books have been made available from university libraries around the world. The works span the history of publishing (from the 1500s to the present) and represent many cultures and languages (English, French, Spanish, German, Chinese, Russian, and Hebrew).
In a recent issue of Science, a number of scientists and The Google Books Team published one of the first evaluations of "the corpus." The article describes a series of statistical analyses that provide insight into the evolution of grammar, cultural memory, censorship and other issues in "culturomics".
As an example, the team was able to track how irregular verbs become regularized over time. The past tense of regular verbs is generated by adding -ed (jump/jumped), but irregular verbs are idiosyncratic (sing/sang or burn/burnt). Some of these verbs fall into families, and the analyses indicate that the families behave differently. Some change rapidly through time, others don't, and some are strongly influenced by the country in which they are used.
Okay, that may sound a bit boring, but think of the last time you picked up and read an older book. The language is probably a bit different from that in contemporary books; the older a book is, the more peculiar the words may seem. It may take a few pages, but eventually you get used to the oddities and can read without too much trouble. Our language (and how we express our ideas) is changing all the time; books provide insight into these changes.
To request the January 14 issue of Science, click here. (You will have to scroll down the page to find the correct issue.)
To view the Science web page and a summary of the article and other discussions of the project, click here.