In case you hadn’t heard, J.K. Rowling was recently discovered to be writing crime novels under the pseudonym Robert Galbraith. Language Log has a guest post by Patrick Juola, one of the computational/forensic linguists who found that Galbraith’s novel was statistically more similar to Rowling’s The Casual Vacancy than to several other crime novels.
Language is a set of choices, and speakers and writers tend to fall into habitual, or at least common, choices. Some choices come from dialect (the reason an Englishman drives a lorry but an American a truck), some from social pressure (if I need to impress someone with my vocabulary, I can utilize a polysyllabic lexicon instead of just using big words), and some just seem to come. An example of the latter category is in the use of many function words. If you ask yourself where the salad fork is relative to the plate, you quickly realize that it’s usually to the left of the plate. Or is it? It’s just as likely to be “on” the left of the plate, “at” the left of the plate, or perhaps “to” the left SIDE of the plate. Same fork, same position, and at least four different choices for how to describe it, none of which correspond to any sociolinguistic or cognitive variable with which I’m familiar.
But what we do know is that much of this apparently free variation is actually rather static at least at an individual level. So by studying examples of documents a person has written, we can build a model of the kind of choices that person makes.[…] Mosteller and Wallace studied the writing styles of The Federalist Papers in the mid-60s and showed, for example, that Alexander Hamilton never used the word “whilst” but that James Madison never used the word “while.” More interestingly, they both used the word “by,” but Madison consistently used it twice as often.
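To make the idea of a per-author "model of choices" concrete, here is a minimal sketch (mine, not Juola's or Mosteller and Wallace's actual method) that counts how often a handful of function words appear per 1,000 words of a text. The word list, the crude tokenizer, and the function name `function_word_rates` are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical marker words, taken from the examples in the excerpt above.
FUNCTION_WORDS = ("by", "while", "whilst", "to", "on", "at")


def function_word_rates(text, words=FUNCTION_WORDS, per=1000):
    """Return the rate of each word in `words` per `per` tokens of `text`."""
    # Crude tokenizer: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    if total == 0:
        return {w: 0.0 for w in words}
    counts = Counter(tokens)
    # Counter returns 0 for words that never occur, so absent words get 0.0.
    return {w: per * counts[w] / total for w in words}


if __name__ == "__main__":
    sample = "The fork is to the left of the plate while the knife sits by the spoon"
    for word, rate in function_word_rates(sample).items():
        print(f"{word}: {rate:.1f} per 1000 words")
```

Comparing these rates across known samples from two authors would show the kind of contrast described above: a word one author never uses, or one that an author uses about twice as often as another.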
I was approached by a reporter, Cal Flyn, from the Sunday Times, to assess this kind of variation in the writings of “Robert Galbraith,” a first-time novelist and author of The Cuckoo’s Calling. (I learned later from the papers that the paper had received an anonymous tip via Twitter that Galbraith was the pen name of J.K. Rowling. And in retrospect there were a lot of other clues as well. For example, Galbraith apparently was surprisingly good at describing women’s clothing, possibly suggesting a female author.) Would I be willing to look into this? I said yes, of course, but with a couple of conditions. First, I needed clean (machine readable) copies of Cuckoo, and clean samples of something comparable undisputedly by Rowling herself. Secondly, I needed other comparable samples from other writers (distractor authors, to use the common term) to assess the degree of variation.
For the past ten years or so, I’ve been working on a software project to assess stylistic similarity automatically, and at the same time, test different stylistic features to see how well they distinguish authors. […] First, most people are average in word length, just as most people are average in height. Very few people actually write using loads of very long words, and few write with very small words, either. Second, you learn that average word length isn’t necessarily stable for a given author. Writing a letter to your cousin will have a different vocabulary than a professional article to be published in Nature. So it works, but not necessarily well. A better approach is not to use average word length, but to look at the overall distribution of word lengths. Still better is to use other measures, such as the frequency of specific words or word stems (e.g., how often did Madison use “by”?), and better yet is to use a combination of features and analyses, essentially analyzing the same data with different methods and seeing what the most consistent findings are. That’s the approach I took. (Read the rest at Language Log)
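As a rough illustration of the word-length-distribution idea, the sketch below turns each text into a vector of word-length proportions and ranks candidate authors by cosine distance from the disputed text. This is not Juola's software; the function names, the cap at 12 characters, and the choice of cosine distance are assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Crude tokenizer: lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def word_length_distribution(text, max_len=12):
    """Proportion of words of length 1..max_len (longer words are capped)."""
    tokens = tokenize(text)
    counts = Counter(min(len(t), max_len) for t in tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty text
    return [counts[n] / total for n in range(1, max_len + 1)]


def cosine_distance(a, b):
    """1 minus cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def rank_candidates(disputed, candidates, feature=word_length_distribution):
    """Sort candidate authors from most to least similar to the disputed text.

    `candidates` maps an author name to a sample of that author's writing.
    """
    target = feature(disputed)
    scores = {name: cosine_distance(target, feature(sample))
              for name, sample in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1])
```

Passing a different `feature` function (for example, a vector of function-word rates) and checking whether the rankings agree is one simple way to approximate the "same data, different methods" approach described in the excerpt. As with the real analysis, the output is only a ranking of relative similarity among the supplied candidates.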
Juola stresses that this type of analysis can only show that different pieces of writing are more or less similar to each other, not that a certain person was definitely the author of a particular text, but it was evidently enough to convince Rowling to admit to the pseudonym.
This type of analysis reminds me of a news story a while back showing that people who get along with each other are more likely to use similar frequencies of function words, and that this relationship holds both for famous correspondents like Freud and Jung and for modern couples on speed dates.