Academic journal article
By Peng, Roger D.; Hengartner, Nicolas W.
The American Statistician, Vol. 56, No. 3
It is often recognized that authors have inherent literary styles which serve as "fingerprints" for their written works. Thus, in principle, one should be able to determine the authorship of unsigned manuscripts by carefully analyzing the style of the text. The difficulty lies in characterizing the style of each author, that is, determining which sets of features in a text most accurately summarize an author's style. When doing a quantitative or statistical analysis of literary style, the problem is finding adequate numerical representations of an author's inherent style.
Quantitative literary style analysis presents a unique opportunity to introduce and motivate many standard multivariate techniques. It is possible to view each text as a collection of multivariate observations, in which case we are immediately faced with the inherent difficulties of analyzing high-dimensional data. The usual questions are relevant: How can we visualize the data? What are the significant features? Are there any interesting structures? In this situation we also have the benefit of being able to rely on some immediate knowledge of the subject matter to analyze and understand the data. Traditional multivariate methods can then be used to contrast and compare the styles of several authors and possibly assign authorship.
1.1 Previous Work
There has been much work covering different aspects of this field. For a comprehensive review we direct the reader to Holmes (1985). Many early attempts to quantify style relied on concordances, or inventories of the frequency of every word in a text. In 1901, T. C. Mendenhall reduced the concordances of Shakespeare and Bacon to distributions of word lengths and plotted these distributions as graphs. His so-called "characteristic curves" serve as an early example of the use of graphics in distinguishing authorship. Mendenhall examined the differences in the shapes of the curves (such as the location of the mode) and suggested that it was unlikely that Bacon wrote any of Shakespeare's disputed works. However, C. B. Williams reproduced some of Mendenhall's curves and noted that Mendenhall's conclusions may have been too strong. In fact, there was little evidence for or against the theory that some works written by Shakespeare could have been written by Bacon (Williams 1975). Brinegar (1963) also used word length distributions to determine if Mark Twain had written the Quintus Curtius Snodgrass (QCS) letters. He used chi-square tests and two-sample t-tests on the counts of two-, three-, and four-letter words to check the agreement of the QCS letters with Twain's known writings. Thisted and Efron (1987) used the idea of vocabulary richness to determine the possibility of Shakespearean authorship of a newly discovered poem. They based their analysis of the poem on the rate of "discovery" of new words given the number of distinct words previously observed in the Shakespearean canon. Holmes (1992), in an example of the use of a standard multivariate analysis technique, used hierarchical cluster analysis to detect changes in authorship in Mormon scripture. He also used various measures of vocabulary richness to conduct his analysis.
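The word-length approach described above can be illustrated with a minimal Python sketch. It is not the procedure of Mendenhall or Brinegar. It only shows one plausible way to tabulate word lengths, locate the mode of the resulting curve, and compute a Pearson chi-square statistic comparing two texts. The function names, the regular expression used to find words, and the choice to pool long words into a single category are all assumptions made for illustration.

```python
import re
from collections import Counter


def word_length_counts(text, max_len=10):
    """Count words by length (1..max_len).

    Assumption: a "word" is a run of ASCII letters, and words longer
    than max_len are pooled into the max_len category.
    """
    words = re.findall(r"[A-Za-z]+", text)
    counts = Counter(min(len(w), max_len) for w in words)
    return [counts.get(k, 0) for k in range(1, max_len + 1)]


def mode_length(counts):
    """Return the word length with the highest count (ties go to the shortest)."""
    return counts.index(max(counts)) + 1


def chi_square_homogeneity(counts_a, counts_b):
    """Pearson chi-square statistic for a 2 x k table of counts.

    Returns (statistic, degrees_of_freedom). Categories that are empty
    in both texts are dropped before the statistic is computed.
    """
    cols = [(a, b) for a, b in zip(counts_a, counts_b) if a + b > 0]
    n_a = sum(a for a, _ in cols)
    n_b = sum(b for _, b in cols)
    if n_a == 0 or n_b == 0:
        raise ValueError("each text must contribute at least one word")
    n = n_a + n_b
    stat = 0.0
    for a, b in cols:
        col_total = a + b
        expected_a = n_a * col_total / n
        expected_b = n_b * col_total / n
        stat += (a - expected_a) ** 2 / expected_a
        stat += (b - expected_b) ** 2 / expected_b
    return stat, len(cols) - 1
```

To mimic a comparison restricted to two-, three-, and four-letter words, one could pass only those three entries of each count vector, for example `chi_square_homogeneity(counts_a[1:4], counts_b[1:4])`. The statistic would then be referred to a chi-square distribution with the returned degrees of freedom.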
There is no general agreement on the unit of analysis that should be used in authorship studies. In the previously mentioned examples, word length and vocabulary richness were the units used. Williams (1940) analyzed the sentence lengths of works written by Chesterton, Wells, and Shaw. He noticed that the log of the number of words per sentence appeared to follow a normal distribution. Morton (1965) also used sentence length in his analysis of ancient Greek texts. After initially using criteria such as word length and sentence length, Mosteller and Wallace (1963) focused on using function word counts to discriminate between the works of Hamilton and Madison in their seminal analysis of the Federalist Papers (see also Mosteller and Wallace 1964). …
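As a rough illustration of two of the units of analysis mentioned above, the following Python sketch computes the logarithm of the number of words per sentence and the rate of selected function words per 1,000 words. It is a sketch under stated assumptions, not the method of Williams or of Mosteller and Wallace: sentences are split naively on terminal punctuation, words are runs of letters and apostrophes, and the function words passed in are chosen by the user for illustration.

```python
import math
import re


def log_sentence_lengths(text):
    """Return the natural log of the word count of each non-empty sentence.

    Assumption: sentences end at '.', '!' or '?'. Abbreviations and
    quoted speech are not handled.
    """
    sentences = re.split(r"[.!?]+", text)
    lengths = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]
    return [math.log(n) for n in lengths if n > 0]


def function_word_rates(text, function_words, per=1000):
    """Return occurrences of each function word per `per` words of text.

    Assumption: matching is case-insensitive and on whole words only.
    """
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    total = len(words)
    if total == 0:
        raise ValueError("text contains no words")
    return {fw: per * words.count(fw.lower()) / total for fw in function_words}
```

A histogram or normal quantile plot of the values returned by `log_sentence_lengths` is one way to check informally whether they look approximately normal, and the dictionaries returned by `function_word_rates` for several texts can be stacked into the kind of multivariate observations discussed earlier.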