Statistics in Sociology, 1950-2000
Raftery, Adrian E., Journal of the American Statistical Association
Sociology is the scientific study of modern industrial society. Example questions include: What determines how well people succeed in life, occupationally and otherwise? What factors affect variations in crime rates between different countries, cities, and neighborhoods? What are the causes of the increasing U.S. divorce rate? What are the main factors driving fertility decline in developing countries? Why have social revolutions been successful in some countries but not in others?
The roots of sociology go back to the mid-nineteenth century and to seminal work by Auguste Comte, Karl Marx, Max Weber, and Emile Durkheim on the kind of society newly emerging from the industrial revolution. Sociology has used quantitative methods and data from the beginning, but before World War II the data tended to be fragmentary and the statistical methods simple and descriptive.
Since then, the available data have grown in complexity, and statistical methods have been developed to deal with them, with the sociologists themselves often leading the way (Clogg 1992). The trend has been toward increasingly rigorous formulation of hypotheses, larger and more detailed datasets, more complex statistical models to match the data, and a higher level of statistical analysis in the major sociological journals.
Statistical methods have had a successful half-century in sociology, contributing to a greatly improved standard of scientific rigor in the discipline. Sociology has made use of a wide variety of statistical methods and models, but I focus here on the ones developed by sociologists, motivated by sociological problems, or first published in sociological journals. I distinguish three postwar generations of statistical methods in sociology, each defined by the kind of data it addresses. The first generation of methods, starting after World War II, deals with cross-tabulations of counts from surveys and censuses by a small number of discrete variables such as sex, age group, and occupational category; social mobility tables provide a canonical example. Schuessler (1980) gave a survey that largely reflects this first-generation work.
The second generation, starting in the early 1960s, deals with unit-level data from surveys that include many variables. This generation was galvanized by Blau and Duncan's (1967) highly influential book The American Occupational Structure, and by the establishment of Sociological Methodology in 1969 and Sociological Methods and Research in 1972 as publication outlets. These developments marked the coming of age of research on quantitative methodology in sociology. The third generation of methods, starting in the late 1980s, deals with data that are not usually thought of as cross-tabulations or data matrices, either because the data take different forms, such as texts or narratives, or because dependence is a crucial aspect. These generations do not have clear starting points and all remain active today; like real generations, they overlap.
Today, much sociological research is based on the reanalysis of large high-quality survey sample datasets, usually collected with public funds and publicly available to researchers, with typical sample sizes in the range of 5,000-20,000. This has opened the way to easy replication of results and has helped produce standards of scientific rigor in sociology comparable to those in many of the natural sciences. Social statistics is expanding rapidly as a research area, and several major institutions have recently launched initiatives in this area.
1. THE FIRST GENERATION: CROSS-TABULATIONS
1.1 Categorical Data Analysis
Initially, much of the data that quantitative sociologists had to work with came in the form of cross-classified tables, and so it is not surprising that this is perhaps the area of statistics to which sociology has contributed the most. A canonical example has been the analysis of social mobility tables, two-way tables of father's against respondent's occupational category; typically the number of categories used is between 5 and 17. …