Violin Plots: A Box Plot-Density Trace Synergism

Article excerpt

1. INTRODUCTION

Many different statistics and graphs summarize the characteristics of single batches of data. Descriptive statistics give information about location, scale, symmetry, and tail thickness. Other statistics and graphs investigate extreme observations or study the distribution of data values. Diagrams such as stem-leaf plots, dot plots, box plots, histograms, density traces, and probability plots give information about the distribution of values assumed by all observations.

The violin plot, introduced in this article, synergistically combines the box plot and the density trace (or smoothed histogram) into a single display that reveals structure found within the data. The introduction of this new graphical tool begins with a quick overview of the combination of the box plot and density trace into the violin plot. Then, three illustrations and examples show the advantages and challenges of violin plots in data summarization and exploration.

2. COMPONENT PARTS OF VIOLIN PLOTS

The violin plot, as depicted in Figure 1 and implemented in NCSS (1997) statistical software, combines the box plot and density trace into one diagram. The name violin plot originated because one of the first analyses that used the envisioned procedure resulted in a graphic with the appearance of a violin. Violin plots add information to the simple structure of the box plot that Tukey (1977) initially conceived. Although these original graphs are easily drawn with pencil and paper, computers ease subsequent modifications, refinements, and computation of box plots as discussed by McGill, Tukey, and Larsen (1978); Velleman and Hoaglin (1981); Chambers, Cleveland, Kleiner, and Tukey (1983); Frigge, Hoaglin, and Iglewicz (1989), and others.

Box plots show four main features about a variable: center, spread, asymmetry, and outliers. As an example, consider the box plot in Figure 1 for the data published by Hamermesh (1994). The ASA Statistical Graphics Section's 1995 Data Analysis Exposition analyzed these data, which report compensation of professors from all academic ranks in the United States. The labels in the diagram identify the principal lines and points that form the main structure of the traditional box plot diagram. As shown, the violin plot includes a box plot with two slight modifications. First, a circle replaces the median line, which facilitates quick comparisons when viewing multiple groups. Second, outside points, which are traditionally classified as mild and severe outliers, are not identified by individual symbols.
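
The four box plot features named above can be computed directly from a batch of data. The following Python sketch is illustrative only and is not taken from the article or from NCSS. The function name `box_plot_summary`, the "inclusive" quartile rule, the quartile-based asymmetry measure, and the 1.5 and 3 interquartile-range fences used to separate mild from severe outside points are all assumptions based on common practice; other quartile definitions give slightly different values.

```python
import statistics


def box_plot_summary(data):
    """Summarize center, spread, asymmetry, and outside points of a batch.

    Assumptions (not from the article): quartiles use the "inclusive"
    interpolation rule, and outside points are split by fences placed
    1.5 and 3 interquartile ranges beyond the quartiles.
    """
    values = sorted(data)
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1

    # Inner and outer fences (assumed multipliers: 1.5 and 3).
    inner_lo, inner_hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_lo, outer_hi = q1 - 3.0 * iqr, q3 + 3.0 * iqr

    # Whiskers end at the most extreme values still inside the inner fences.
    inside = [v for v in values if inner_lo <= v <= inner_hi]
    mild = [v for v in values
            if (outer_lo <= v < inner_lo) or (inner_hi < v <= outer_hi)]
    severe = [v for v in values if v < outer_lo or v > outer_hi]

    return {
        "center": median,
        "spread": iqr,
        # One simple asymmetry indicator: upper half-box minus lower half-box.
        "asymmetry": (q3 - median) - (median - q1),
        "quartiles": (q1, q3),
        "whiskers": (min(inside), max(inside)),
        "mild_outliers": mild,
        "severe_outliers": severe,
    }


if __name__ == "__main__":
    # Made-up batch with one mild and one severe outside point.
    sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 100]
    for name, value in box_plot_summary(sample).items():
        print(name, value)
```

In the violin plot, the two outlier lists would simply not be drawn as individual symbols, since the density trace already conveys the behavior of the tails.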

The density trace supplements traditional summary statistics by graphically showing the distributional characteristics of batches of data. One simple density estimator, the histogram, displays the distribution of data values along the real number line. Weaknesses of the histogram caused Tapia and Thompson (1979), Parzen (1979), Silverman (1986), Izenman (1991), and Scott (1992) to propose and summarize numerous alternative density estimators. One of these alternatives is the density trace described in Chambers, Cleveland, Kleiner, and Tukey (1983). Defining the location density $d(x \mid h)$ at a point $x$ as the fraction of the data values per unit of measurement that fall in an interval centered at $x$ gives

$$d(x \mid h) = \frac{\sum_{i=1}^{n} \delta_i}{nh}, \tag{1}$$

where $n$ is the sample size, $h$ is the interval width, and $\delta_i$ is one when the $i$th data value is in the interval $[x - h/2, x + h/2]$ and zero otherwise. To plot the density trace, first select a value for $h$ and then compute $d(x \mid h)$ on a dense grid of equally spaced $x$ values. Connect the resulting values of $d(x \mid h)$ with line segments. The shape of the $d(x \mid h)$ curve is driven largely by the interval width $h$: it is very smooth for large values of $h$ and "wiggly" for small values.
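
The plotting recipe just described (choose an interval width, evaluate the location density of Equation (1) on an evenly spaced grid, and join the points) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the NCSS implementation: the function names `density_trace` and `even_grid` and the sample values are hypothetical.

```python
def density_trace(data, h, grid):
    """Evaluate the location density of Equation (1) at each grid point.

    For each x, count the data values in [x - h/2, x + h/2] and divide
    by n * h, where n is the sample size and h is the interval width.
    """
    n = len(data)
    if n == 0:
        raise ValueError("data must not be empty")
    if h <= 0:
        raise ValueError("interval width h must be positive")
    half = h / 2.0
    trace = []
    for x in grid:
        # The indicator is one when the value lies in the closed interval.
        count = sum(1 for v in data if x - half <= v <= x + half)
        trace.append(count / (n * h))
    return trace


def even_grid(lo, hi, m):
    """Return m equally spaced points from lo to hi (hypothetical helper)."""
    if m < 2:
        raise ValueError("need at least two grid points")
    step = (hi - lo) / (m - 1)
    return [lo + k * step for k in range(m)]


if __name__ == "__main__":
    sample = [1.0, 2.0, 2.0, 3.0]  # made-up batch of data
    grid = even_grid(0.0, 4.0, 9)
    # A wide interval gives a smooth trace; a narrow one gives a wiggly trace.
    print(density_trace(sample, h=2.0, grid=grid))
    print(density_trace(sample, h=0.5, grid=grid))
```

Plotting the returned values against the grid and connecting them with line segments gives the density trace. Rerunning with several values of `h` shows the trade-off between smoothness and detail.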

Unfortunately, several density traces shown side by side are difficult to compare. Contrasting the distributions of several batches of data, however, is a common task. …