Observational Studies of Rare Events: A Subset Selection Approach

Article excerpt


The use of observational data has long posed a barrier in the context of selection procedures. Indeed, the literature focuses on selection as a design methodology rather than as an analytic tool for nonexperimental data; see Gibbons, Olkin, and Sobel (1977) for a survey of selection and ranking methodology. In experimental settings, sample sizes are controllable and typically identical across populations, which makes selection procedures easier to apply. The virtual exclusion of observational data from selection procedures can be considered a major drawback to their potential usefulness. In this article, we offer a step toward eliminating this drawback by examining a subset selection approach to the study of observational data involving rare events. Such problems are characterized by large, uncontrollable sample sizes and extremely low incidence rates. Both of these characteristics fall well outside the domain of most selection procedures, for which small sample sizes and incidence rates much nearer 1/2 are usually presumed throughout the development of "worst case" analyses (see, for example, Gupta and McDonald 1986).

Our work is motivated by problems arising in urban traffic safety improvement studies. To improve the safety of a traffic network, the traffic analyst periodically seeks out those locations that are inherently hazardous and identifies them as in need of further safety improvements. Regardless of the identification method used, traffic analysts generally agree that the accident rate--defined as the number of accidents per million vehicles--is an important measure of traffic hazard (see, for example, Laughlin, Haefner, Hall, and Clough 1975). Although hazard could also be measured in terms of accident volumes, accident rates treat individuals more equitably. For example, a heavily travelled intersection in a city might be associated with a larger number of total accidents, but a lower accident rate, than a less-travelled location in an outlying area. If site improvement efforts are allocated on the basis of accident volumes, higher priority is implicitly given to lowering accidents for individual drivers living in highly populated areas. Within the context of urban traffic safety planning, there is a tremendous benefit to be gained from being able to correctly identify an acceptably small set of locations that very likely contains at least the most hazardous location in the system.
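The distinction between accident volumes and accident rates can be made concrete with a small sketch. The two sites and their counts below are hypothetical, chosen only to illustrate the point; they are not taken from the data used in this article.

```python
def accident_rate(accidents, vehicles):
    """Accidents per million vehicles passing through a site."""
    return accidents * 1_000_000 / vehicles

# Hypothetical sites (illustrative numbers only): name -> (accidents, vehicles)
sites = {
    "downtown": (30, 20_000_000),  # heavily travelled city intersection
    "outlying": (6, 1_500_000),    # less-travelled outlying location
}

# Rank by raw accident volume versus by accident rate.
by_volume = max(sites, key=lambda name: sites[name][0])
by_rate = max(sites, key=lambda name: accident_rate(*sites[name]))

print("most accidents:", by_volume)
print("highest rate:  ", by_rate)
```

With these assumed figures the two criteria single out different sites: the city intersection has more accidents in total, whereas the outlying location has the higher rate per million vehicles.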

As evidenced by the data used in our numerical examples (see Higle and Witkowski 1988), the traffic hazard identification problem is characterized by extremely large and uncontrollable traffic volumes (i.e., sample sizes) that may differ radically from one location to another. Complicating matters even further, the accident rates are extremely low relative to traffic volumes. In practice, a site is typically identified as hazardous if its observed accident rate over some specified time period exceeds the upper limit of a one-sided confidence interval for the mean accident rate over all sites in the region under consideration. (See Laughlin et al. (1975) for a summary of such methods.) The inadequacies of such methods are discussed in Hauer and Persaud (1984) and Higle and Hecht (1989). Empirical Bayes procedures for identifying hazardous locations have also been proposed (Hauer and Persaud 1983, 1984; Higle and Hecht 1989; Higle and Witkowski 1988). As Higle and Hecht (1989) showed, these procedures consistently yield results virtually indistinguishable from those of procedures based solely on observational data, because the empirical Bayes estimates are predominantly influenced by the observational data. This differs from other applications that have been studied. For example, within the context of a school ranking problem, Laird and Louis (1989) indicated that empirical Bayes procedures can yield population ranks that differ from those based solely on observational data. …
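The confidence-limit rule described earlier in this paragraph can be sketched as follows. The excerpt does not give the exact form of the upper limit, so this sketch assumes one common variant: accident counts are treated as Poisson, the regional mean rate is pooled over all sites, and a normal approximation gives each site its own limit. The function name `flag_hazardous`, the 95% level, and the site data are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def flag_hazardous(sites, confidence=0.95):
    """Return names of sites whose observed rate exceeds an upper limit.

    sites: dict mapping name -> (accidents, vehicles).
    Assumed rule (not taken from the article): Poisson counts with a
    normal approximation, limit = mean_rate + z * sqrt(mean_rate / exposure),
    where exposure is measured in millions of vehicles.
    """
    total_accidents = sum(a for a, _ in sites.values())
    total_exposure = sum(v for _, v in sites.values()) / 1e6
    mean_rate = total_accidents / total_exposure  # pooled regional rate
    z = NormalDist().inv_cdf(confidence)          # one-sided critical value

    flagged = []
    for name, (accidents, vehicles) in sites.items():
        exposure = vehicles / 1e6
        upper_limit = mean_rate + z * sqrt(mean_rate / exposure)
        if accidents / exposure > upper_limit:
            flagged.append(name)
    return flagged

# Hypothetical data: name -> (accidents, vehicles)
SITES = {
    "site_A": (30, 20_000_000),
    "site_B": (6, 1_500_000),
    "site_C": (12, 10_000_000),
    "site_D": (4, 4_000_000),
}

print(flag_hazardous(SITES))
```

Under this assumed form the limit widens as traffic volume falls, so a low-volume site needs a much higher observed rate before it is flagged. This shows how strongly unequal sample sizes influence such rules.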