Iterative Automated Record Linkage Using Mixture Models
Larsen, Michael D., Rubin, Donald B., Journal of the American Statistical Association
The goal of record linkage is to link quickly and accurately records that correspond to the same person or entity. Whereas certain patterns of agreements and disagreements on variables are more likely among records pertaining to a single person than among records for different people, the observed patterns for pairs of records can be viewed as arising from a mixture of matches and nonmatches. Mixture model estimates can be used to partition record pairs into two or more groups that can be labeled as probable matches (links) and probable nonmatches (nonlinks). A method is proposed and illustrated that uses marginal information in the database to select mixture models, identifies sets of records for clerks to review based on the models and marginal information, incorporates clerically reviewed data, as they become available, into estimates of model parameters, and classifies pairs as links, nonlinks, or in need of further clerical review. The procedure is illustrated with five datasets from the U.S. Bureau of the Census. It appears to be robust to variations in record-linkage sites. The clerical review corrects classifications of some pairs directly and leads to changes in classification of others through reestimation of mixture models.
KEY WORDS: Administrative records; Census; Expectation-maximization; Expectation-conditional maximization; File matching; Latent-class models; Post-enumeration survey.
Record linkage entails comparing records in one or more data files when the records being compared arise either from a single person or from two different people. The linkage can be implemented to create a single data file from several files or to remove duplicate records from an existing single file. Manual clerical review nearly always can ascertain matching (same person) versus nonmatching (different people) pairs of records, but for large administrative databases, this process can be expensive and time consuming. The aim of automated linkage is to use computers to perform matching operations quickly and accurately.
Mixture models describe a population in terms of underlying subpopulations, which are not fully identified from observed data. When applied to record-linkage data, mixture model estimates can be used to partition the cases (pairs of records) into classes (e.g., linked record pairs, nonlinks), but not necessarily into classes that correspond to desired divisions of the data (e.g., matches, nonmatches). The accuracy of the partition is known when clerks determine which links and nonlinks are matches and which are nonmatches.
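As a rough illustration of how mixture model estimates can partition record pairs, the sketch below fits a two-class latent-class model to tabulated agreement patterns with the EM algorithm. It assumes binary agree/disagree comparisons and conditional independence of fields within each class, which is only the simplest of the models one might consider. The function names, starting values, and toy data are hypothetical and are not taken from the paper.

```python
from itertools import product

EPS = 1e-9  # keeps estimated probabilities strictly inside (0, 1)


def _clamp(x):
    return min(max(x, EPS), 1.0 - EPS)


def pattern_likelihood(pattern, probs):
    """P(pattern | class), assuming fields agree independently within a class."""
    like = 1.0
    for agrees, q in zip(pattern, probs):
        like *= q if agrees else 1.0 - q
    return like


def synthetic_pattern_counts(p, m, u, n):
    """Toy data: expected counts of each agreement pattern (not real Census data)."""
    counts = {}
    for g in product((0, 1), repeat=len(m)):
        prob = p * pattern_likelihood(g, m) + (1.0 - p) * pattern_likelihood(g, u)
        counts[g] = round(n * prob)
    return counts


def posteriors(pattern_counts, p, m, u):
    """Estimated probability that a pair with each pattern belongs to class 1."""
    post = {}
    for g in pattern_counts:
        in_class = p * pattern_likelihood(g, m)
        out_class = (1.0 - p) * pattern_likelihood(g, u)
        post[g] = in_class / (in_class + out_class)
    return post


def fit_two_class_mixture(pattern_counts, n_iter=300, p=0.2, m_init=0.8, u_init=0.2):
    """EM for a two-class mixture; starting values are illustrative assumptions.

    p    : proportion of pairs in class 1 (the class we hope to label "links")
    m[j] : P(field j agrees | class 1);  u[j] : P(field j agrees | class 2)
    """
    k = len(next(iter(pattern_counts)))
    m, u = [m_init] * k, [u_init] * k
    n = sum(pattern_counts.values())
    for _ in range(n_iter):
        post = posteriors(pattern_counts, p, m, u)          # E-step
        w1 = sum(c * post[g] for g, c in pattern_counts.items())
        w0 = n - w1
        p = _clamp(w1 / n)                                  # M-step
        m = [_clamp(sum(c * post[g] * g[j] for g, c in pattern_counts.items())
                    / max(w1, EPS)) for j in range(k)]
        u = [_clamp(sum(c * (1.0 - post[g]) * g[j] for g, c in pattern_counts.items())
                    / max(w0, EPS)) for j in range(k)]
    return p, m, u, posteriors(pattern_counts, p, m, u)


counts = synthetic_pattern_counts(0.1, [0.9] * 4, [0.1] * 4, 10000)
p_hat, m_hat, u_hat, post = fit_two_class_mixture(counts)
```

Note that the fitted classes are only labeled by the model. Whether class 1 actually corresponds to true matches is exactly what clerical review would have to establish.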
Here we describe a record-linkage procedure with four components for use when comparing two files without duplicates. First, our procedure selects a mixture model from a set of candidate models by comparing probabilities estimated via maximum likelihood under the models to empirical probabilities obtained from the data. Second, based on the selected model, the cases (pairs of records) are sorted according to their estimated probabilities of being a match. Cases with high probabilities are designated links, whereas those with low probabilities are designated nonlinks. Third, some cases are sent to clerical review for determination of match status. Fourth, the model is refitted using both the clerically classified cases and the remaining unclassified cases. That is, the fit of the model is improved by adding information from clerical review of a few cases. The procedure iterates through identifying records, clerical review, and refitting as time and clerical resources allow, or until little additional improvement is needed or made. We stop our procedure when clerical review finds few new matches. The choice of a reasonable initial model along with clerical review and updating provide robust results. Our procedure appears to work well when applied to Census data.
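The second and third components, together with the stopping rule, can be sketched as follows. This is a minimal illustration, assuming each pair already has an estimated match probability from some fitted model. The cutoff values, the `min_new_matches` parameter, and the function names are hypothetical choices made for the example, not values or rules given in the paper.

```python
def triage(match_probs, link_cutoff=0.9, nonlink_cutoff=0.1):
    """Sort pairs by estimated match probability and split them three ways.

    match_probs maps a pair identifier to its estimated probability of being
    a match. The two cutoffs are assumed, illustrative values.
    """
    if not 0.0 <= nonlink_cutoff < link_cutoff <= 1.0:
        raise ValueError("need 0 <= nonlink_cutoff < link_cutoff <= 1")
    # Highest estimated match probability first.
    ranked = sorted(match_probs.items(), key=lambda item: item[1], reverse=True)
    links = [pair for pair, prob in ranked if prob >= link_cutoff]
    nonlinks = [pair for pair, prob in ranked if prob <= nonlink_cutoff]
    # Everything in between is uncertain and goes to clerks.
    review = [pair for pair, prob in ranked
              if nonlink_cutoff < prob < link_cutoff]
    return links, review, nonlinks


def should_stop(new_matches_found, min_new_matches=2):
    """Stop iterating once a round of clerical review finds few new matches."""
    return new_matches_found < min_new_matches


probs = {"a": 0.99, "b": 0.50, "c": 0.02, "d": 0.95}
links, review, nonlinks = triage(probs)
```

In an iterative setting, the pairs in `review` would be classified by clerks, the model would be refitted with those classifications treated as known, and `triage` would be run again on the updated probabilities until `should_stop` returns true or resources run out.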
Section 2 discusses mixture models, with and without interactions in the submodels for mixture classes, for categorical data. …