Risk adjustment has broad general application and is a key part of the Patient Protection and Affordable Care Act (ACA). Yet little has been written on how the data required to support risk adjustment should be collected. This paper offers analytical support for a distributed approach, in which insurers retain possession of claims but pass on summary statistics to the risk adjustment authority as needed. It shows that distributed approaches function as well as or better than centralized ones--where insurers submit raw claims data to the risk adjustment authority--in terms of the goals of risk adjustment. In particular, it shows how distributed data analysis can be used to calibrate risk adjustment models and calculate payments, both in theory and in practice--drawing on the experience of distributed models in other contexts. In addition, it explains how distributed methods support other goals of the ACA, and can support projects requiring data aggregation more generally. It concludes that states should seriously consider distributed methods to implement their risk adjustment programs.
The Patient Protection and Affordable Care Act (ACA) establishes state-based marketplaces, known as "exchanges," where individuals and small employers will be able to purchase health insurance starting in 2014. One potential concern is that there may be adverse selection in the exchange plans. Adverse selection is the allocation of people into plans based on determinants of expected medical spending that plans cannot observe or are prohibited from using to set premiums. The possibility of adverse selection can give purchasers and issuers of insurance incentives to choose or offer plans on the basis of selection rather than cost, quality, or efficiency.
The ACA takes several steps to mitigate problems from adverse selection. Most important, it establishes a risk adjustment program to provide payments to health insurance issuers that cover higher-risk populations financed by payments from issuers that cover lower-risk populations. (1) Given that insurers must cover all applicants on a "guaranteed issue" basis and cannot vary premiums on health status or fully adjust for other factors such as age, risk adjustment functions to balance risk among issuers and make them indifferent as to which individuals they enroll. The goals of the program, as set out by the U.S. Department of Health and Human Services (HHS 2011, [section] 153.330), are as follows:
1. Accurately explain cost variation within a given population;
2. Choose risk factors that are clinically meaningful to providers;
3. Encourage favorable behavior and discourage unfavorable behavior;
4. Limit gaming;
5. Use data that is complete, high in quality, and available in a timely fashion;
6. Provide stable risk scores over time and across plans; and
7. Minimize administrative burden.
As HHS (2011, [section] 153.340) points out, there are three ways to collect the data necessary to support a risk adjustment program:
1. A centralized approach, in which insurers submit raw claims data to HHS;
2. An intermediate approach, in which insurers submit raw claims data to state risk adjustment authorities; and
3. A distributed approach, in which insurers retain possession of raw claims data, but pass on summary statistics to HHS and state authorities as necessary.
Understanding the strengths and weaknesses of these competing approaches is important. Although HHS (2012) has indicated that it would use a distributed approach when operating a risk adjustment program on behalf of a state, it also announced that it was permitting states that elect to operate a risk adjustment program to choose the data collection approach that best suits their needs. Thus when deciding how to implement their programs, states will want to evaluate the likely effects of each approach in terms of the goals previously enumerated.
Yet, there has been little work on identifying whether and how distributed data approaches could be used as the basis for a modern risk adjustment system. In addition, limited research has sought to compare the three approaches from the perspective of different stakeholders, either in theory or as they are likely to be implemented.
This paper seeks to fill this gap. It shows that distributed data analysis can be used to calibrate risk adjustment models and calculate payments. It proposes a way to do this in practice, and shows that the statistical limitations of distributed analyses are unlikely to be important. It also catalogues the other strengths and weaknesses of a distributed approach, relative to an intermediate or a centralized approach. (2) In general, I find that a distributed approach is better in terms of data quality, minimizing administrative costs, and protecting the privacy and confidentiality of sensitive information. I conclude that a distributed approach functions as well as a centralized approach in terms of the possibility of issuers' errors, misreporting, methodological completeness, and support for other functions requiring encounter-level data.
In addition, I find that other goals of the ACA, such as facilitating secondary uses of risk adjustment data, can also be accomplished through a distributed model. And, based on experience employing distributed data models in other contexts, there is another argument for their use: by mitigating concerns related to data collection, they encourage greater collaboration among plans and between plans and government agencies. For these reasons, I conclude that states should seriously consider distributed methods to carry out risk adjustment under the ACA.
The paper proceeds as follows. The first section outlines how distributed and centralized approaches operate and compares their strengths and weaknesses in terms of the goals already mentioned. It compares the approaches from the perspective of HHS, state risk adjustment authorities, health insurance issuers, providers, and the public more generally. It also identifies the challenges unique to each approach and the challenges that they share. The second section explores the implications for risk adjustment under the ACA of previous uses of distributed approaches to the analysis of individual-level data, such as the Food and Drug Administration's (FDA) Sentinel program. The third section explains how a distributed approach to risk adjustment under the ACA might work in practice. In particular, it shows that a distributed approach would allow for estimation (and recalibration) of risk adjustment models; for reconciliation, verification, and auditing of risk adjustment and other transitional payments under the ACA; and for handling purchasers who switch between plans over time. The fourth section concludes with broader implications of the choice of data collection method for risk adjustment under the ACA.
[FIGURE 1 OMITTED]
Comparing Distributed and Centralized Approaches: Theory
Both approaches to data collection for risk adjustment start from the same point: individual-level claims data, also known as encounter data. Each claim generally contains information on the amount paid by the insurer, the procedures performed by the health care provider, and the health care provider's diagnoses of the insured patient. Each claim also contains an individual-specific identifier that can be matched to other individual-level demographic and plan characteristics, so that claims can be used to construct an individual-level treatment and expenditure history for a given coverage period.
The approaches also end at the same point: providing support for such analyses as estimation and recalibration of risk adjustment models; calculation of risk weights and risk scores; and determination of the payments that are required to and from each issuer in order to balance the expenditure consequences of differences in average risk scores across plans.
As Figures 1 and 2 show, the approaches differ in the paths they take. In a centralized approach (Figure 1), individual-level encounter data are transmitted to the state risk adjustment authority or HHS with periodic updates. The authority or HHS then aggregates the data and analyzes them. In a distributed approach (Figure 2), plans do not transmit encounter or individual-level data on a wholesale basis to the risk adjustment authority. Instead, a distributed approach allows such data to be kept securely within each issuer's computer system. Risk adjustment authorities conduct their analyses by submitting a computer program (written in a language such as SAS) to each insurer, retrieving summary statistics, and aggregating the statistics to make the necessary calculations. The mechanics of this are discussed in greater detail later.
Because the authority or HHS can require issuers to define the variables and the formats of their encounter data identically, any analysis that is separable across individuals and can be conducted on a centralized basis can also be conducted on a distributed basis. (3) The requirement of identically defined data enables the authority to obtain comparable results from all issuers. As long as the analyses are separable, there will always be a set of issuer-level results that can be aggregated so they are equivalent to what the authority could have obtained if it had possession of all of the individual-level data. Separable analyses include essentially all the analyses that a risk adjustment authority might need to undertake to fulfill its obligations under the ACA (Roski and McClellan 2011; with possible exceptions, and approaches to address these exceptions, discussed in the next section).
[FIGURE 2 OMITTED]
The simplest example is the calculation of an average. An average is separable across individuals because the average of a population is equal to the weighted averages of its subpopulations. To make this more tangible, if there are two issuers in a state, one with 1,000 enrollees who have average spending of $3,000, and another with 2,000 enrollees who have average spending of $6,000, then the average spending in the state is [(1,000 / 3,000) x $3,000] + [(2,000 / 3,000) x $6,000], or $5,000.
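The same aggregation can be written as a short program. A minimal sketch in Python (the issuer counts and averages are the hypothetical ones from the example above):

```python
# Minimal sketch: combining issuer-level summaries into a statewide average.
# The issuers and figures are the hypothetical ones from the text.
issuer_summaries = [
    {"issuer": "A", "enrollees": 1_000, "avg_spending": 3_000.0},
    {"issuer": "B", "enrollees": 2_000, "avg_spending": 6_000.0},
]

total_enrollees = sum(s["enrollees"] for s in issuer_summaries)
state_average = sum(
    (s["enrollees"] / total_enrollees) * s["avg_spending"]
    for s in issuer_summaries
)
print(state_average)  # 5000.0: identical to the average over pooled individual data
```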
Linear regression, the basis for calculating risk weights and risk scores, is also separable across individuals. This result can be seen intuitively in the case of regression involving only indicator variables, that is, variables that are equal to either 0 or 1. In this case, estimation of the regression coefficients is equivalent to the calculation of a set of conditional averages, each of which can be calculated at the issuer level and aggregated. The appendix to the paper proves this result for linear regression in general terms.
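A minimal numerical sketch of this result, with simulated data (the variable names and coefficients are illustrative, not part of any risk adjustment specification): each issuer returns only its cross-product matrices, and the authority's aggregate reproduces the pooled estimate exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_issuer(n):
    """Simulated encounter data: intercept plus two indicator risk markers."""
    X = np.column_stack([
        np.ones(n),                      # intercept
        rng.integers(0, 2, n),           # risk marker 1 (0/1)
        rng.integers(0, 2, n),           # risk marker 2 (0/1)
    ])
    y = X @ np.array([1_000.0, 2_500.0, 4_000.0]) + rng.normal(0, 500, n)
    return X, y

issuers = [simulate_issuer(n) for n in (1_000, 2_000)]

# Each issuer computes and returns only k-by-k and k-by-1 summaries...
XtX = sum(X.T @ X for X, _ in issuers)
Xty = sum(X.T @ y for X, y in issuers)
phi_distributed = np.linalg.solve(XtX, Xty)

# ...which reproduce the pooled (centralized) least-squares estimate exactly.
X_all = np.vstack([X for X, _ in issuers])
y_all = np.concatenate([y for _, y in issuers])
phi_centralized, *_ = np.linalg.lstsq(X_all, y_all, rcond=None)
assert np.allclose(phi_distributed, phi_centralized)
```

Because only k-by-k and k-by-1 summaries leave each issuer, no individual-level record is ever transmitted.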
In theory, there is no basis for the claim (subject to the exceptions subsequently discussed) that a centralized approach would result in a more complete or actuarially sound risk adjustment methodology. As long as the set of feasible methodologies is separable, every centralized analysis can be done on a distributed basis. There is also no basis for the claim that a centralized approach would provide better support for other functions that require encounter data. As long as those functions are separable across individuals, the support provided by a distributed approach would be identical to the support provided by a centralized one. (4) Finally, there is no basis for the claim that a distributed approach raises the risk of insurer error. Even with a distributed approach, the choice of models, the conversion of models into program code, and the program code itself are under the control of the state risk adjustment authority or HHS. Issuers simply execute the code and return the output. Issuers could also be required to return program log files to verify that the code had been executed correctly.
On all other dimensions--including privacy, data quality and validity, gaming, and administrative cost--a distributed approach performs as well as or better than a centralized one.
Privacy and Confidentiality
A distributed approach clearly offers advantages over a centralized approach on privacy grounds, from the perspective of both purchasers and issuers. From the perspective of purchasers, centralized approaches are inferior because they require the creation of at least one additional copy of personal health information, along with the purchasers' demographic characteristics: the original copy, contained in the issuer's system, and the duplicate copy or copies, contained in the state risk adjustment authority's or HHS's system. Each additional copy, in turn, increases the possibility of unauthorized or accidental disclosure.
Standard privacy protections can address, but cannot obviate, this concern. One commonly used protection is to de-identify the copy or copies of personal health information submitted to the public authority. In this context, de-identification involves stripping the copy of individual-specific variables that can be used in matching to external sources, and instead including an encrypted individual identifier that allows the holder of the data to match events within the copy but not outside of it.
Although this makes it more difficult to link the information in the copy to a particular individual, it does not make it impossible. First, it depends on the security of the encryption key; if the key is improperly or accidentally disclosed, then the security offered by the encryption is breached. Second, even if the key were kept secure, the presence of geographic location and demographic characteristics in the copy means that unauthorized users could determine individuals' identities probabilistically. Even if a centralized database omitted information such as name and address, it effectively would be impossible to prevent re-identification of individuals in longitudinal files that contained enough detail to support multiple uses such as risk adjustment (Brown, Holmes, Shah, et al. 2010).
An example illustrates how such identification could occur. If a de-identified data file contained zip code, age, gender, family status, employment status, and income, then some individual in it could be identified with probability 1/n, where n is the smallest number of individuals who share a particular zip code/age/gender/family status/employment status/income combination. As the set of variables used in the risk adjustment process grew larger, the scope for unauthorized probabilistic identification would grow. Similarly, people in sparsely populated areas would be more susceptible to unauthorized or accidental identification than people in densely populated areas. The Centers for Medicare and Medicaid Services (CMS 2012) privacy rules use a cell size of 10 individuals as the benchmark below which privacy concerns become important. By this standard, many zip codes are sufficiently sparsely populated that individuals could be identified probabilistically using the variables likely to be employed in risk adjustment.
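A minimal sketch of how such a cell-size check might be run (the records and quasi-identifier fields are hypothetical; the threshold of 10 follows the CMS benchmark cited above):

```python
from collections import Counter

# Hypothetical de-identified records; the quasi-identifier fields follow the
# example in the text (zip code, age, gender, family status, employment, income).
records = [
    {"zip": "94305", "age": 67, "gender": "F", "family": "single", "employed": False, "income_band": "40-60k"},
    {"zip": "94305", "age": 67, "gender": "F", "family": "single", "employed": False, "income_band": "40-60k"},
    {"zip": "89049", "age": 82, "gender": "M", "family": "married", "employed": False, "income_band": "20-40k"},
]

QUASI_IDENTIFIERS = ("zip", "age", "gender", "family", "employed", "income_band")
CELL_SIZE_THRESHOLD = 10  # CMS (2012) benchmark cited in the text

# Count how many records share each quasi-identifier combination ("cell").
cells = Counter(tuple(r[q] for q in QUASI_IDENTIFIERS) for r in records)
for cell, n in cells.items():
    if n < CELL_SIZE_THRESHOLD:
        # An individual in this cell is re-identified with probability 1/n.
        print(f"cell {cell}: n={n}, re-identification probability {1/n:.2f}")
```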
A distributed approach also offers protections important to issuers and the functioning of competitive markets. In particular, a distributed approach makes it harder to use claims data to infer competitively sensitive information on prices paid to providers, plan design, and premium determination. In a centralized approach, individual-level encounter data with zip codes and other variables likely to be used in risk adjustment can easily be used for this purpose. Because there are so many fewer issuers than individuals, strategies designed to prevent identification of individuals cannot, in general, prevent identification of issuers. Aside from enhanced security, preventing unauthorized use of issuers' competitively sensitive information generally requires making the underlying data less useful for authorized purposes, such as encrypting one or more of the links between geographic area, plan characteristics, prices, and premiums or increasing the level of aggregation at which geographic area and plan characteristics are reported.
Data Quality and Validity
Ensuring data quality, validity, and consistency is a key challenge for both centralized and distributed approaches. Both require agreement by issuers on a common set of variable definitions and data architecture; validation of data prior to use in model calibration or other analysis; and establishment of processes by which state risk adjustment authorities or other users can query the data and obtain answers to questions that may arise about variable definition and data architecture. A centralized approach does not offer any obvious advantages along these dimensions over a distributed one. Standardization of data across issuers and resolution of unforeseen data inconsistencies require back-and-forth communication between data users and issuers regardless of the way that the data are stored.
Auditing is expected and necessary to verify that the data submitted by issuers accurately reflect the issuers' actual experience with their members. But such auditing is equally necessary--and equally burdensome--under both approaches. The state risk adjustment authority or HHS must have the ability to randomly require issuers to provide access to underlying data, whether those data are stored on a distributed or centralized basis.
Indeed, a distributed approach may enhance data quality and validity relative to a centralized approach. Research on distributed data models conducted by McClellan (2011) at the Engelberg Center on Health Care Reform at the Brookings Institution, funded by the Robert Wood Johnson Foundation, suggests that storing risk adjustment data at the issuer level reduces the effort required to achieve a given level of quality. This is because the data reside with the entities that have the deepest understanding of them, and thus the greatest ability to address complications. In addition, a distributed approach automatically highlights data inconsistencies that might otherwise go unnoticed. Differences in the range or definition of variables across issuers become stark when the results of distributed analyses are gathered for purposes of aggregation.
Errors and the Potential for Misreporting
Another important goal for the design of a risk adjustment system is resistance to errors and the potential for misreporting. On this dimension, a distributed model is neither better nor worse than a centralized one. This is because, in both models, the risk of errors or misreporting arises when the individual-level encounter data sets are formed, prior to their aggregation or submission to the risk adjustment authority; neither model confers an advantage over the other at this stage in the process.
An example helps to illustrate. The most basic risk of error or potential for misreporting comes from the possibility that an issuer reports a higher average risk score or set of risk markers than was truly present in the issuer's pool. For this misreporting to go undetected, the other summary statistics that an authority would request, such as those that form the basis for the recalibration of the risk adjustment model, would also have to be altered. If they were not, it would be immediately obvious that the risk scores or markers were misreported. This is because those summary statistics (e.g., the set of covariances between the risk markers, and between the risk markers and medical spending) have a set of known functional relationships with the average risk scores and markers. (5) Practically, there are so many of these relationships that the only way that misreported risk scores or markers could go undetected would be if the individual-level encounter data on which the scores or markers were based were themselves misreported.
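A minimal sketch of such a cross-check, using the identities in note 5 on simulated binary markers (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
marker1 = rng.integers(0, 2, n)                              # binary risk marker
marker2 = (rng.random(n) < 0.3 * (1 + marker1)).astype(int)  # correlated marker

# Summary statistics an issuer would report.
p = marker1.mean()
q = marker2.mean()
r = (marker1 & marker2).mean()  # share of enrollees with both markers
reported_var1 = marker1.var()
reported_cov = np.cov(marker1, marker2, bias=True)[0, 1]

# Known functional relationships that honestly reported statistics must satisfy
# (note 5): a binary marker's variance is p(1 - p), and the covariance between
# two binary markers is r - pq. Inconsistent reports fail these checks.
assert np.isclose(reported_var1, p * (1 - p))
assert np.isclose(reported_cov, r - p * q)
```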
Other potential issues of concern include errors or otherwise inaccurate reporting of spending, the omission of relatively low-risk purchasers from data, or inaccurate coding. These issues arise equally, however, in a centralized or a distributed context. And, in a distributed model, risk adjustment authorities would have the right to request the same summary statistics for validation purposes that they would be able to compute were they in possession of a full copy of the individual-level data. Thus, as it concerns the detection of errors and the potential for misreporting, centralized and distributed data models perform at equivalent levels.
Administrative Costs
Centralized and distributed approaches incur different sorts of administrative costs. Aggregate administrative costs are probably lower in a distributed approach, but this conclusion depends on assumptions (although realistic ones) about the rules that will likely govern the risk adjustment data.
At the data collection stage, a distributed approach is clearly less administratively burdensome, for several reasons:
1. It does not require the secure exchange of very large individual-level data sets;
2. It allows updates and expansions of the data sets to be carried out at the issuer level without the involvement of the state or HHS;
3. It eliminates the need to create, secure, maintain, and manage access to a complex central data warehouse (Brown, Holmes, Shah, et al. 2010);
4. It leverages existing infrastructures established to support exchanges, such as the data infrastructure established to support the Qualified Health Plan accreditation process (McClellan 2011); and
5. As discussed later, the U.S. government has experience with implementation of distributed data collection and so has an established base of specifications, software, and potential vendors.
More generally, the choice between a centralized and a distributed approach involves a trade-off. At the analysis stage, in a centralized approach, once all of the issuer-specific individual-level data sets have been collected and validated, it is easier to execute a specific analysis. The data reside in one place, so summary statistics need not be obtained from multiple issuers and aggregated to produce a final result.
This apparent advantage, however, may not be as great as it seems. From the perspective of the state risk adjustment authority or HHS, most or all of the communication between its analysts and the issuers involved in executing a query can be automated. For example, a software front-end can convert an aggregate query into several issuer-specific ones, retrieve issuer-specific summary statistics, and aggregate them with little or no effort on the part of the analyst.
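A minimal sketch of such a front-end (all class and function names are hypothetical; a production system would add secure transport, authentication, and logging):

```python
from typing import Callable, Iterable
import numpy as np

# Hypothetical issuer endpoint: in production this would be a secure network
# call into the issuer's system; here it is just a local callable.
IssuerEndpoint = Callable[[str], dict]

def run_aggregate_query(endpoints: Iterable[IssuerEndpoint], program: str) -> dict:
    """Fan a single analyst query out to every issuer and pool the summaries."""
    results = [endpoint(program) for endpoint in endpoints]
    return {
        "XtX": sum(np.asarray(r["XtX"]) for r in results),
        "Xty": sum(np.asarray(r["Xty"]) for r in results),
        "n":   sum(r["n"] for r in results),
    }

def issuer_a(program: str) -> dict:
    # Each issuer executes the authority's program against its own data and
    # returns only summary statistics (toy numbers here).
    return {"XtX": [[2.0, 1.0], [1.0, 2.0]], "Xty": [3.0, 4.0], "n": 1_000}

def issuer_b(program: str) -> dict:
    return {"XtX": [[4.0, 0.0], [0.0, 4.0]], "Xty": [5.0, 6.0], "n": 2_000}

pooled = run_aggregate_query([issuer_a, issuer_b], program="risk_model_v1.sas")
coefficients = np.linalg.solve(pooled["XtX"], pooled["Xty"])
```

From the analyst's perspective, the fan-out and aggregation are invisible: a single call returns pooled summaries, just as a query against a central warehouse would.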
In addition, the advantage of a centralized approach in terms of speed and convenience may not be applicable in the risk adjustment context. The greater risks inherent in unauthorized use of data in centralized approaches are likely to require greater security and access control. The need to ensure public trust that the state risk adjustment authorities and HHS will only use the data for approved purposes will also likely limit the ability of analysts to access the individual-level data or run queries that have not been externally reviewed (Diamond, Mostashari, and Shirky 2009). These factors will tend to mitigate any advantages that centralized approaches might have.
Previous Uses of Distributed Analysis of Individual-level Data
Distributed analysis of individual-level data is not a new idea. In the late 1990s, the Technological Change in Health Care (TECH) Research Network formed an international collaboration of investigators in clinical medicine, economics, and epidemiology from 16 countries (McClellan and Kessler 1999). The TECH network provided evidence at the disease level on differences in technological change across developed countries, and on the causes and consequences of those differences. Because this research required individual-level data, a centralized approach to analysis was not feasible: in all of the member countries, individual-level data were subject to confidentiality restrictions, so sharing them internationally would have been logistically difficult or even prohibited by law. The TECH network used distributed methods, similar to those discussed earlier, to examine the causes and consequences of innovations in the treatment of heart attack, such as thrombolytics and primary angioplasty, for cost and outcomes (McClellan and Kessler 2002).
In one example of this work (Pilote et al. 2003), TECH investigators compared cardiac procedure use and outcomes in elderly patients with heart attacks in the U.S. and Quebec, Canada. They used administrative records in the two countries to construct two longitudinal databases with identically defined variables. They estimated linear regression models using the distributed methods described in the appendix of this paper to obtain consistent estimates of the age-specific trends in procedure use and mortality, holding constant comorbidities and other demographic characteristics. They found that use of cardiac procedures grew more rapidly in the U.S., particularly for patients aged 75 and older, and that this growth was accompanied by a larger, statistically significant decline in mortality from heart attack.
More recently, distributed approaches have been used successfully for post-marketing surveillance and research purposes by the U.S. Food and Drug Administration in its Mini-Sentinel Pilot program and by the U.S. Centers for Disease Control and Prevention (CDC) in its Vaccine Safety Datalink project. I discuss each of these experiences in turn, and conclude with some implications for the use of distributed analysis for risk adjustment purposes under the ACA.
The Mini-Sentinel Pilot (6)
The Mini-Sentinel Pilot program gives the FDA the capacity to analyze the health records of more than 60 million people to assess the safety of approved drugs. There are three types of Mini-Sentinel assessments: active surveillance of new molecular entities; rapid assessment of suspected adverse outcomes; and assessments of the impact of the FDA's regulatory actions. Recent Mini-Sentinel activities include assessment of the risk of heart attack in users of certain antidiabetic drugs; the risk of adverse cardiac outcomes in users of drugs for smoking cessation; and the consequences of the labeling change advising against long-term use of long-acting beta agonists.
The proposed risk adjustment mechanism shares many of the objectives of Mini-Sentinel, including:
* To coordinate among multiple private health insurance issuers that provide data;
* To analyze individual-level encounter data, including data on pharmaceutical use, that are transformed into a standardized format and reside on issuers' computer systems;
* To derive clinically meaningful information from insurance billing records;
* To adjust outcomes of interest for differences in risk factors;
* To achieve rapid turnaround of analyses requested by the operating authority; and
* To support multiple, secondary uses for the data.
The Vaccine Safety Datalink Project (7)
The Vaccine Safety Datalink (VSD) project enables researchers at the CDC to analyze the medical records of 8.8 million adults and children to detect rare and long-term adverse events from approved vaccines. Recent VSD studies include assessment of the impact of the hepatitis B vaccine on the risk of autoimmune thyroid disease; the safety of trivalent inactivated influenza vaccine in children ages six months to 23 months; and the effect of early thimerosal exposure on neuropsychological outcomes at ages 7 to 10. The VSD also has been used to conduct rapid-cycle analyses for post-marketing monitoring purposes.
From 1991 to 2000, the VSD used a centralized approach that required participating provider and insurer organizations to send individual-level encounter and demographic data files annually to the CDC for merging. When data were needed for a specific study, the CDC would send a subset to the organization responsible for undertaking the analysis. Because of confidentiality concerns, the centralized approach was replaced by a distributed approach in 2001.
The distributed approach used by the VSD is similar to that of the Mini-Sentinel project. The VSD coordinates among 10 participating data providers. Each provider assembles and maintains its own individual-level files according to a standardized protocol. CDC researchers conduct analyses on providers' data in two ways: by posting computer programs to a secure "hub," from which providers retrieve them at specified intervals and to which they return the results upon execution, or by submitting programs interactively through an encrypted direct connection. Each provider can review the output from the programs run on its machines, but does not have access to programs or output run on others' machines.
The success of distributed data collection in these settings shows that a distributed approach could be used for risk adjustment purposes. Although the specific variables used in analysis may differ, the data holders, high-level goals, and mechanics necessary for implementation are much the same. The fact that there are direct financial implications from risk adjustment may make gaming a more important concern in that setting, but as discussed previously, it does not militate against (or in favor of) a distributed approach.
[FIGURE 3 OMITTED]
A Distributed Approach to Risk Adjustment in Practice
Figure 3 presents one possible implementation of a distributed approach to risk adjustment under the ACA. As the figure's top panel shows, implementation has five phases. In the first phase, the parties would determine the basic architecture and specific parameters of the process. The bottom panel of the figure contains the most important of these: the network design; the variables to be included; the rules governing validation, auditing, and access for secondary analyses; and the processes for resolving questions and disagreements. The discussion of these terms should be informed by the experience of the FDA and CDC with Mini-Sentinel and VSD; it might also be informed by discussion with private consortia (Brown, Holmes, Syat, et al. 2010).
In the second phase, issuers would reformat their individual-level encounter data into the standardized format established in the first stage. As issuers completed this step, their data would be validated (the third phase) by state risk adjustment authorities, HHS, or a third party hired for this purpose (such as an actuarial or consulting firm). (8) The validation process should include a method of resolving technical and policy questions that might arise.
In the third and fourth phases, the data would be validated and audited periodically to ensure that they accurately reflected enrollees' true underlying health status. Audits can take several forms, including allowing the auditor to randomly select individual patients from the issuer's data and request supporting data in the form of original claims or encounter data, with the option of contacting providers to ensure the accuracy of the claims themselves. The audit process should include a method of resolving disagreements and any questions concerning adjustments proposed by the risk adjustment authority.
In the fifth and final phase, the state risk adjustment authority or HHS would use the validated and audited data to estimate risk adjustment models, calculate risk scores, and transfer risk funds. As shown previously, any analysis that is separable across individuals and which can be conducted on a centralized basis can also be conducted on a distributed basis. This includes linear regression modeling, the likely choice for risk adjustment under the ACA. (9) By implication, it includes linear regression with interaction terms, which would enable the risk adjustment model to allow risk factors to affect risk score nonlinearly, as long as the interaction terms affect risk score additively. Thus, a distributed approach can be used to estimate or recalibrate essentially all feasible risk adjustment models, as well as be used with pre-specified risk weights (as is likely in the first years of risk adjustment), thus satisfying one of the key requirements of HHS (2011) and referenced in HHS (2012). (10)
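For instance, a minimal sketch of a design matrix in which age enters through indicator bins and interacts with a condition marker (bin boundaries and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000
age = rng.integers(0, 65, n)
condition = rng.integers(0, 2, n)  # binary risk marker, e.g., a chronic condition

# Indicator bins let age affect the risk score nonlinearly; the interaction
# lets the condition's effect differ by age band. Everything is still additive
# in the columns of X, so the model remains a linear regression and the
# distributed X'X / X'y aggregation described earlier applies unchanged.
age_35_54 = ((age >= 35) & (age < 55)).astype(float)
age_55_up = (age >= 55).astype(float)
X = np.column_stack([
    np.ones(n),                # intercept (ages 0-34 are the reference group)
    age_35_54,
    age_55_up,
    condition,
    condition * age_55_up,     # interaction: condition effect for ages 55+
])
```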
As with the primary analyses, a distributed model can support approved secondary analyses as well as or better than a centralized one. For example, a distributed approach would allow for reconciliation of cost-sharing reduction payments, verification of risk corridor submissions, and auditing of cost-sharing reductions or reinsurance payments, all of which could be accomplished with modular programs executed on the individual-level encounter data developed in phase 2 noted earlier. This is because the calculation of reinsurance payments under the formula proposed in HHS (2011, 2012) (equal to a fixed share of an enrollee's medical costs above an attachment point but below a cap) is individually separable, as is the calculation of risk corridor payments and charges (equal to a specified share of the difference between allowable medical care costs and premiums, net of administrative costs, risk adjustment payments or charges, reinsurance payments, user fees, and cost-sharing assistance payments).
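A minimal sketch of this separability for reinsurance (the coinsurance rate, attachment point, and cap below are placeholders, not the HHS parameters):

```python
# Illustrative reinsurance parameters (placeholders, not the HHS values).
ATTACHMENT_POINT = 60_000.0
CAP = 250_000.0
COINSURANCE_RATE = 0.8

def reinsurance_payment(annual_costs: float) -> float:
    """Fixed share of an enrollee's costs above the attachment point, below the cap."""
    eligible = max(0.0, min(annual_costs, CAP) - ATTACHMENT_POINT)
    return COINSURANCE_RATE * eligible

# Because the formula depends only on each enrollee's own costs, each issuer can
# compute its total and report a single number; the issuer totals sum exactly
# to what a centralized calculation over pooled data would produce.
issuer_costs = {"A": [10_000.0, 80_000.0], "B": [300_000.0, 45_000.0]}
issuer_totals = {k: sum(map(reinsurance_payment, v)) for k, v in issuer_costs.items()}
statewide_total = sum(issuer_totals.values())
```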
One practical problem that both centralized and distributed approaches may need to address is the linking of individuals who switch health insurance issuers. If risk adjustment is purely concurrent, then linking is less of an issue. (11) In theory, under concurrent adjustment, an enrollee's risk score can be calculated using only demographic information and claims from the current issuer. (12) Concurrent models also use only current health and spending data for calibration purposes, so information on enrollees need not be linked across issuers in this context either. An enrollee who is with an issuer for only part of a year can have his claims experience weighted to reflect the shorter period for which he is observed. Prospective risk adjustment, in contrast, uses lagged health data to determine risk score, and lagged health data matched with current spending data for calibration. Thus, exchanges that use prospective risk adjustment will either need to link individuals who switch issuers or be limited to calculating risk scores based only on demographic information.
However, if risk adjustment is prospective, both centralized and distributed approaches will face this problem. In a recent question and answer paper about state exchange implementation, CMS (2011) stated that it did not propose and will not implement any risk adjustment method that calls for states or the federal government to collect personal data such as name, Social Security number, or address. Indeed, HHS (2012) prohibits states as part of risk adjustment data collection from gathering or storing any personally identifiable information for use as a unique identifier for an enrollee's data, unless that information is masked or encrypted by the issuer, with the key to that masking or encryption withheld from the state. Thus, it will be difficult for an exchange to link individuals across issuers even in a centralized approach. An exchange could effectively link individuals for purposes of prospective risk adjustment by requiring issuers to transfer individual-level health data, but not spending data, when an enrollee switches; (13) however, this would have to be imposed in either a centralized or distributed approach. As such, a distributed approach still dominates a centralized one on privacy grounds, insofar as it requires the transfer of less information (and only for a small subset of enrollees). In addition, methods to link individuals across issuers via anonymous methods that avoid requiring either party to disclose protected information to the other are being explored by the Mini-Sentinel program (FDA 2012), and may be applicable to a distributed approach to risk adjustment.
Conclusion
Risk adjustment is an essential element of the insurance market reforms adopted by the ACA. However, the statute does not specify how the data to support this function should be collected. As noted, there are three possibilities: a centralized approach, an intermediate approach, and a distributed approach. Understanding the strengths and weaknesses of these very different models is important. HHS (2012) leaves the selection of a data collection approach to the states, so states will need to compare how each is likely to perform.
In this paper, I offer analytical support for the choice of a distributed model. I show that a distributed model functions as well as an intermediate or centralized one in terms of controlling errors and misreporting, ensuring methodological completeness, and providing support for other functions requiring encounter-level data; in terms of privacy, state flexibility, data quality, and administrative costs, a distributed model is probably, or surely, better. In particular, I show that essentially all risk adjustment methodologies that HHS or a state might want to undertake can be conducted just as well on a distributed basis as on an intermediate or centralized one. Data maintained on a distributed basis can be used to calculate risk adjustment payments based on a set of pre-specified risk weights; to recalibrate essentially all risk adjustment models; and to determine payments under the Transitional Reinsurance and Temporary Risk Corridor Programs. The statistical theory behind this is not new; it has been used as the basis for several successful research projects by private consortia and regulatory activities by the FDA and CDC.
The case for distributed models is even stronger because they have other benefits. Most important, use of a distributed model for risk adjustment could facilitate other analyses that have important public policy purposes. A distributed approach could mitigate many of the concerns of plans and enrollees, including unauthorized derivation of competitively sensitive information or the disclosure of personally identifiable information.
In particular, a distributed approach provides a viable platform for quality improvement activities. As McClellan (2011) and others have observed, the ideal data sources for quality measures overlap significantly with the sources used in risk adjustment. By reducing the incremental costs of quality improvement, a distributed approach to risk adjustment makes it more likely that quality improvement programs will be adopted.
Appendix
Consider a linear regression model of medical expenditures, specified as a function of the characteristics of individual patients and health plans:

$Y_{itc} = \delta_t + R_{itc}\beta + Z_{tc}\gamma + \epsilon_{itc} = X_{itc}\phi + \epsilon_{itc}$

where:

* $i$ varies across individual patients;
* $t$ varies across years;
* $c$ varies across plans;
* $Y_{itc}$ is the medical expenditures of patient $i$ during year $t$ in plan $c$;
* $\delta_t$ is a time fixed effect, capturing all fixed differences over time;
* $R_{itc}$ includes individual patient characteristics, including demographics and health status;
* $Z_{tc}$ includes plan characteristics, such as copayment levels and actuarial value; and
* $\epsilon_{itc}$ is an error term.
For simplicity, consider the problem of estimating the coefficient vector $\phi$ using data from two plans, A and B, for a single year (the estimation problem with more than two plans or multiple years is similar). Some matrix algebra shows that the least-squares estimator of $\phi$,

$(X'X)^{-1} X'Y,$

can be rewritten as

$(X_A'X_A + X_B'X_B)^{-1} (X_A'Y_A + X_B'Y_B).$

In this expression, the matrix $X_A$ has $k$ columns, one for each variable in the joint analysis, $(X_A^1, \ldots, X_A^k)$, and $N_A$ rows, one for each individual patient in plan A; $X_B$ is defined similarly.

Expressed in terms of individual elements of $X$ rather than matrix notation, the least-squares estimator can be written as

$\hat{\phi} = \Big( \sum_{i \in A} X_i'X_i + \sum_{i \in B} X_i'X_i \Big)^{-1} \Big( \sum_{i \in A} X_i'Y_i + \sum_{i \in B} X_i'Y_i \Big),$

where $X_i$ denotes the $1 \times k$ row vector of covariates for individual $i$ (year subscripts are suppressed).
Thus, simply by obtaining four matrices--$X_A'X_A$, $X_B'X_B$, $X_A'Y_A$, and $X_B'Y_B$--a state risk adjustment authority can conduct a joint regression analysis using both plans' data. However, none of these matrices contains any individual-level information: each matrix has at most $k$ rows and $k$ columns, because all the individual-level information has been summed together to form the cross-product matrices. Once the pooled regression coefficients have been calculated, their standard errors can also be calculated. To calculate heteroskedasticity-consistent (Huber-White) standard errors, each issuer $c$ needs to calculate the residuals $\epsilon_{itc}$ based on the pooled regression coefficients, and then calculate $X_c'\Omega_c X_c = \sum_{i,t} \epsilon_{itc}^2 X_{itc}'X_{itc}$, where $\Omega_c$ is the diagonal matrix of squared residuals. These $k \times k$ matrices can then be summed across issuers and combined with $(X'X)^{-1}$ in the usual sandwich formula.
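To make the mechanics concrete, here is a minimal numerical sketch (simulated, heteroskedastic data; all names are illustrative) of the distributed coefficient and Huber-White standard error calculations just described:

```python
import numpy as np

rng = np.random.default_rng(3)

def issuer_data(n):
    """Simulated issuer data with heteroskedastic errors."""
    X = np.column_stack([np.ones(n), rng.normal(size=n), rng.integers(0, 2, n)])
    y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(0, 1 + X[:, 2], n)
    return X, y

issuers = [issuer_data(n) for n in (800, 1_200)]

# Step 1: pooled coefficients from issuer-level cross-products.
XtX = sum(X.T @ X for X, _ in issuers)
Xty = sum(X.T @ y for X, y in issuers)
phi = np.linalg.solve(XtX, Xty)

# Step 2: each issuer computes residuals from the pooled coefficients and
# returns the k-by-k "meat" matrix: sum over its enrollees of e^2 * x'x.
meat = sum(((y - X @ phi) ** 2 * X.T) @ X for X, y in issuers)

# Step 3: the authority assembles the Huber-White sandwich estimator.
bread = np.linalg.inv(XtX)
cov = bread @ meat @ bread
huber_white_se = np.sqrt(np.diag(cov))
```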
References
Baggs, J., J. Gee, E. Lewis, et al. 2011. The Vaccine Safety Datalink: A Model for Monitoring Immunization Safety. Pediatrics 127 (Supp. 1): S45-S53.
Behrman, R. E., J. S. Benner, J. S. Brown, et al. 2011. Developing the Sentinel System--A National Resource for Evidence Development. New England Journal of Medicine 364(6): 498-499.
Brown, J. S., J. H. Holmes, K. Shah, et al. 2010. Distributed Health Data Networks: A Practical and Preferred Approach to Multi-institutional Evaluations of Comparative Effectiveness, Safety, and Quality of Care. Medical Care 48(6): S45-S51.
Brown, J. S., J. Holmes, B. Syat, et al. 2010. Blueprint for a Distributed Research Network to Conduct Population Studies and Safety Surveillance. Effective Health Care Program Research Reports No. 27. http://www.effectivehealthcare.ahrq.gov/ehc/products/206/465/27forcoding6-30-10.pdf. Accessed April 10, 2012.
Diamond, C. C., F. Mostashari, and C. Shirky. 2009. Collecting and Sharing Data for Population Health: A New Paradigm. Health Affairs 28(2): 454-466.
Ellis, R. 2008. Risk Adjustment in Health Care Markets: Concepts and Applications. In Paying for Health Care: New Ideas for a Changing Society, M. Lu and E. Jonsson, eds. Weinheim, Germany: Wiley-VCH Verlag.
Greene, W. 2012. Econometric Analysis, 7th ed. Upper Saddle River, N.J.: Prentice-Hall.
McClellan, M. 2011. Comments on the NPRM. http://www.brookings.edu/~/media/Files/rc/opinions/2011/1031_comments_on_final_rule/Risk%20Adjustment%20NPRM%20Comments%20Oct%2031%202011.pdf. Accessed April 10, 2012.
McClellan, M., and D. Kessler for the TECH Investigators. 1999. A Global Analysis of Technological Change in Health Care: The Case of Heart Attacks. Health Affairs 18(3): 250-255.
McClellan, M., and D. Kessler, eds. 2002. Technological Change in Health Care: A Global Analysis of Heart Attack. Ann Arbor: University of Michigan Press.
Pilote, L., O. Saynina, F. Lavoie, et al. 2003. Cardiac Procedure Use and Outcomes in Elderly Patients with Acute Myocardial Infarction in the United States and Canada, 1988 to 1994. Medical Care 41(7): 813-822.
Roski, J., and M. McClellan. 2011. Measuring Health Care Performance Now, Not Tomorrow: Essential Steps to Support Effective Health Reform. Health Affairs 30(4): 682-689.
U.S. Centers for Disease Control and Prevention (CDC). 2012. Vaccine Safety Datalink (VSD) Project. http://www.cdc.gov/vaccinesafety/activities/vsd.html#5. Accessed April 10, 2012.
U.S. Department of Health and Human Services (HHS). 2011. Standards Related to Reinsurance, Risk Corridors, and Risk Adjustment, Notice of Proposed Rulemaking. Federal Register 76(136), July 15.
--. 2012. Final Rule on the Standards Related to Reinsurance, Risk Corridors and Risk Adjustment. March 20.
U.S. Department of Health and Human Services, Centers for Medicare and Medicaid Services (CMS). 2011. State Exchange Implementation Questions and Answers. http://cciio.cms.gov/resources/files/Files2/11282011/exchange_q_and_a.pdf. Accessed April 10, 2012.
--. 2012. Data Privacy and Release. https://www.cms.gov/Research-Statistics-Data-and-Systems/Computer-Data-and-Systems/MinimumDataSets20/DataPrivacyandRelease.html. Accessed April 10, 2012.
U.S. Food and Drug Administration (FDA). 2011. FDA's Mini-Sentinel Program to Evaluate the Safety of Marketed Medical Products. http://www.mini-sentinel.org/work_products/Publications/Mini-Sentinel_Progress-and-Direction.pdf. Accessed April 10, 2012.
--. 2012. Link of Distributed Database Environments. http://mini-sentinel.org/data_activities/details.aspx?ID=113. Accessed April 10, 2012.
The author would like to thank the AHIP Foundation and its Institute for Health Systems Solutions for support, and Mark McClellan and M. Kate Bundorf for helpful conversations. The opinions expressed do not necessarily reflect those of Stanford University or the AHIP Foundation.
(1) It also establishes transitional reinsurance and temporary risk corridor programs, discussed later. It should be noted that risk adjustment, as it distributes funds related to risk among plans, does not address the issue of adverse selection against the market generally.
(2) In what follows, I refer to both an intermediate and a centralized approach as "centralized."
(3) A separable analysis is one that, for any two subpopulations of individuals, could be conducted independently on the subpopulations and then aggregated to produce exactly the same result that could have been obtained had the analysis been conducted jointly.
(4) The functions need not be separable across encounters as long as all of an individual's encounters for a given coverage period are contained within a single issuer.
(5) For example, if the average of a binary risk marker is $p$, then the variance of that risk marker must be $p(1 - p)$. If the average of one binary risk marker is $p$ and the average of another is $q$, and the proportion of patients with both markers is $r$, then the covariance between the two markers must be $r - pq$.
(6) The following discussion is based on Behrman et al. (2011) and U.S. FDA (2011).
(7) The following discussion is based on Baggs et al. (2011) and U.S. CDC (2012).
(8) The validation process should begin with the validator seeking to replicate summary statistics on variable distributions submitted to it by the issuer, in order to detect basic data transmission errors. The second step in the validation process would be to compare statistics across issuers, which would highlight more complex data collection mistakes that may have occurred.
(9) Nonlinear models are not in general separable across individuals because they are estimated by iterative methods. However, nonlinear models are not likely to be used for risk adjustment under the ACA because they offer few performance advantages over conventional linear models (Ellis 2008). Still, nonlinear models could be estimated on a distributed basis, iteration-by-iteration, although this would be difficult.
(10) A distributed approach can also be used to estimate a constrained linear regression, which means that it can set and remove age and smoking rating factors as part of the process of calibrating the risk adjustment model. This is because the parameters from a constrained regression can also be written as a function that is separable across individuals (Greene 2012, [section] 5.5.1).
(11) The fact that concurrent risk adjustment does not require linking across issuers, however, is only one factor among many in the choice among risk-adjustment models, which is an issue beyond the scope of this paper.
(12) Of course, to the extent that an enrollee has been with an issuer for a short period of time, his risk score will be measured imprecisely, but can still be constructed to be unbiased, even if risk markers are defined hierarchically. In this case, the mapping of claims into risk markers may need to depend on the length of the period for which the enrollee is observed.
(13) Lagged health data, but not lagged spending data, are needed in a prospective risk adjustment model.
Daniel P. Kessler, J.D., Ph.D., is a professor in the Law School and Graduate School of Business, and a senior fellow at the Hoover Institution, all at Stanford University. This paper received support from the AHIP Foundation. Address correspondence to Prof. Kessler at 434 Galvez Mall, Stanford University, Stanford, CA 94305. Email: email@example.com