Influence of selection of sample and accuracy of observations on correlation and regression results
Methods of determining linear and curvilinear regressions, together with appropriate measures of their significance and accuracy, have been set forth in previous chapters. These methods do not yield results representative of the universe from which the sample observations have been drawn, however, if the sample is not truly representative of the particular relation being determined in that universe. There are various ways in which the sample may fail to represent the universe, and the extent to which the correlation constants are biased varies both with the character of the unrepresentativeness and with the particular coefficient concerned. Each type of abnormality must therefore be treated separately.
The sample may be selected from the universe in such a way as to exclude all the observations falling beyond a certain value of a given variable, thus ruling out values either at one or at both extremes, or perhaps ruling out middle values and selecting only extreme ones. This may be done for the dependent variable, for the independent variable or variables, or for both together. Such a selection of observations produces certain specific effects upon the correlation constants. Under some conditions it may be very desirable to select the observations in this way, provided the resulting aberrations in the correlation constants are recognized and allowed for.
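The kinds of selection just described can be illustrated with a small simulation. The sketch below is not taken from the text: it assumes a hypothetical universe in which Y = 2X + error, with X and the error both normally distributed, and the cut-off values (0.5, 1.5, and 0) are arbitrary choices made for illustration. It compares the regression slope and the correlation coefficient from the full sample with those from samples restricted to middle values of X, to extreme values of X, and to upper values of Y.

```python
import random


def slope_and_r(xs, ys):
    """Return the least-squares slope of y on x and the correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx, sxy / (sxx * syy) ** 0.5


def simulate(n=20000, seed=0):
    """Draw a hypothetical universe and apply three illustrative selections."""
    rng = random.Random(seed)
    # Assumed relation (illustrative only): true slope 2, error s.d. 2.
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, 2.0) for x in xs]
    pairs = list(zip(xs, ys))

    selections = {
        "full sample": pairs,
        "middle X only (|x| < 0.5)": [p for p in pairs if abs(p[0]) < 0.5],
        "extreme X only (|x| > 1.5)": [p for p in pairs if abs(p[0]) > 1.5],
        "upper Y only (y > 0)": [p for p in pairs if p[1] > 0.0],
    }
    results = {}
    for label, subset in selections.items():
        sub_x = [p[0] for p in subset]
        sub_y = [p[1] for p in subset]
        results[label] = slope_and_r(sub_x, sub_y)
    return results


if __name__ == "__main__":
    for label, (slope, r) in simulate().items():
        print(f"{label:28s} slope = {slope:6.3f}   r = {r:6.3f}")
```

Under these assumptions, selection on the independent variable leaves the slope close to its value in the universe but lowers the correlation when only middle values are kept and raises it when only extreme values are kept. Selection on the dependent variable, by contrast, changes the slope itself. The exact figures depend on the assumed universe and cut-offs and should not be read as general constants.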
A second and somewhat more difficult type of problem to deal with arises when there are errors of measurement in obtaining the values of one or more of the variables--such errors as might arise, for example,