On the validity of area-based income measures to proxy household income

Background: This paper assesses the agreement between household-level income data and an area-based income measure, and whether or not discrepancies create meaningful differences when applied in regression equations estimating total household prescription drug expenditures.

Methods: Using administrative data files for the population of BC, Canada, we calculate income deciles from both area-based census data and Canada Revenue Agency validated household-level data. These deciles are then compared for misclassification. Spearman's correlation, kappa coefficients and weighted kappa coefficients are all calculated. We then assess the validity of using the area-based income measure as a proxy for household income in regression equations explaining socio-economic inequalities in total prescription drug expenditures.

Results: The variability between household-level income and area-based income is large. Only 37% of households are classified by area-based measures to be within one decile of the classification based on household-level incomes. Statistical evidence of the disagreement between income measures also indicates substantial misclassification, with Spearman's correlations, kappa coefficients and weighted kappa coefficients all indicating little agreement. The regression results show that the size of the coefficients changes considerably when area-based measures are used instead of household-level measures, and that use of area-based measures smooths out important variation across the income distribution.

Conclusion: These results suggest that, in some contexts, the choice of area-based versus household-level income can drive conclusions in an important way. Access to reliable household-level income/socio-economic data such as the tax-validated data used in this study would unambiguously improve health research and therefore the evidence on which health and social policy would ideally rest.

individual-level income information for the populations they study. In the absence of individual-level income data, investigators often supplement health research datasets with group-based measures such as area-based average income constructed from national census data [6]. Such measures are used as a proxy for individual-level income data on the assumption that household incomes will be reasonably homogeneous within small enough residential areas. If, however, there is significant heterogeneity in the areas used, then the aggregate measures can result in the ecological fallacy, in which an association observed between variables at an aggregate level does not represent the association that exists at an individual level [7].
Prior studies have investigated misclassification of income and other socio-economic variables by comparing individual versus area-level survey responses for small samples of the population [8,9] and by comparing survey-based measures for different-sized census areas [10,11]. Using a unique dataset that contains validated household income data for approximately 78% of the population of British Columbia (BC), we investigate the level of misclassification that can occur when census-defined, area-based income is used as a proxy for an individual's actual household-level income. As the question of most interest to researchers concerns how well aggregate variables perform when they are entered in health outcomes equations, we then assess the sensitivity of the analysis of health-related inequities in total prescription drug costs to whether income is measured as an area-based variable or a household-level variable.

Data
Our primary datasets are administrative files for the provincially administered, universal public medical and hospital health insurance program, the Medical Services Plan (MSP) of BC. This program covers virtually all 4.2 million residents of BC, excluding only those residents covered by federal health insurance programs (collectively about 4% of the population). We restrict our attention to households in which one or more members resided in BC for at least 275 days per year from 2001 to 2004, inclusive.
Household income was obtained from the 2004 registration files for the provincially administered, universal public pharmaceutical insurance program, BC PharmaCare. In addition to programs for social assistance recipients and other select populations, BC PharmaCare began offering income-based public drug coverage to all residents of the province in May 2003. Terms such as deductibles and coinsurance are based on household income, with more generous but still income-based coverage offered to senior citizens (residents aged 65 and older). For all households that registered to receive coverage, the BC Ministry of Health obtains net, pre-tax income information from the Canada Revenue Agency. Because of differences in coverage offered and average needs, 95% of households with one or more senior members were registered for Fair PharmaCare in 2004, whereas only 73% of non-senior households were registered.
The area-based income variables used in this study are based on linking MSP registry postal codes to average household income in the area as recorded in the 2001 Census. Statistics Canada collates average household income and composition for over 7,000 Census Dissemination Areas, each comprising 400 to 700 persons. For research purposes, these areas are sorted by income and aggregated into 1,000 strata. Income strata contain an average of 1,700 households, with some variation due to differences in population by postal code. Both the household-level and area-based income variables are based on the same income concept: gross income prior to any deductions.
Total individual expenditures on prescription drugs were obtained from BC PharmaNet. BC PharmaNet is an administrative dataset in which every prescription dispensed in the province must, by law, be entered; it is designed to support drug dispensing, drug monitoring and claims processing. These individual expenditures were aggregated at the household level according to registration files for the MSP program to create a variable indicating total household spending on prescription drugs.
The research data were extracted for this study from the British Columbia Linked Health Database and the BC PharmaNet database with permission of the BC Ministry of Health and the College of Pharmacists of BC. Ethics approval was obtained from the Behavioural Research Ethics Board at the University of British Columbia.

Statistical methods
The household-specific and area-based income measures were each aggregated into deciles (ordered from lowest to highest income). We assess the discrepancy between the two measures using the CRA-validated, household-specific incomes as the standard. We calculated Spearman's rank correlations between the income measures, as well as the kappa and weighted kappa coefficients, which measure the degree of non-random complete agreement and partial agreement, respectively, between the measures.
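The agreement statistics named above can be sketched in a few lines of standard-library Python. This is an illustration only, not the code used in the study: the function names are hypothetical, ties in `to_deciles` are broken by input order, and the weighted kappa uses linear weights, which is an assumption because the text does not state the weighting scheme.

```python
from collections import Counter


def to_deciles(values):
    """Assign each value a decile from 1 (lowest) to 10 (highest) by rank.
    Ties are broken by input order (an assumption of this sketch)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    deciles = [0] * len(values)
    for rank, i in enumerate(order):
        deciles[i] = rank * 10 // len(values) + 1
    return deciles


def kappa(a, b, k=10, weighted=False):
    """Cohen's kappa for two classifications on categories 1..k.
    With weighted=True, linear weights give partial credit for near misses."""
    n = len(a)
    cells, row, col = Counter(zip(a, b)), Counter(a), Counter(b)

    def w(i, j):
        if weighted:
            return 1 - abs(i - j) / (k - 1)
        return 1.0 if i == j else 0.0

    observed = sum(w(i, j) * c for (i, j), c in cells.items()) / n
    expected = sum(w(i, j) * row[i] * col[j] for i in row for j in col) / n ** 2
    return (observed - expected) / (1 - expected)


def average_ranks(x):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    ranks = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((p - mx) * (q - my) for p, q in zip(rx, ry))
    var_x = sum((p - mx) ** 2 for p in rx)
    var_y = sum((q - my) ** 2 for q in ry)
    return cov / (var_x * var_y) ** 0.5
```

With household and area-based deciles in two equal-length lists, `kappa(h, a)` gives complete agreement beyond chance, `kappa(h, a, weighted=True)` credits partial agreement, and `spearman(h, a)` gives the rank correlation.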
We proceed to examine whether the choice of income measure has an impact on how pharmaceutical expenditures are distributed by income status. We begin by examining the distribution of prescription drug expenditures by income decile, where the deciles are defined first according to household-level income and then according to neighbourhood-level income. As measurement error is accommodated more easily in regression analysis than in descriptive analysis, we also include a series of dummy variables for each version of the income variable in an OLS regression in order to determine whether area-based income and household income generate meaningfully different results when applied in a research context. We regress total drug expenditures on income with and without covariates controlling for the presence of one or more seniors in the household as well as household size. Through the comparison of coefficients between household-level income variables and area-level income variables, one can reach some conclusions about the appropriateness of substituting an area-based measure for a missing household-level variable in a regression equation. By including regressions with and without covariates, we can determine whether multivariate models influence the discrepancy between area-based and household-level variables.
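As a minimal sketch of the regression set-up described above, the following standard-library Python builds a design matrix with an intercept, one dummy per income decile except a reference decile, and optional covariates, then solves the normal equations. The helper names (`solve`, `ols`, `design_row`) are hypothetical and the solver is a plain Gaussian elimination chosen to keep the example self-contained. It is not the software used in the study. Treating the highest decile as the reference category follows the description in the Results.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) beta = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


def design_row(decile, covariates=(), reference=10, k=10):
    """Intercept, one dummy per decile except the reference, then covariates."""
    dummies = [1.0 if decile == d else 0.0 for d in range(1, k + 1) if d != reference]
    return [1.0] + dummies + [float(c) for c in covariates]


# Hypothetical households: (income decile, senior present, household size, spending).
households = [(d, s, n, 1000 - 50 * d + 200 * s + 30 * n)
              for d in range(1, 11) for s, n in [(0, 1), (1, 2), (0, 3)]]
rows = [design_row(d, (s, n)) for d, s, n, _ in households]
beta = ols(rows, [spend for *_, spend in households])
# beta[0] is the intercept, beta[1..9] the decile effects relative to decile 10,
# beta[10] the senior coefficient and beta[11] the household-size coefficient.
```

Running the same fit twice, once with household-level deciles and once with area-based deciles, and comparing the decile coefficients mirrors the comparison the text describes.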

Results
A total of 1.74 million households were registered for MSP and had valid postal codes for linkage with area-based income strata. This cohort accounts for 95% of the total population in the province. Of these households, 1.36 million were registered for the Fair PharmaCare program. Cross-tabulations of the household-level and area-based income measures are shown in Table 1, where NR indicates the percentage of households in each area-based decile who were not registered for the Fair PharmaCare program at the time of data collection. This table confirms that rates of participation in the income-based program are lower in higher-income neighbourhoods. Household-level incomes are therefore concentrated at the low end, because registration for income-based drug coverage involves a degree of self-selection. To adjust for this, our tables below present a "best-case" scenario wherein all non-registered households are assigned a hypothetical household-level income decile that is identical to their area-level income decile.
Table 2 shows the level of discrepancy between the household-level and area-based income measures. The area-based measures classify 15.6% of senior households and 14.9% of non-senior households as being within the same income decile as is determined by tax-reported household income. Approximately a third of non-senior households and two fifths of senior households are classified by area-based measures to be within one decile of the classification based on household-level incomes. In the "best-case" scenario, just over half of non-seniors and approximately 43% of seniors are within one decile of their household-level income.
Statistical evidence of the disagreement between income measures can be found in Table 3. The Spearman's correlations between the actual household-level income and area-based measures are always less than 0.40, suggesting little agreement. The kappa coefficient of non-random, complete agreement never exceeds 0.31, indicating very little complete agreement between area-based and actual household-level deciles even under the assumption of perfect correlation between area-based and household-level measures for all non-registrants. Similarly, the weighted kappa coefficients, which incorporate partial agreement, never exceed 0.5, even in the best-case scenario.
To examine whether these discrepancies result in any meaningful differences in an applied research context, we start by examining the distribution of total prescription drug expenditures by income decile, stratified by senior and non-senior households, first using household-level CRA-validated income and then using aggregate neighbourhood-level income (Table 4). Table 4 indicates that total prescription drug expenditures appear more equally distributed when we rank households by neighbourhood income than by household-level income, suggesting that neighbourhood-level income masks variation in the underlying household-level income variable.
In Table 5 we estimate the effect of household income on total prescription drug expenditures by using household-level income and neighbourhood-level income in separate regressions. The dummy variable for the highest income decile was not included in the regression; thus, the coefficients can be interpreted as the difference in total prescription drug costs between each income decile and the highest income decile. The regression results also reflect the pattern noted in Table 4. While the signs never differ, the household-level variables pick up a substantially larger coefficient than the corresponding neighbourhood-level variables. This again suggests that the neighbourhood-level variables are smoothing the distribution of total prescription drug expenditures across income deciles. While the coefficients on income deciles differ substantially between the two models, it is interesting to note that the coefficients on presence of seniors and household size do not. Both coefficients have the same sign and similar magnitude in the two models, indicating that the choice of income variable does not have a large effect on other coefficients in the model. The model based on household-level income also reports a higher adjusted R² statistic than that using the area-based measure, indicating that the goodness of fit is higher in the regression using household-level variables. We also find that the inclusion of covariates in the model does not substantially reduce the discrepancy between the coefficients on the two income measures (Table 5).
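The misclassification shares and the best-case adjustment can be illustrated with a short sketch. The function names and the use of `None` for non-registered households are assumptions made for this example. "Within one decile" is taken to include exact matches, which is one reading of the text rather than a definition given in it.

```python
def best_case(household_decile, area_decile):
    """Fill missing household deciles (None) with the area-based decile,
    as in the best-case scenario for non-registered households."""
    return [a if h is None else h for h, a in zip(household_decile, area_decile)]


def agreement_shares(household_decile, area_decile):
    """Return (share in the same decile, share within one decile),
    computed over households whose household-level decile is known."""
    pairs = [(h, a) for h, a in zip(household_decile, area_decile) if h is not None]
    same = sum(h == a for h, a in pairs) / len(pairs)
    within_one = sum(abs(h - a) <= 1 for h, a in pairs) / len(pairs)
    return same, within_one


# Hypothetical data: the third household is not registered.
household = [1, 3, None, 10, 5]
area = [1, 5, 7, 9, 5]

registered_only = agreement_shares(household, area)
with_best_case = agreement_shares(best_case(household, area), area)
```

Because imputed households agree with their area decile by construction, the best-case shares can only be at least as high as the shares for registered households alone, which is why the scenario serves as an upper bound on agreement.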
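For reference, the covariate-adjusted specification described above can be written as follows. The notation is assumed here for illustration and does not appear in the original: y_h is total prescription drug expenditure for household h, D_{hd} equals 1 if household h falls in income decile d, S_h indicates the presence of one or more seniors, and N_h is household size. The highest decile (d = 10) is the omitted category.

```latex
y_h = \alpha + \sum_{d=1}^{9} \beta_d D_{hd} + \gamma_1 S_h + \gamma_2 N_h + \varepsilon_h,
\qquad
\beta_d = \mathrm{E}[\,y_h \mid D_{hd} = 1, S_h, N_h\,] - \mathrm{E}[\,y_h \mid D_{h,10} = 1, S_h, N_h\,]
```

Under this reading, each coefficient is the expected difference in expenditure between decile d and the highest decile, holding the covariates fixed. The model without covariates drops the two gamma terms.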

Discussion
We found a substantial level of discrepancy between the area-based and household-level income measures. Using validated household income as the standard, area-based measures misclassified the income decile for eighty-five percent or more of the households in the data. We also found that these discrepancies did affect the size of coefficients in regression analyses, suggesting that very different conclusions can be reached regarding the 'same' issue depending on which income variable we use. Thus, these results indicate that, at least in some contexts, the choice of neighbourhood versus household income can drive conclusions in an important way. Our results are consistent with a large amount of work indicating substantial discrepancy between area-based and household SES measures [2,6,8,10].
There are also a couple of important caveats. The first is that our study did not examine the inclusion of income as simply one of several control variables, but rather only looked at the difference between household-level and area-level income when applied as the primary variable of interest. Thus, results cannot be extended to the use of income as a control in much larger regression equations. Second, these results are not meant to suggest that the use of neighbourhood income is inferior in all contexts. An author particularly concerned with measuring permanent income free of yearly fluctuations may find that neighbourhood income provides a better measure. When measuring access to health care, it might also be true that low-income families living in high-income neighbourhoods have better access to care than other similar low-income families simply because of where they live. Thus, an argument could be made for including both measures in this type of work.
While the level of agreement between area-based and household-level SES measures has frequently been studied, our work adds to the knowledge base for several reasons. It encompasses a larger number of Canadians: a sample of 78% of all households in British Columbia, including 95% of all senior households. Also, while other studies have tended to compare area-based measures to household-level survey data [6,8,9] or have compared two or more different-sized area-based measures [10,11], we have used highly reliable household-level income data validated with the Canada Revenue Agency. Therefore, we have been able to avoid all self-reporting bias, we have a great deal of confidence in our household-level income variable, and we have been able to analyze almost an entire population of a Canadian province.

Conclusion
While many authors have argued that household-level income should be used whenever possible, census-based aggregate measures will continue to be necessary for health research until household-level data become more readily available. Two suggestions can be made based on these research results. The first is that researchers should be cautious when interpreting the results of studies using aggregate measures as proxies for individual and household income. Area-based measures are approximations that are best suited to investigating major differences in incomes (e.g., differences of two or more quintiles) or to studying the context in which someone lives rather than their specific income. The second suggestion is perhaps obvious to researchers but important for governments and statistical agencies to fully understand: access to reliable individual-level income/socio-economic data, in addition to the neighbourhood-level income data that are currently available, would unambiguously improve health research and therefore the evidence on which health and social policy would ideally rest.