BMC Health Services Research (BioMed Central), Research article

Inter-rater reliability of nursing home quality indicators in the U.S.

Background: In the US, Quality Indicators (QI's) profiling and comparing the performance of hospitals, health plans, nursing homes and physicians are routinely published for consumer review. We report the results of the largest study of inter-rater reliability conducted on the nursing home assessments that generate the data used to derive publicly reported nursing home quality indicators.

Methods: We sampled nursing homes in 6 states, selecting up to 30 residents per facility who were observed and assessed by research nurses on 100 clinical assessment elements contained in the Minimum Data Set (MDS), and compared these assessments with the most recent assessment in the record done by facility nurses. Kappa statistics were generated for all data items and derived for 22 QI's, both over the entire sample and for each facility. Finally, facilities with many QI's with poor Kappa levels were compared on selected characteristics to facilities with many QI's with excellent Kappa levels.

Results: A total of 462 facilities in 6 states were approached and 219 agreed to participate, yielding a response rate of 47.4%. A total of 5758 residents were included in the inter-rater reliability analyses, around 27.5 per facility. Patients resembled the traditional nursing home resident: only 43.9% were continent of urine and only 25.2% were rated as likely to be discharged within the next 30 days. Resident-level comparative analyses revealed high inter-rater reliability levels (most items >.75). Using the research nurses as the "gold standard", we compared composite quality indicators based on their ratings with those based on facility nurses' ratings. All but two QI's had adequate Kappa levels, and 4 QI's had average Kappa values in excess of .80. We found that 16% of participating facilities performed poorly (Kappa <.4) on more than 6 of the 22 QI's, while 18% of facilities performed well (Kappa >.75) on 12 or more QI's. No facility characteristics were related to the reliability of the data on which QI's are based.
Conclusion: While a few QI's being used for public reporting have limited reliability as measured in US nursing homes today, the vast majority of QI's are measured reliably across the majority of nursing facilities. Although information about the average facility is reliable, how the public can identify those facilities whose data can be trusted and those whose data cannot remains a challenge.


Background
Health care providers' and insurers' accountability for the services that they render is increasingly a subject of concern to regulators, advocates and consumers [1]. As efforts to contain costs while increasing competition in the health care field have advanced in many countries, concerns about deteriorating quality of care now receive even more attention than health care costs. Measuring health care quality and comparing providers' performance has emerged as the most promising strategy for holding them accountable for the care they provide. [2] Quality measurement, performance monitoring and quality improvement are now a constant refrain throughout the entire sector in the US. [3] Hospitals regularly produce statistics regarding their performance in selected clinical areas, and most are now surveying their patients about their satisfaction with the care they receive. [4,5] Insurers, particularly managed care companies, are routinely compared on how well they ensure that preventive health services are delivered to their subscribers. [6] Surgeons' mortality rates are publicly reported in several US states, while ambulatory practices' performance in holding down waiting times and measuring blood glucose levels is compared and providers are rewarded accordingly. [7,8] Finally, since late 2002 all nursing homes in the US have been compared on numerous quality indicators developed over the past decade, and the results are regularly advertised in local newspapers and posted permanently on a web site. [9][10][11] Measures of nursing home quality have frequently been proposed and used by researchers in the past, but generally only for a small number of facilities or in select groups of facilities. Until recently, most such measures were based upon aggregate data reported by the home as part of the federally required survey and certification process. [12,13]

However, the federally mandated universal introduction of the Minimum Data Set (MDS) for resident assessment in all nursing homes in 1991 made it possible to construct uniform measures based upon common data characterizing all residents of all facilities. [13,14] The MDS was designed to improve the quality of clinical needs assessment in order to facilitate improved care planning for this increasingly frail population. [15] A comprehensive assessment is done upon admission to the facility, parts of which are updated periodically thereafter, with a complete reassessment done annually. As part of its changing approach to monitoring provider quality, in 1998 the government began requiring all nursing homes to computerize all the MDS assessments performed on all residents as a condition of participation in the Medicare and Medicaid programs. By 2002 over 10 million assessments a year were being entered into a national nursing home database.
Prior to and throughout the course of its implementation, the MDS was repeatedly tested for inter-rater reliability among trained nurse assessors in nursing homes, large and small, for-profit and voluntary, throughout the country. Results of these tests revealed adequate levels of reliability when the MDS was first implemented nationally in late 1990. [16] A modified version of the MDS was designed and retested in 1995 and was found to have improved reliability in the areas that had been less than adequate, while sustaining reasonably high reliability elsewhere. [17][18][19] While testing under research conditions revealed adequate reliability, other studies found comparisons of research assessments with those in the facility chart to be less positive. One study of 30 facilities found discrepancies in 67% of the items compared across residents and facilities, but the "errors" were often miscodings into adjacent categories and the bias was not systematic (neither "up-coding" to exacerbate nor "down-coding" to minimize the condition). Indeed, when reliability was assessed using the weighted Kappa statistic, the authors found that many items with poor absolute agreement rates did achieve adequate reliability. [20] The Office of the Inspector General undertook an audit of several facilities in 8 different states and also identified discrepancies between data in the chart and residents' conditions on independent assessment. [21] That analysis of observed discrepancies did not differentiate between those within one category and those that differed by more than one category on an ordinal scale, suggesting that had a weighted Kappa statistic been used, the results would have been more comparable with those reported by Morris and his colleagues.
The availability of clinically relevant, universal, uniform, and computerized information on all nursing home residents raised the possibility of using this information to improve nursing home care quality. As with most efforts designed to improve health care quality, the incentives and the targets were multifaceted. First, government regulators anticipated that creating indicators of nursing homes' quality performance would guide existing regulatory oversight processes, which had been characterized as idiosyncratic, and make them more rigorous and systematic. Secondly, the more enlightened facility administrators felt that such information could facilitate their own existing quality improvement activities. Finally, advocates for nursing home residents thought that making this information available would create greater "transparency" to guide consumers' choices of a long-term care facility.
Aggregate measures of nursing home quality based upon the MDS have been developed and tested in various contexts over the past decade. Residents' clinical conditions or problems in care processes are measured at the resident level and aggregated to represent the situation in a given facility. Zimmerman and his colleagues were among the first to develop, test and apply them. [22] Medical care quality process measures based upon medical record review have been proposed, and the Joint Commission on the Accreditation of Health Care Organizations (JCAHO) has instituted a mandatory mechanism for reporting an outcome indicator data set for all nursing homes they accredit. [23,24] In 1998 the Centers for Medicare and Medicaid Services (CMS) contracted with the authors' organizations to undertake a comprehensive review of existing QI's for nursing homes, with an aim of modifying or developing new QI's on which to compare facilities, with the ultimate purpose of reporting those publicly. [9] While this effort considered all possible QI domains, most attention was focused on care processes and clinical outcomes. To address the resulting gap, CMS issued another contract to develop QI's specifically designed to measure quality of life in nursing homes, but this effort remains in the developmental stage. [25] After a 6-month, six-state pilot project using a sub-set of the newly revised clinical process and outcome quality indicators, CMS began to publish on its web site facility-specific, MDS-based quality measures for every Medicare/Medicaid certified nursing facility in the country. The quality measures, applied to both long-stay and short-stay post-acute nursing home residents, included items such as pressure ulcer prevalence, restraint use, mobility improvement, pain, and ADL decline. Advertisements were published in every major newspaper ranking most nursing homes in the community in the form of "league tables".
Data on all measures for all facilities were included on CMS' "Nursing Home Compare" web site http://www.medicare.gov/NHCompare/home.asp.
As part of a national study to validate the newly revised and developed quality indicators, we undertook the largest test of the inter-rater reliability of the MDS ever conducted in order to determine whether the data elements used in the construction of quality indicators are sufficiently reliable to be used as the basis for public reporting. Prior testing of the MDS had generally been done in select facilities so the current study sought to estimate reliability across all raters in all facilities. Since quality indicators represent a facility specific aggregation of particular patient characteristics recorded on the MDS, we sought to identify the degree to which there was variability in reliability across facilities.

Overview
Participating facilities in six states agreed to allow trained research nurses to enter the facility, interview management staff, observe patient interactions and abstract a sample of up to 30 patient records. Research nurses conducted independent resident assessments of sampled residents by observing the patient, reviewing the chart and asking front-line staff about the residents' behavior. Some 100 data elements collected as part of the research nurses' assessments were compared to the most recent MDS for that patient done by facility nurses. The Kappa statistic was calculated for each data element and for the composite QI's, for all residents and separately per facility.

Sampling States, Facilities and Subjects
The final analytic sample for this study was comprised of 209 freestanding and hospital-based facilities located in six states: California, Illinois, Missouri, Ohio, Pennsylvania and Tennessee. States were selected for regional representation and size in terms of numbers of facilities. Facility selection was stratified based upon the volume of post-hospital, sub-acute care provided, as indicated by whether or not the facility was hospital based. Within these two strata, we sought to select facilities based upon their QI scores in the year prior to the study (2000) in order to compare historically poor and well performing facilities. A total of 338 non-hospital-based facilities and 124 hospital-based facilities were approached about participating in the study.
We attempted to select 30 residents per facility. In non-hospital-based facilities, the sample was comprised of 10 residents with a recently completed admission MDS assessment, 10 residents with a recently completed quarterly MDS assessment, and 10 residents with a recently completed annual MDS assessment. "Recently completed" assessments were defined as those completed within the month prior to the nurse researcher arriving at the site. If a sample could not be captured with recently completed assessments, the nurse assessors looked back as far as 90 days to fulfill the sample. In hospital-based facilities, the sample was the 30 most recently assessed patients.

Nurse Training and Data Collection
Research nurses were contracted from area Quality Improvement Organizations with experience performing quality review and assurance functions in nursing homes for the government. All research nurses participated in a five-day training and certification program led by five experienced RN researchers from one of our research organizations. Two and one-half days of the program were devoted to training in how to conduct resident assessments using a subset of items from the MDS, since these research nurses were being trained to serve as the "gold standard" against which the assessments of facility nurses would be compared. The didactic portion of the sessions was provided by a clinical nurse specialist with over ten years' experience in this area. The training manual included all corresponding guidelines for assessment from the standard MDS User's Manual. Trainees were instructed to follow the standard assessment processes specified in the User's Manual using multiple sources of information (e.g., resident observation, interviews with direct care staff, chart review). Scripted videotaped vignettes were presented to demonstrate interviewing techniques and to provide practice in coding. Trainees were paired for role-playing exercises to practice their interviewing skills. Case presentations and follow-up discussion were used to illustrate assessment techniques and correct coding responses. To certify competency in MDS assessment, each trainee completed a case and met individually with the lead trainer for review.
The field protocol had two component parts. The nurse assessor first completed the MDS, talking with the resident and a knowledgeable staff member and reviewing the medical record for the two-week assessment window. Once this was completed, the nurse assessor conducted a number of QI validation-related activities, including conducting three "walk-throughs" of the facility to characterize the ambience of the nursing home and facility care practices, receiving and reviewing a self-administered survey completed by the Administrator or Director of Nursing of the facility, and completing a process-related record review.
Nurse assessors were instructed to complete MDS assessments according to instructions provided in the Long Term Care Resident Assessment Instrument (RAI) User's Manual, Version 2.0 (October 1995). All relevant sources of information regarding resident status, including medical records, communication with residents and staff (including the CNA most familiar with the resident), and observation of residents, were to be utilized in determining the codes to be used for each of the 100 MDS items included in the reliability study. Per the RAI User's Manual, the medical record review was to provide a starting point in the assessment process. No additional guidance or criteria for assessment were communicated by the project team; thus, nurse assessors were expected to rely on clinical judgment and the face validity of the various data sources when making final determinations regarding MDS item coding. Finally, nurse assessors were instructed to complete MDS assessments prior to completing other data collection protocols, in order to ensure impartiality.
Two research nurses undertook data collection at each participating facility. Nurse researchers were required to complete at least two independent, paired assessments with their partner per facility. These cases were selected at random once the resident sample at each facility had been selected. Nurses were not to share findings until each of their assessments was complete and the data entered (all data were entered into laptops by research nurses on site using customized software). Inter-rater review cases were submitted to project investigators. While there were not enough residents assessed by the same pair of raters to permit inter-rater reliability assessments for each research nurse, we pooled the paired reliability assessments done among the research nurses. In this way, we established the general inter-rater reliability of the research nurses as an overall group. These data made it possible to substantiate the degree of agreement among the research nurses, to ensure that they could be treated as the "gold standard".

Measures
The abbreviated version of the MDS contained over 100 data elements. These data elements included both dichotomous elements (e.g. dementia present: yes or no) and ordinal elements (e.g. 5 levels of dependence in ambulation or transfer). Virtually all items included in the assessment were required in the construction of one of the 22 dichotomous cross-sectional QI's tested as part of the overall study. Only cross-sectional quality indicators could be tested for reliability, since our reliability data were based upon a single point in time when we did the data collection in each facility; longitudinal incidence or change quality indicators require measures of patient status at two consecutive assessments. Data elements included: cognitive patterns; communication/hearing patterns; mood and behavior patterns; physical functioning; continence in last 14 days; disease diagnoses; health conditions; oral/nutritional status; skin conditions; activity pursuit patterns; medications; special treatment procedures; and discharge potential and overall status. These items were selected both because they were incorporated into the construction of many of the QI's and because they constitute readily observable resident conditions as well as more complex clinical judgments.
Based upon these MDS data elements, a total of 22 cross-sectional quality indicators were constructed. For example, the prevalence of a urinary tract infection (UTI) in the resident population is a quality indicator defined using a denominator that includes all residents except those who are comatose or on a hospice care program, and a numerator defined as anyone in the denominator with a current UTI (a data element in the abbreviated assessment). Since we were testing the inter-rater reliability of the facility assessors in comparison with our research nurses, the QI's were measured and constructed at the patient level. Thus, in the case of the UTI indicator, only those patients who were in the denominator were used in the comparison of facility and research nurses. All other indicators were similarly constructed dichotomous items. Since some QI's have more denominator restrictions than others, the number of residents per facility used in calculating the facility-specific reliability estimate varied somewhat from the maximum sample size.
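The denominator/numerator logic described above can be sketched at the resident level. This is an illustrative reconstruction only; the field names (`comatose`, `hospice`, `current_uti`) are hypothetical stand-ins for the actual MDS item codes:

```python
# Sketch of resident-level QI construction for the UTI indicator,
# using hypothetical field names (the real MDS item codes differ).
def uti_indicator(resident):
    """Return None if the resident is excluded from the denominator,
    otherwise 1/0 for numerator membership (current UTI)."""
    if resident["comatose"] or resident["hospice"]:
        return None  # excluded: not in the denominator
    return 1 if resident["current_uti"] else 0

residents = [
    {"comatose": False, "hospice": False, "current_uti": True},
    {"comatose": True,  "hospice": False, "current_uti": False},
    {"comatose": False, "hospice": False, "current_uti": False},
]

# Resident-level values; only denominator members are compared
# between facility-nurse and research-nurse assessments.
values = [uti_indicator(r) for r in residents]
in_denominator = [v for v in values if v is not None]
prevalence = sum(in_denominator) / len(in_denominator)
print(values, prevalence)
```

Because excluded residents return `None`, the number of paired comparisons per facility shrinks as denominator restrictions accumulate, which is why some facility-level Kappas rest on fewer than 30 pairs.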

Analytic Approach
The approach used to test inter-rater reliability was the Kappa statistic, or the weighted Kappa for ordinal measures such as ADL performance. [26][27][28] This statistic compares two raters who have each observed and assessed the same patient independently. However, rather than merely calculating the percentage of cases on which they agree, the Kappa statistic corrects for "chance" agreement, where "chance" is a function of the prevalence of the condition being assessed. It is possible for two raters to agree 98 percent of the time that a resident had episodes of disorganized speech and yet, whenever one rater thought disorganized speech was present, the other never agreed. In this instance, in spite of the fact that the level of agreement would be very high, the Kappa would be very low. [29] Depending upon the importance of the assessment construct, having a low Kappa in the face of very high agreement and high prevalence could be very problematic, or trivial. However, since some quality indicators have relatively low prevalence, meaning that infrequent disagreements might be very important, we were quite sensitive to this possibility. For this reason, we present the Kappa statistic as well as the percentage agreement of the facility raters relative to the "gold standard" research nurses. The weighted and unweighted Kappas are identical for dichotomous (binary) measures such as all the Quality Indicators (presence or absence); however, ordinal measures like ADL or cognitive decision-making are more appropriately assessed with the weighted Kappa.
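The chance correction described above can be made concrete. The sketch below is a standard unweighted Cohen's Kappa (not code from the study), applied to a constructed example in which two raters agree 96 percent of the time yet the Kappa is near zero because they never agree on the positive ratings:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's Kappa for two raters' paired ratings.

    Corrects raw percentage agreement for the agreement expected
    by chance given each rater's marginal category frequencies.
    """
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    categories = set(freq_a) | set(freq_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / n**2
    return (observed - expected) / (1 - expected)

# High agreement, low Kappa: each rater codes 2 of 100 residents
# positive, but never the same residents.
a = [0] * 98 + [1, 1]
b = [0] * 96 + [1, 1, 0, 0]
print(round(cohens_kappa(a, b), 3))  # -0.02 despite 96% raw agreement
```

With 96 percent raw agreement but no overlap on the rare positive codes, the Kappa is slightly negative, which is exactly the low-prevalence pitfall the text warns about.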
The quality indicators are supposed to reflect the performance of a facility vis-à-vis a given aspect of quality. The reliability of each QI is actually a function of the reliability of the constituent data elements. [30] Even if the QI is composed of ordinal data elements (e.g. level of dependence in mobility), the QI definition of the numerator is based upon a specific "cut-point" which results in a dichotomous variable. Thus, in most instances the inter-rater reliability of a QI measured across numerous patients in a facility will be lower than that of most of the constituent elements, particularly if these are ordinal measures. Kappa statistics were calculated for all constituent data elements for each of the 22 QI's, as well as for each QI for each facility in the study.
By convention, a Kappa statistic of .70 or higher is excellent, whereas a Kappa statistic of less than .4 is considered unacceptable; levels in between are acceptable. We apply these conventions in our interpretation of the inter-rater reliability data, both for the individual MDS data elements and for the composite, dichotomous Quality Indicators. The number of pairs of observations per facility is between 25 and 30. This number of observations yields a fairly stable estimate of inter-rater reliability to characterize the facility, given that the observations are representative of the residents and nurse raters in the facility, and conditional on the relative prevalence and distribution of the condition (e.g. dementia or pain) in the facility. In some instances, restrictions on the residents included in the denominator of a QI reduce the number of paired comparisons within a facility. We set an arbitrary minimum of 5 paired cases needed to calculate the Kappa statistic. The confidence interval around an estimate of the Kappa is a function of the absolute percentage agreement, the prevalence (or variance) of the condition, and the number of pairs being compared. Holding constant the prevalence and agreement rate, the size of the confidence interval is primarily related to the number of observations. For a facility with 30 paired observations, the approximate 95% confidence interval is +/-.25, whereas for only 5 observations it is +/-.65. This lower threshold was almost never reached for any of the participating facilities. Since most measures in almost all facilities were based upon 25 or more residents, the Results section does not present confidence intervals, preferring to provide information on the prevalence of the condition.
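For ordinal items such as ADL performance, the weighted Kappa credits near-miss ratings that land in adjacent categories. The sketch below uses linear weights, a common convention; the study does not state which weighting scheme it applied:

```python
def weighted_kappa(rater_a, rater_b, n_levels):
    """Linearly weighted Kappa for ordinal ratings coded 0..n_levels-1.

    Disagreements into adjacent categories are penalized less than
    disagreements several categories apart. Linear weights are one
    common choice; quadratic weights are another.
    """
    n = len(rater_a)
    w = lambda i, j: 1 - abs(i - j) / (n_levels - 1)  # 1 = exact match
    observed = sum(w(x, y) for x, y in zip(rater_a, rater_b)) / n
    # Chance-expected weighted agreement from each rater's marginals.
    pa = [rater_a.count(k) / n for k in range(n_levels)]
    pb = [rater_b.count(k) / n for k in range(n_levels)]
    expected = sum(pa[i] * pb[j] * w(i, j)
                   for i in range(n_levels) for j in range(n_levels))
    return (observed - expected) / (1 - expected)

# Hypothetical 5-level ADL item with one adjacent-category disagreement.
adl_a = [0, 1, 2, 3, 4, 4]
adl_b = [0, 1, 2, 3, 3, 4]
print(round(weighted_kappa(adl_a, adl_b, 5), 3))
```

Because adjacent-category "errors" retain partial credit, the weighted Kappa for an ordinal item is at least as high as its unweighted counterpart when disagreements cluster near the diagonal, consistent with the adjacent-miscoding findings discussed in the Background.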

Results
A total of 462 facilities in 6 states were approached and 219 agreed to participate, yielding a response rate of 47.4%. The response rate for hospital-based facilities (N = 65 participating) was 52.4% and the response rate for free-standing facilities (N = 154 participating) was 45.6%. Of the 219 facilities that participated in some part of the overall study, 10 (6 hospital based) chose not to participate in the inter-rater reliability component. Participating facilities were of similar size to non-participants (average of 110 beds), but were less likely to be part of a chain (52.5% vs. 58.4%) or to be proprietary (50.2% vs. 61.7%).
A total of 5758 residents were included in the inter-rater reliability analyses, around 27.5 per facility. Patients resembled the traditional nursing home resident: only 43.9% were continent of urine, 1.7% were coded as having end-stage disease, and only 25.2% were rated as likely to be discharged within the next 30 days (most of these were in hospital-based facilities).
The average gap between the facility rater assessment and the gold rater assessment was 25 days (SD = 27), and under 2% were beyond 90 days (primarily long-stay residents). We examined whether facility and gold raters in agreement on each quality measure differed from those that disagreed in terms of the length of time elapsed between their assessments. We found no significant differences for any of the 22 quality measures. Under 10% of facilities had an average number of days between the research and facility assessments that was greater than 30 days, and when the QI Kappa values for these facilities were compared to those with shorter intervals, we found no statistically significant differences on any QI Kappa. Thus, all assessments of both the facility and the research nurse assessors were included in all reliability analyses.
A total of 119 patients were independently assessed by two research nurses. Table 1 presents the results of the comparisons for a number of the individual data elements included in the assessment. Inter-rater reliability was calculated for all 100 data elements; almost all reveal Kappa values in the excellent range, and only the 3 items shown were found to be in the "poor" range. Most items not shown had Kappa values resembling those shown. Those data elements where the weighted Kappa value is higher than the simple Kappa are ordinal response measures; additional variation in the distribution generally results in higher Kappa values. However, even for the 5-category ordinal response measures like dressing or pain intensity, we found very high rates of absolute agreement, suggesting that these research nurses really were assessing patients in the same way, as can be seen in Table 1.
Table 1 footnotes: i, j are the row and column numbers, and g the number of groups. ** Weighted Kappa inflated with the function sbicc = (2*kw)/(2*kw + (1 - kw)), where kw is the weighted Kappa.

Using the research nurses as the "gold standard", we compared their ratings with those of the facility nurses as manifest in the MDS in the record. The inter-rater reliability of the MDS assessment items between the "gold standard" and facility nurses revealed that 15 of the data elements had an "excellent" Kappa value. The relatively positive findings on the overall performance of the QI's belie considerable inter-facility variation. We classified facilities in terms of the absolute count of the 22 QI's for which they had a Kappa exceeding .75 versus the absolute number of QI's for which the Kappa fell below .40. This contrasts the likelihood that a facility has QI's with unacceptable Kappa values with the likelihood that it has exceptionally good reliability on some QI's. These two values were then plotted against each other to visually identify facilities that had relatively few HIGH Kappa values while having an exceptionally large number of LOW Kappas, and the reverse. There are clearly facilities in the off-diagonals, indicating that they performed very well on some QI's but also performed quite poorly on a reasonably high number of QI's. Nonetheless, as can be seen in Figure 3, the correlation between these two views of facility QI measurement reliability was good (-.67), with the average facility having nearly 10 QI's with Kappa values in the excellent range and around 6 in the "poor" range. In light of the substantial inter-facility variation in QI reliability, we sought to identify facility characteristics that might be related to data reliability.
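The inflation function quoted in the table footnote simplifies algebraically to 2·kw/(1 + kw), a Spearman-Brown-type step-up for two raters. A minimal sketch (illustrative only; the function name follows the footnote):

```python
def sbicc(kw):
    """Inflated weighted Kappa per the Table 1 footnote:
    sbicc = 2*kw / (2*kw + (1 - kw)), which simplifies to
    2*kw / (1 + kw), a Spearman-Brown-type step-up for two raters."""
    return (2 * kw) / (2 * kw + (1 - kw))

print(round(sbicc(0.6), 3))  # a weighted Kappa of .60 inflates to .75
```

Note that the adjustment leaves the endpoints fixed (sbicc(0) = 0, sbicc(1) = 1) and inflates intermediate values.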
There were 35 nursing homes with six or more low Kappa values (less than .40), which we compared with the 40 nursing homes with twelve or more Kappa values in excess of .70, using as the reference group the majority of nursing homes (n = 144) meeting neither threshold. As can be seen in Table 3, the "poor performers" did not significantly differ from the high performers or intermediate performers on facility occupancy rate or the percent of Medicaid or Medicare residents in the facility. In addition, there were no differences in the average acuity of residents at admission or during quarterly assessments (based on the nursing case-mix index values used to create case-mix classifications). [31] Finally, there were no significant differences between the facilities in the number of health-related deficiencies cited during the most recent state survey process, which we standardized per state. [32] Table 3 depicts the differences.

Discussion
This study represents one of the largest inter-rater reliability trials ever conducted, involving over 200 nursing facilities and nearly 6000 pairs of nursing home resident assessments. Relative to research nurses with proven high levels of inter-rater reliability, who can be treated as the "gold standard", we found reasonably high average levels of inter-rater reliability on the resident assessment information that is the basis for publicly reported measures of nursing home quality. We also found that almost all the composite quality indicators, measured in the average nursing facility in the six states studied, achieved adequate to good levels of inter-rater reliability. However, we did observe substantial facility variation in QI reliability. The majority of facilities participating in the study had reasonably good reliability on most quality indicators, but a minority of facilities revealed unacceptably poor levels of reliability on many quality indicators. Unfortunately, traditional organizational or resident case-mix measures did not differentiate between facilities with high levels of QI reliability and those with low levels.

Figure 1. Facility Kappa Values Comparing "Gold Standard" Raters with Facility Nurses: Incontinence Quality Indicator. The distribution of Kappa values averaged for all residents in each facility, reflecting the inter-rater reliability of the "gold standard" nurses and facility nurses on the Incontinence quality indicator. The "Y" axis indicates the number of facilities and the "X" axis the facility inter-rater reliability level calculated for the Incontinence QI.
These findings are quite consistent with various prior studies of the reliability of the MDS as an assessment instrument for general use in the nursing home setting. [16][17][18] Earlier studies were based upon convenience samples of facilities located close to the investigators. This study may also help in understanding the results of several other prior studies of the reliability of the MDS assessment process, since we observed considerable inter-facility variation and variation in which data elements are likely to have reliability problems. The multi-facility studies done by Abt Associates and by the General Accounting Office found random (as opposed to directionally biased) disagreement, particularly on the ordinal, multi-level items used to assess functional performance. It is likely that use of a weighted Kappa might have revealed results more comparable to those presented here. On the other hand, our finding of considerable inter-facility variation in measurement may suggest that the selection of facilities for participation is influential in determining the results.
The reliability of the data used to construct quality indicators or measures of performance for health care providers has only recently emerged as an important methodological issue. [33,34,30,35] Kerr and her colleagues found reasonably good correspondence between computerized records in the Veterans Administration's clinical databases and more detailed medical charts, which gave them greater confidence that the performance measures they were calculating really reflected facility quality in the areas they were examining. Scinto and colleagues, as well as Huff, found that while simple process quality indicators were highly reliable when abstracted from records, more complex measures requiring data elements related to eligibility for the treatment had significantly lower reliability levels as measured by the Kappa statistic. Since the MDS represents the end-point of a complex clinical assessment of residents' needs, characteristics and the processes of care provided, it is encouraging to find reasonably high levels of reliability when two independent assessors undertake the same process of documentation. However, the more complex and subjective (less subject to direct observation) the assessment, the lower the reliability levels. Nonetheless, there are facilities which have high levels of inter-rater reliability even on the most "difficult" quality and functioning concepts. This suggests that it is possible to improve the quality of assessment and data. There may be something about how the assessment process is organized and documented, and whether the clinical assessment and care planning process are fully integrated, that influences the level of data reliability. Unfortunately, we know little about the organizational and supervisory processes that are associated with implementing thorough clinical assessments. There is some evidence that the initial introduction of the MDS was associated with improvements in care processes and outcomes, but how universal this is at present is not known. [36,37]
Unfortunately, preliminary analyses of the structural factors associated with QI reliability levels provided little insight as to what kinds of facilities are more likely to have adequate reliability. Indeed, even facilities with poor government quality inspection records were no more or less likely to have excellent or poor QI reliability levels.
Since our measure of reliability, Kappa, adjusts for chance agreement, particularly penalizing disagreements in the assessment of "rare" conditions, we do observe the well-documented relationship between reliability and prevalence. [29] While considerable statistical research has been devoted to adjusting the Kappa statistic for prevalence, the fact remains that disagreements about whether something rare is actually present are highly problematic. [38,28] Of interest in the case of QI reliability is that some quality problems will, in reality, be less prevalent in high quality facilities. Theoretically, this could result in lower levels of reliability precisely because the observed conditions are less prevalent. This fundamental measurement problem, the tension between quality measurement reliability and true quality, deserves considerably more attention in the methodological and quality improvement literature.
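The prevalence effect described above can be made concrete with a hypothetical 2x2 example (illustrative counts, not study data): two rater pairs with identical raw agreement of 90% yield very different Kappa values depending on how common the condition is.

```python
def kappa_2x2(yy, yn, ny, nn):
    """Cohen's Kappa from a 2x2 agreement table.

    yy: both raters say "present"; nn: both say "absent";
    yn / ny: the two kinds of disagreement.
    """
    n = yy + yn + ny + nn
    p_obs = (yy + nn) / n                     # raw agreement
    pa, pb = (yy + yn) / n, (yy + ny) / n     # "present" marginals
    p_exp = pa * pb + (1 - pa) * (1 - pb)     # agreement expected by chance
    return (p_obs - p_exp) / (1 - p_exp)

# Same 90% raw agreement, very different prevalence of the condition.
common = kappa_2x2(yy=45, yn=5, ny=5, nn=45)  # prevalence ~50% -> Kappa 0.80
rare = kappa_2x2(yy=2, yn=5, ny=5, nn=88)     # prevalence ~7%  -> Kappa ~0.23
```

With the condition present in about half the residents, 90% raw agreement translates into a Kappa of 0.80; with a prevalence of about 7%, the same raw agreement yields a Kappa of only about 0.23, because chance agreement is already very high. This is the mechanism by which genuinely low-prevalence quality problems, as in high quality facilities, can depress measured reliability.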
To the extent that the quality indicators now in use to characterize the performance of nursing homes throughout the United States are relied upon by consumers and regulators, our findings suggest that the reliability of the indicators will vary by facility. While most facilities have adequate to excellent data reliability on the bulk of QI's, there are others with more mixed, or generally poorer, reliability. Since, at the present time, the government has no mechanism for assessing data reliability on an ongoing basis, those using the publicly reported information will have no idea as to whether the observed QI rate reflects error or real facility performance. Efforts to automatically examine issues related to data quality, or to incorporate such examination into the annual facility inspection process, should be explored if we are to use nursing home QI's as they are intended. [39,40] There are various limitations in the current study. First, although we drew a stratified random sample of facilities in each of six states, we experienced less than a 50% response rate. There is some indication that non-participating facilities were smaller, proprietary and rural. It is likely that the performance of participating facilities might differ systematically from that of those that refused; however, it is not clear in which direction the difference might be. Indeed, among participants, these factors were unrelated to QI reliability levels. Obviously, asserting that our research nurses were actually the "gold standard" is subject to debate. While they adhered to the assessment protocol in which they were trained, which should mimic that used in all US nursing facilities, the research nurses clearly did not have the benefit of the more extended observation periods or the personal knowledge of patients' conditions available to facility staff.
Whether or not the research nurses represented the "truth" on all assessment items, it is clear that they were highly consistent among themselves, making them a reasonable yardstick against which existing assessments in each facility could be compared.

Conclusions
In summary, our study suggests that, by and large, the MDS-based nursing home quality indicators now being publicly reported are reliable. While there is variation in the level of reliability across the different indicators, which probably should be noted in the public reports, and some facilities clearly have less reliable data than others, most of the indicators are replicable and most facilities are measuring them reliably. It is imperative that the organizational factors, leadership practices and assessment processes that are associated with high and low levels of data reliability be carefully scrutinized. The fact that nearly half of all participating facilities had inadequate reliability levels on some of the publicly reported QI's could serve to undermine public confidence in the quality of information about nursing home quality.