Assessing methods for measurement of clinical outcomes and quality of care in primary care practices

Purpose: To evaluate the appropriateness of potential data sources for the population of performance indicators for primary care (PC) practices.
Methods: This project was a cross-sectional study of 7 multidisciplinary primary care teams in Ontario, Canada. Practices were recruited and 5-7 physicians per practice agreed to participate in the study. Patients of participating physicians (20-30 per physician) were recruited sequentially as they presented for a visit. Data collection included patient, provider and practice surveys, chart abstraction and linkage to administrative data sets. Matched-pairs analysis was used to examine the differences in the observed results for each indicator obtained using multiple data sources.
Results: Seven teams, 41 physicians, 94 associated staff and 998 patients were recruited. The survey response rate was 81% for patients, 93% for physicians and 83% for associated staff. Chart audits were successfully completed on all but one patient and linkage to administrative data was successful for all subjects. There were significant differences noted between the data collection methods for many measures. No single method of data collection was best for all outcomes. For most measures of technical quality of care, chart audit was the most accurate method of data collection. Patient surveys were more accurate for immunizations, chronic disease advice/information dispensed, some general health promotion items and possibly for medication use. Administrative data appear useful for indicators including chronic disease diagnosis and osteoporosis/breast screening.
Conclusions: Multiple data collection methods are required for a comprehensive assessment of performance in primary care practices. The choice of which methods are best for any one particular study or quality improvement initiative requires careful consideration of the biases that each method might introduce into the results.
In this study, both patients and providers were willing to participate in, and consent to, the collection and linkage of information from multiple sources that would be required for such assessments.


Background
Primary care, the first point of contact between patients and the health care system, includes disease prevention, health promotion, and chronic disease management. For over a decade, improving primary care has been a key element of health system reform around the world to improve health and health outcomes and reduce the cost of health care [1-9]. A program of research to evaluate the impacts of such reforms is essential [10,11]. The Canadian Institute for Health Information (CIHI) has proposed a comprehensive set of primary care outcome indicators [12], and has also noted significant gaps in the availability and quality of the data sources to populate these indicators [13]. There is a need for a set of validated, field-tested measurement tools to facilitate the population of these indicators [14]. These tools could also be used for quality improvement initiatives in primary care practices and to enhance accurate reporting of performance for pay-for-performance incentives [15]. Administrative data, electronic health records, chart audits, patient surveys, provider surveys, population-level surveys and direct observation have all been used for performance measurement in primary care, and could be used in the evaluation of primary care reform initiatives [14, 16-28].
The specific aim of this study was to attempt a comprehensive measurement of the quality of primary care provided by community-based, multidisciplinary primary care practices and to assess which measurement methods were best for which elements of care. This paper focuses on indicators related to the CIHI objectives on the delivery of "comprehensive" and "high quality and safe primary health care services" [12]. Five of the proposed comprehensive care indicators and fifteen proposed quality of care indicators under this objective were selected for inclusion (Table 1). Our underlying hypothesis was that the method of measurement would have a significant impact on the results obtained for many outcomes. An improved understanding of the types, magnitude, and direction of the bias or error introduced by each method is required to guide the appropriate selection of methods for practice-level performance reporting and quality improvement activities.

Setting and participants
This cross-sectional study was set in seven Family Health Teams (FHTs) in Eastern Ontario. FHTs are multidisciplinary group practices that share one of three non-fee-for-service funding mechanisms and receive support for information technology and integration of allied health providers into the practice. A convenience sample of 7 teams was approached and all agreed to participate. Within each FHT, 5-7 physicians were selected for participation as determined by local factors such as office locations and division of larger sites into functional units. In-office, sequential recruitment of 20-30 patients per physician was conducted over a 9-month period in 2008. Recruitment was conducted on regular clinic days when a mix of patients was being seen. Providers and practices were not informed about which patients had agreed to participate in the study. Each participating practice, physician and associated nursing or allied health professional (AHP) was asked to complete a survey, and the physicians were asked to consent to identification of their information in administrative data sets. Each patient was asked to complete a 2-part survey and to consent to a chart audit and linkage of their survey and chart audit data to administrative data sets. Physician, AHP, and practice survey data were collected for measurement of other outcomes (for example, team functioning) that were provided to the teams as a part of our feedback process and are reported elsewhere [29]. Table 1 presents a summary of the CIHI indicators selected, their definitions and the data sources that could be used to populate them. In addition, data on specific guideline-related outcomes for the chronic diseases included were also collected, including detailed medication use data, neuropathy screening in DM, and control of hyperlipidemia.

Patient survey
This survey consisted of 2 sections. The first section was completed in the waiting room before the visit with the provider. It captured patient descriptive information and elicited patients' experiences of the practice's performance on measures covering a broad range of dimensions of health care service delivery. The second section was completed after the appointment with the provider and captured visit-specific information, including measures of activities related to health prevention, promotion and chronic disease management. Questions were derived from other validated survey tools, including the Primary Care Assessment Tool (PCAT-Adult), the Patient Perceptions of Patient-Centredness (PPPC), the Canadian Community Health Survey and the National Physician Survey [22-26].

Chart audit
The chart audit forms captured information for four thematic areas: 1) patient demographic information, 2) visit activities (including referrals, prescriptions and orders), 3) chart organization and 4) measures of performance of technical quality of care, including prevention and chronic disease management. Chart abstractors were all provided standardized training and detailed written support material, based on those used in another major study of primary care models in Ontario [21]. For items such as colorectal cancer screening, where multiple screening options are available, information on each individual method was collected and a calculated value for completion was determined during data analysis. While all practices in this study had electronic health records (EHRs), we used a trained research associate to extract data rather than any automated data search strategy. This allowed a search of free-text areas of EHRs, supplementary paper records (which were still retained or in use in many practices), scanned image files of reports, and old charts. Questions arising during a chart abstraction were emailed centrally and resolved with input from the investigators. Independent re-abstraction of a sample of 60 charts was conducted to validate the data extraction process. Discrepancies were adjudicated by a third party, and all disagreements between chart abstractors were recorded and tallied. There was over 95% agreement between abstractors. The final data set was adjusted with the consensus value when discrepancies were noted.

Administrative billing data review
The Institute for Clinical Evaluative Sciences (ICES) is an agency supported by the government of Ontario yet operating at arm's length, which is charged with analysis of health sector administrative data in Ontario. Data holdings include a number of databases with information on providers and their practices, such as physician billings, drug utilization for publicly funded prescription medications, hospital inpatient and emergency room care, and census data. Consent from participating patients and providers to access their related administrative billing data was obtained for all physicians and patients. For those measures for which administrative data were available, a performance score using these data was determined using algorithms developed for earlier studies [18]. A data set with the results for each patient was created and linked to the chart and survey data using the ICES Key Number. The ICES key number (IKN) is a unique identifier assigned based on the Ontario Health Insurance Plan (OHIP) number of the patient. Provider ID numbers at ICES are based on College of Physicians and Surgeons of Ontario registration numbers and can be linked in a similar manner. All other study data were indexed using anonymous study ID numbers. Profiles of participating physicians and practices were created to allow comparison between study patients; all patients of study physicians (to assess bias introduced by the patient recruitment method); and patients from all FHTs in Ontario.

Data management and analysis
Survey and audit data were entered into SPSS version 16. Statistical analysis of comparisons between survey and chart audit data was also conducted in SPSS. Any analysis including comparisons to administrative data was conducted in SAS version 9.2. The results obtained from each data source were compared directly using both a comparison of proportion of concordant pairs and the kappa statistic. The clinical significance of the presence of discordant items, the likely reasons underlying discordant data, and the magnitude of the difference in result between methods were also considered [30,31]. To facilitate analysis we made some minor modifications to some of the definitions established by CIHI (for example looking at the individual components of composite measures, modifying time frames to match standard administrative data analysis algorithms). We did not assume that the chart was the "gold standard" for each item, using clinical experience and knowledge of the limitations of each data source to instead consider why the data sources did not always agree.
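The two agreement statistics used above can be illustrated with a small sketch. For each indicator measured by two sources on the same patients, the paired results form a 2x2 table from which the proportion of concordant pairs and Cohen's kappa are computed. The counts below are hypothetical and are not study data:

```python
def agreement_stats(a, b, c, d):
    """Percent agreement and Cohen's kappa for paired binary results.

    a = both sources "yes", b = source 1 only "yes",
    c = source 2 only "yes", d = both sources "no".
    """
    n = a + b + c + d
    p_observed = (a + d) / n  # proportion of concordant pairs
    # Expected chance agreement, from the marginal "yes"/"no" rates of each source
    p_chance = ((a + b) / n) * ((a + c) / n) + ((c + d) / n) * ((b + d) / n)
    kappa = (p_observed - p_chance) / (1 - p_chance)
    return p_observed, kappa

# Hypothetical comparison of chart audit vs patient survey for one indicator
agreement, kappa = agreement_stats(a=70, b=10, c=5, d=15)
print(f"agreement = {agreement:.2f}, kappa = {kappa:.2f}")  # agreement = 0.85, kappa = 0.57
```

Kappa discounts the agreement expected by chance alone, which is why a pair of sources can show high raw concordance yet only moderate kappa.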

Ethics
Ethics review and approval was obtained from the Research Ethics Boards at Queen's University, The University of Ottawa, The Ottawa Hospital, and Sunnybrook Hospital (ICES). All study procedures underwent privacy review and approval at ICES.

Results
Seven teams, 41 physicians, 94 associated staff and 998 patients were successfully recruited. Results from one site that kept detailed logs revealed an overall patient participation rate of 90%. Fifteen to twenty patients per day were recruited by a team of 2 research assistants. Completed surveys were returned by 813 patients (81%), 38 physicians (93%) and 77 associated staff (83%). For the items in this paper for which survey data is presented, valid responses were obtained from most subjects (86%-99%). Chart audits were successfully completed on all but one patient. There was over 95% agreement between chart abstractors on the sample of charts selected for validation. Linkage to administrative data was successful for 100% of participating patients and physicians.
The results of the physician, AHP, and practice surveys were not used to determine patient-level outcomes and will be reported elsewhere. Table 2 outlines the sociodemographic characteristics of the study patient sample in relation to other patients from the same practices and patients from all the FHTs in Ontario. The table shows that the study participants included more female, older, and sicker patients than those found in the same practices and in other Ontario FHTs. Tables 3, 4, 5, 6 and 7 present the results for preventive health interventions (Table 3), health promotion (Table 4), and chronic disease management (Tables 5, 6 and 7). For measurement of preventive health activities, Table 3 shows that administrative data had both over 80% agreement and kappa statistics >.4 for mammography and BMD when compared to chart abstraction data. The kappa statistics were between .35 and .45 and the levels of agreement lower (70-75%) for colorectal and cervical cancer screening, with a tendency for administrative data to underestimate rates of completion, while for immunization against influenza there was <60% agreement and a kappa of .25. The patient survey showed >75% agreement with the chart abstraction for mammography, BMD, cervical screening and clinical breast exam, but the kappa statistic was >.4 only for BMD. There was less than 75% agreement for influenza immunization and colorectal screening, with kappa values under .21 for both. Table 4 outlines the agreement between the patient survey replies and the information found in the chart only, as this information is not available through billing records. The only item with concordance over 75% was current smoking status. Past smoking status showed less than 50% agreement, and provision of advice on diet and exercise showed 70-75% concordance, with kappa values all <.2.
The levels of agreement between the administrative and chart data, as well as between the patient survey and the chart, for the presence of the index conditions of interest and the use of two broad categories of medications are presented in Table 5. Table 6 supplements this with a comparison between the survey data and administrative data for medication use in patients over 65. There was >85% agreement on the presence of the diagnosis. For medication use, concordance between the chart data and patient surveys was >75%, while the concordance between administrative and chart data was >75% for antihypertensives and only 57% for antilipidemics, with much higher rates of lipid-lowering agent use noted in the charts. In Table 6 the level of agreement between administrative data and the patient survey was between 70% and 75%, which is slightly lower than the level of agreement between the chart and the patient. The kappa statistic was higher for antihypertensives than for antilipidemic drugs. Table 7 presents the comparison for a number of disease-specific recommendations for the chronic diseases we examined, including a more detailed review of medication usage. Documentation of advice or resources provided had a level of agreement of <70%, with fewer than half of the events reported by patients being noted in the chart and a kappa <.4. For relatively less common but important key events such as MI and hospitalization, while agreement was >90% overall, there was poor agreement between the chart and patients who reported having had these events, with less than half being identified in the charts. For the remaining process-of-care measures the results were mixed, with levels of agreement >75% for FBS, lipid profiles, control of hyperlipidemia, A1C and foot exams, and <75% for the remainder. Kappa values were <.4 for all process-of-care outcomes other than medications.
The more detailed medication profiles showed levels of agreement >85% with kappas >.6 with the exception of ASA and Statins. ASA is an over the counter drug and despite a 78% agreement rate and kappa >.4, was only recorded in the chart in about 2/3 of patients who reported using it. In contrast, statins had only 54% agreement and a kappa <.2 due to patients reporting not taking them despite the drug being noted as active in their chart.

Discussion
There are relatively few studies in primary care performance measurement which have examined the differences in results obtained through different methods of measurement, especially involving administrative data [17,20,32,33]. This study examined the validity of data on primary care performance indicators obtained by various methods, as well as the acceptability, feasibility and potential biases of using a practice-based recruitment approach for the collection of linked data from a range of different methods. Our ability to collect data on multiple measures by audit, survey, and use of administrative data, with good participation rates, high agreement rates between chart assessors, high rates of completion for surveys, and little objection to data linkage, shows that the collection of linked data from multiple sources is both acceptable and feasible. This study used sequential recruitment of patients presenting for care in participating practices [22]. Participants had 50-100% higher rates of chronic disease and multi-morbidity than the practice population, children were under-represented, and women and the elderly were over-represented. Studies seeking a representative sample of all practice patients may wish to use other methods of recruitment. Studies on the care experiences of chronic disease patients, workload or work-process issues, or the daily experiences of practitioners may find this approach both appropriate and efficient. As our study was focused on the concordance between data collection methods, this design was unlikely to affect our results.
(Notes to Table 2: *Resource Utilization Band; **Adjusted Diagnosis Group. Following the completion of this study, a change in the methods used at ICES to identify patients allocated to physicians was introduced to correct for changes in the data received; a preliminary review of this table using the revised methods failed to demonstrate any significant differences from what is reported. All study participants were located in urban centres other than Toronto; rural FHTs (population < 100,000) and FHTs from the greater Toronto area were therefore not included in this comparison. DM = diabetes mellitus; HTN = hypertension; MI = myocardial infarction; CHF = congestive heart failure.)
Significant differences in performance were found using the different data collection methods for many indicators. No single data collection method emerged as consistently the most valid across all performance indicators. Only a limited number of indicators had kappa statistics >.4 (moderate or better agreement) [30]; however, in some cases these were much worse than the degree of concordance estimated by the proportion of concordant pairs. When interpreting our data, both kappa and the degree of concordance should be considered [31]. With the increasing use of administrative data for primary care performance reporting [19,34,35], remuneration and funding decisions [28], disease registries [27] and public health reporting [36], the comparison of administrative data results to multiple different data sources is especially important for guiding the use and interpretation of administrative data results in future research, policy, and planning [27,37,38].
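The caution above about reading kappa alongside raw concordance can be made concrete. In the hypothetical comparison below (illustrative numbers, not study data), two indicators have the same 85% proportion of concordant pairs, but the indicator with highly skewed prevalence yields a much lower kappa, because chance agreement is high when almost everyone is positive:

```python
def cohens_kappa(a, b, c, d):
    """Cohen's kappa from a 2x2 table of paired binary results
    (a = both "yes", b/c = one source only, d = both "no")."""
    n = a + b + c + d
    p_o = (a + d) / n  # observed agreement
    p_e = ((a + b) / n) * ((a + c) / n) + ((c + d) / n) * ((b + d) / n)  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Both hypothetical tables have 85 of 100 concordant pairs...
balanced = dict(a=45, b=7, c=8, d=40)  # prevalence near 50%
skewed = dict(a=80, b=7, c=8, d=5)     # prevalence near 85%

print(round(cohens_kappa(**balanced), 2))  # 0.70
print(round(cohens_kappa(**skewed), 2))    # 0.31
```

The same 85% agreement thus corresponds to moderate-to-good kappa in one case and only fair kappa in the other, which is why both statistics were reported and considered together.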
Good agreement across measurement methods was seen for preventive health imaging such as mammograms and bone mineral density scans. Previous research has found similar mammogram screening rates using patient surveys and chart audits [17,32]. This study adds the use of administrative data to this comparison and finds it to be a reasonable alternative. (Table notes: *survey "ever done" compared with a chart search back 2 years; **survey "ever done" compared with a chart search back 5 years. The total number of subjects for each item varies based on eligibility for the manoeuvre as well as completeness of the data set. Chart abstraction data had very few missing values, as for most items missing was coded as "No"; however, in the patient survey some individual questions or sections were skipped, resulting in a reduced subsample for the comparison in a given question.) Notable areas of discordance between administrative data and other
methods of data collection included Pap smears, colon cancer screening and influenza vaccination, all of which had much lower rates noted in the administrative data. Manoeuvres like Pap tests and colon cancer screening may be completed in contexts other than the delivery of primary care. Where such services are not included in routine administrative data sources and recourse to them is widespread, administrative data may be less accurate than other methods in capturing the care received by the patient. Thus, the local context of care may significantly alter the validity of administrative data results for primary care performance. Important differences between patient survey reports and chart audits were also found. Pap smears and influenza vaccination were reported at higher rates in the surveys. Patient over-reporting of Pap test rates has been previously reported [32]. Influenza vaccination in Ontario often occurs in public flu clinics and is therefore not always noted in the chart, so higher rates would be expected for this outcome. Patients who reported having received diet and exercise advice had a notation to this effect in their chart over 90% of the time. The more significant difference was in the opposite direction, with 25-30% of patients whose charts indicated this was discussed failing to report it in the survey. This finding contradicts previous research comparing chart audit and patient survey to direct observation of patient visits, which found that patients reported receiving advice more frequently than was noted in the chart [17]. This could be due in part to the multidisciplinary nature of these family health teams, where allied health providers contribute to the delivery of these services and their documentation in the patient record. However, this finding may also reflect the difficulties in communicating messages on health promotion in ways that are memorable or retained by patients.
For most medications there was good agreement between chart and survey responses, and moderate agreement between survey and administrative data. There were two exceptions. For statins, agreement between the chart and survey was poor, while there was much better agreement with administrative data. These discrepancies may represent issues of medication adherence, or poor understanding or recognition of a medication. Aspirin, which is available over the counter (OTC), was reported more frequently by patients than noted in the charts, likely reflecting a lack of documentation in the chart. These findings suggest that the patient survey may be the more accurate source for medications used; however, this source is not without its potential biases. In terms of clinical outcomes, patients were more likely to report that their blood pressure and lipids were at target than was reported in the chart. These findings highlight the importance of clear communication between patients, their physicians and other providers on issues such as medication use and management targets.
Recent efforts to assess the quality of chronic disease management (CDM) have relied on administrative data, EHR audits and population-based surveys to identify the patient population with the condition of interest [36,38]. For identifying the population with diabetes, there was strong agreement across measurement methods. For hypertension, the level of agreement was still good, but not as strong as for diabetes. For estimation of prevalence in the sample the margin of difference is fairly small (39% vs 36%), but there are large numbers of discordant pairs. In these situations, administrative data identified more cases than the charts, perhaps a reflection of care that occurs in specialist or hospital settings and is not captured in the primary care chart. For the purposes of registry generation, either method would be appropriate, with the cost of administrative data being much lower. Performance measurement in primary care increasingly forms the basis of quality improvement investments, performance bonuses, population health planning and reporting [15]. Others have expressed concerns about the unintended consequences of pay-for-performance systems and about the impact of performance measurement more generally on good clinical practice [38,39]. Our results indicate that careful consideration needs to be given to the methods used for assessing performance if these concerns are to be minimized. Future performance reporting should account for potential bias in results based on the data collection method and indicators measured.

Limitations
The study was carried out in only one region of Ontario, Canada, using a convenience sample of practices that included a large number of academic teaching practices. Many of these practices used hospital labs, reducing the ability of administrative data to capture those tests. The structure and quality of records may affect the results of the chart audit and may not be reflective of chart content in other locations. Ontario has extensive administrative data sets that have been cleaned for use in health services research studies, along with an extensive program of development of programming expertise and algorithms, that may not be available in all jurisdictions. Patients in Canada may also have different attitudes towards data privacy and linkage than those in other countries. We applied rules for eligibility for manoeuvres based on age, sex and presence of index conditions, and also assessed for any exclusion criteria within existing guidelines, but did not attempt to determine reasons for non-completion (e.g., patient refusal, deliberate deviation from guidelines due to co-morbidities). In addition, due to concerns about patient recall of the timing of specific manoeuvres, we elected to use an "ever received" format for our survey questions. The potential bias introduced would increase the degree of discrepancy observed and may partially explain the lower degree of agreement seen between the patient survey and chart abstraction for colorectal screening, which for most patients in Ontario is conducted with an annual FOBT.
In any study there is the possibility of a Hawthorne effect; however, it is unlikely in this study, as providers were not aware of which patients were participating and data collection was retrospective.

Conclusions
For many measures of technical quality of care, chart audit remains the most accurate method of data collection. Patient surveys are required for more accurate assessment of indicators such as immunizations, chronic disease advice/information dispensed, medication use, and some general health promotion items. Consecutive recruitment of patients in the waiting room yields a population that is sicker, older and more likely to be female than the practice population. Administrative data appear useful for a number of indicators including several aspects of screening and chronic disease diagnosis. Administrative data are much less costly than other methods of data collection and can cover entire populations. Recruitment rates of physicians and patients remained high even when permission was requested to link the data collected at the practice to the provincial health administrative databases. A comprehensive understanding of primary care performance will require the use of multiple data collection methods for the foreseeable future. The choice of which methods are best for any one particular study or quality improvement initiative requires careful consideration of the biases that each method might introduce into the results. Future studies should also consider assessing the reasons underlying divergence between decisions made at the individual patient level and recommended guidelines.