Testing proposed quality measures for treatment of carpal tunnel syndrome: feasibility, magnitude of quality gaps, and reliability

Background The American Academy of Orthopaedic Surgeons and American Society for Surgery of the Hand recently proposed three quality measures for carpal tunnel syndrome (CTS): Measure 1 - Discouraging routine use of Magnetic resonance imaging (MRI) for diagnosis of CTS; Measure 2 - Discouraging the use of adjunctive surgical procedures during carpal tunnel release (CTR); and Measure 3 - Discouraging the routine use of occupational and/or physical therapy after CTR. The goal of this study were to 1) Assess the feasibility of using the specifications to calculate the measures in real-world healthcare data and identify aspects of the specifications that might be clarified or improved; 2) Determine if the measures identify important variation in treatment quality that justifies expending resources for their further development and implementation; 3) Assess the facility- and surgeon-level reliability of measures. Methods The measures were calculated using national data from the Veterans Health Administration (VA) Corporate Data Warehouse for three fiscal years (FY; 2016–18). Facility- and surgeon-level performance and reliability were examined. To expand the testing context, the measures were also tested using data from an academic medical center. Results The denominator of Measure 1 was 132,049 VA patients newly diagnosed with CTS. The denominators of Measures 2 and 3 were 20,813 CTRs received by VA patients. The median facility-level performances on the three measures were 96.5, 100, and 94.7%, respectively. Of 130 VA facilities, none had < 90% performance on Measure 1. Among 111 facilities that performed CTRs, only 1 facility had < 90% performance on Measure 2. In contrast, 21 facilities (18.9%) and 333 surgeons (17.8%) had lower than 90% performance on Measure 3 (Median facility- and surgeon-level reliability for Measure 3 were very high (0.95 and 0.96 respectively). Conclusions Measure 3 displayed adequate facility- and surgeon-level variability and reliability to justify its use for quality monitoring and improvement purposes. Measures 1 and 2 lacked quality gaps, suggesting they should not be implemented in VA and need to be tested in other healthcare settings. Opportunities exist to refine the specifications of Measure 3 to ensure that different organizations calculate the measure in the same way.

(Continued from previous page) Conclusions: Measure 3 displayed adequate facility-and surgeon-level variability and reliability to justify its use for quality monitoring and improvement purposes. Measures 1 and 2 lacked quality gaps, suggesting they should not be implemented in VA and need to be tested in other healthcare settings. Opportunities exist to refine the specifications of Measure 3 to ensure that different organizations calculate the measure in the same way.
Keywords: Quality measurement, Hand surgery, Carpal tunnel syndrome, Veterans, carpal tunnel release Background Carpal tunnel syndrome (CTS) involves compression of the median nerve at the wrist and can cause debilitating symptoms, such as pain, numbness or tingling in the fingers, loss of sleep, and thumb weakness [1]. Among employed adults in the United States (US), the prevalence of CTS has been estimated to be 7.8%, the incidence rate to be 2.3 per 100 person-years, and the total associated medical costs of $2 billion [2,3]. In addition to nonoperative treatments provided in diverse healthcare settings, surgeons in the US annually complete more than 500,000 carpal tunnel releases (CTRs) to treat this common and disabling syndrome [4].
In 2016, the American Academy of Orthopaedic Surgeons (AAOS) published the "Management of Carpal Tunnel Syndrome Evidence-Based Clinical Practice Guideline" [1], also endorsed by the American Society for Surgery of the Hand (ASSH) and the American College of Radiology. The multidisciplinary guideline panel examined dozens of diagnostic and treatment practices for CTS and used a rigorous and systematic process to evaluate the strength of evidence linking each practice to important clinical outcomes. These guidelines represent a broad-based consensus regarding practice standards for the diagnosis and treatment of CTS. However, for evidence-graded clinical practice guidelines to improve the quality of care, practice recommendations with strong evidentiary support need to be operationalized into feasible, reliable, and valid healthcare quality measures [5]. When healthcare quality measures are developed, pilot tested, validated, and implemented to monitor consensus practice standards, then clinicians, administrators, policy makers, and patients can use them for diverse purposes [6].
Recognizing the need to operationalize consensus standards of care for CTS into feasible and valid quality measures, the ASSH/AAOS Carpal Tunnel Quality Measures Workgroup was convened in January 2017 to review the clinical practice guidelines with the purpose of selecting quality concepts and drafting initial specifications of process-oriented quality measures [7]. The workgroup engaged in a modified Delphi process to judge 22 quality concepts with strong or moderate supporting evidence (3-4 stars on a 1-4 star scale) from the clinical practice guideline. Strong evidence (4 stars) was defined as "Evidence from two or more "High" quality studies with consistent findings for recommending for or against the intervention" [1]. Moderate evidence (3 Stars) was defined as "Evidence from two or more "Moderate" quality studies with consistent findings, or evidence from a single "High" quality study for recommending for or against the intervention" [1]. The workgroup then evaluated the quality concepts in terms of the National Quality Forum (NQF) criteria of importance, acceptability, feasibility, and usability. Most of the quality concepts were deemed infeasible to operationalize with commonly available data. More details of the workgroup's process and results are in a publicly available technical report [7]. Table 1 presents the 3 measures that were selected by the workgroup for development and specification [7]. Measure 1 ("Avoidance of MRIs for diagnosis of CTS") is motivated by evidence that MRIs are a poor rule-out test for CTS as compared to clinical examination or nerve conduction studies [8]. Measure 2 ("Avoidance of Adjunctive Procedures during CTR") is motivated by studies failing to find benefits of adjunctive procedures (i.e., internal neurolysis with operating microscope, radical nine-tendon flexor synovectomy, tenolysis) beyond the benefits of CTR alone [9][10][11][12][13][14][15]. Measure 3 ("Avoidance of Routine in-clinic OT/PT after CTR") is motivated by studies failing to find benefits of in-clinic occupational and/or physical therapy (OT/PT) modalities after CTR compared to home programs or placebo [16][17][18][19].
Once quality measures have been proposed and initial specifications have been developed, it is essential to pilot test them in real-world healthcare data at the intended levels of measurement (e.g., surgeons, facilities, healthcare systems). Pilot testing of quality measures has several goals including: (a) determining the feasibility of using the measure specifications and available data elements to calculate the measures in healthcare data from diverse settings, and relatedly, identifying aspects of the specification that might be improved or clarified [20]; (b) assessing if the distribution of quality measure performance is varied enough to justify measure implementation, such that sufficient opportunity for improvement exists; c) examining the measures' reliabilitymeaning their ability to distinguish real differences in performance between measurement units (aka signal-to-noise-ratio) [21].
Initial specification and pilot testing of Measures 1-3 was done in 2012-2014 Truven Marketscan data which precluded testing of International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) versions of the measures, and did not include facility or surgeon identifiers, making it impossible to assess if meaningful variation exists between the units to which the measures will be applied (i.e., surgeons and healthcare facilities) [7].
The goal of this study was to test if the three CTS quality measures selected by ASSH/AAOS Carpal Tunnel Quality Measures Workgroup might be useful for monitoring and improving the quality of CTS treatment. We conducted measure testing in data from the Veterans Health Administration (VA), the largest publicly-funded integrated healthcare system in the US, and secondarily in a private academic medical center. Specifically, we sought to 1) Identify aspects of the specifications that might be clarified or improved; 2) Determine if the measures identify variation in treatment quality that justifies expending resources for their further development and implementation; 3) Assess the facility-and surgeon-level reliability of measures.

Study design and setting
All three measures were operationalized according to the AAOS/ASSH specifications [7] (Table 1) using national data from the Veterans Health Administration (VA) Corporate Data Warehouse (CDW) for each of three fiscal years (FY; 2016-18), and as one three-year measurement period. The VA CDW contains all clinically and administratively generated data (e.g., encounter, procedure, diagnosis data) from all VA facilities nationally. Measure data were aggregated to the level of major VA medical facilities. Overall performance and facilitylevel variation in each measure's performance was assessed. The distribution of surgeon-level performance was also evaluated for Measures 2 and 3, which are focused on carpal tunnel release (CTR) surgery. Although no established standards exist regarding what constitutes an adequate quality gap and variability to justify implementation of quality measures, we describe the proportion of measurement units (i.e., facilities, surgeons) who fall below 90% performance. If all, or almost all, measurement units have better than 90% performance, we consider the quality gap, and opportunity for further quality improvement, to be low.
For measures with adequate variability, facility-and surgeon-level reliability analysis was conducted. Reliability in this context refers to how well a measure distinguishes real differences in quality between measurement units (signal) in the presence of measurement error (noise), and is characterized with beta-binomial signalto-noise ratio [21]. This is the standard measure of quality measure reliability when accountable entities, VA facilities and surgeons in this case, are being measured on the proportion of patients that meet some standard [20,21]. This signal-to-noise ratio ranges between 0 when all of the variability can be attributed to measurement error and 1 when all the variability is due to real differences in quality. In addition to calculating overall reliability, we examined the impact of various minimum  Numerator: Patients with a diagnosis of CTS who did not receive an ipsilateral wrist MRI to evaluate for carpal tunnel syndrome within 90 days before or after the diagnosis (CPT: 73218, 73219, 73220, 73221, 73222, 73223)

Strength of evidence from CPG: Moderate
Rationale: MRIs are a weak or poor a rule out test for CTS as compared to hand pain diagrams and nerve conduction studies [8].
Measure 2: Discouraging adjunctive surgical procedures during carpal tunnel release (CTR) Description: Percent of patients diagnosed with CTS who receive CTR who did not receive the following procedures at the same time: Internal neurolysis using operating microscope, radical nine-tendon flexor synovectomy, tenolysis of flexor or extensor tendon, forearm and/or wrist. Type: Process measure using claims data.

Strength of evidence from CPG: Moderate
Rationale: Moderate quality studies have failed to find benefits to CTR postoperative rehabilitation of in-clinic OT/PT modalities compared to home programs or placebo [16][17][18][19]. a Included in our calculation but not included in most recent technical report [4]. case numbers (e.g., 5 vs. 30 over the measurement period) on reliability estimates.
In order to test the measures in a different healthcare context, they were also calculated for 2014-2016 Stanford Health Care (SHC) data using the STAnford medicine Research data Repository (STARR), a clinical data warehouse containing live electronic medical records, including all diagnoses, procedures, and encounter-level patient data. Overall measure performance was calculated, but SHC surgeon identifiers were not available.

Measure 1 (avoidance of MRIs for diagnosis of CTS)
As summarized in Table 1, the denominator of the measure is all patients with a diagnosis of CTS during the measurement period. We added G56.03bilateral CTSto the denominator definition and suggest the technical manual should also address this omission. Although not clearly specified in the technical manual, we chose each patient's first diagnosis of CTS in the measurement period as the index encounter. The CTS diagnoses were included regardless of the order of diagnoses recorded for the encounter. The numerator was the number of patients who did not receive an upper extremity MRI in the 90 days before or after the index encounter.

Measure 2 (avoidance of adjunctive procedures during CTR)
The denominator of the measure is all patients receiving CTR during the measurement period. The numerator was the number of patients who did not receive at least one of the adjunctive treatments at the same time.

Measure 3 (avoidance of routine in-clinic OT/PT after CTR)
The denominator of the measure is all patients receiving CTR during the measurement period. The numerator was the number of patients who did not receive in-clinic postoperative hand physical therapy or occupational therapy (OT/PT) within 6 weeks of CTR. To address concerns that the OT/PT visits were for something unrelated to CTS or CTR, we also pilot tested a version of Measure 3 that required that the OT/PT visit include a CTS diagnosis.

Measure 1
Of 132,049 VA patients who were diagnosed with CTS in FY16-18, 5066 (3.8%) received an upper extremity MRI (nationwide Measure 1 performance = 96.2%). Performance on Measure 1 across 130 VA facilities ranged from 90 to 100% ( Table 2). The results for each year from FY16 to FY18 were very similar. At SHC, only 1.9% percent of 4304 patient diagnosed with CTS received an upper extremity MRI (Measure 1 performance = 98.1%).

Possible improvements and clarifications to the technical specifications
The technical manual states "This measure is to be reported at each denominator eligible visit occurring during the reporting period for patients with a diagnosis of carpal tunnel syndrome who are seen during the reporting period." The definition of "denominator eligible visit" is unclear. We assumed that each patient should have only one denominator eligible visit ("index encounter") per measurement period, corresponding to their first CTS diagnosis. If the intention is to focus on the initial diagnosis of CTS, then a diagnosis-free period before the index encounter should be required. The ICD-10-CM code G56.03 (Carpal tunnel syndrome, bilateral upper limbs) should be included in the specifications. Also, as some VA patients were diagnosed with CTS initially as inpatients, the specification should be specific about whether inpatient diagnoses should be included.

Measure 2
Of 20,813 VA patients who received carpal tunnel release in FY16-18, only 31 received an adjunctive procedure (nationwide Measure 2 performance = 99.9%; Table 3). Among 111 facilities and 1873 surgeons that performed CTRs, 1 facility and 10 surgeons had lower than 90% performance on Measure 2. The results for each year from FY16 to FY18 were very similar. In SHC, only 21 of the 640 patients receiving CTR had an adjunctive procedure (Measure 2 performance = 96.7%).

Possible improvements and clarifications to the technical specifications
The technical manual states the denominator should be the "number of patients who underwent carpal tunnel release" and also that "This measure is to be reported at each denominator eligible visit occurring during the reporting period." Therefore, it is unclear if the denominator should be limited to one (first?) CTR per patient or if all CTRs should be included. For pilot testing, we included the first CTR per patient in each reporting period. It also important to note that measure statement is intended to discourage "internal neurolysis using operating microscope", but code 64727 is not that specific: "Neuroplasty (Exploration, Neurolysis or Nerve Decompression) Procedures on the Extracranial Nerves, Peripheral Nerves, and Autonomic Nervous System". Similarly, the measure statement discourages "radical nine-tendon flexor synovectomy" but code 25115 is less specific "Radical excision of bursa, synovia of wrist, or forearm tendon sheaths (eg, tenosynovitis, fungus, Tbc, or other granulomas, rheumatoid arthritis)". Therefore, the measures are capturing a somewhat broader range of procedures than suggested by the measure description.

Possible improvements and clarifications to the technical specifications
The technical manual states "This measure is an inverse measurelower scores indicate higher quality". But the numerator definition includes those who do not get OT/ PT. Consideration should be given to requiring that the OT/PT encounter includes a CTS diagnosis. Measures 2 and 3 are based on CTRs, but there is no time frame specified for when CTR can happen after CTS diagnosis. For the majority of VA patients in our dataset, CTR occurs within 1 year after diagnosis, but for~20% CTR occurs within 2 years, and~5% after more than 2 years.

Discussion
For healthcare quality measures to be useful and have potential to drive improvement in care, they need to be clearly specified, feasible to implement with commonly available data at the intended level of aggregation (e.g., facility, surgeon), and reveal a compelling quality gap that needs to be addressed. In evaluating three recently proposed measures of CTS treatment quality in a national publicly-funded healthcare system and a single private academic healthcare system, we found several aspects of the specifications that can be clarified or improved. Revealing opportunities to improve measure specifications is among the most important reasons to test measures in real-world healthcare data. In VA, we also found that the measures were feasible to implement at the facility and surgeon levels. Although we could not obtain surgeon identifiers in SHC data for research purposes, these data would be available for operational and quality monitoring efforts. Careful pilot testing of quality measures also provides an opportunity to appreciate the inherent tension between accuracy and feasibility when measure developers try to operationalize clinical evidence with billing codes. As noted above, the codes used to capture "internal neurolysis using operating microscope" and "radical ninetendon flexor synovectomy" capture a somewhat broader  Weeks. The x-axis represents unique VA surgeons, so each dot represents the performance of a different surgeon. The surgeons have been sorted by performance so the distribution is more visually interpretable range of procedures. Also, because procedure codes do not carry information regarding the indication for the procedure, it is unknown if the MRIs captured by Measure 1 were for CTS diagnosis or something else. While this is a well-known, and to some extent unavoidable, limitation of quality measure design, it is always important to evaluate how much the decisions to enhance feasibility distort the underlying evidence and clinical practice guidelines. Some noise is acceptable as long as it is more or less randomly distributed among accountable entities (e.g., hospitals or surgeons).
Of the three proposed measures, only Measure 3 displayed adequate facility-and surgeon-level variability to justify its use for quality monitoring and improvement purposes, at least in the VA and SHC. The extent to which these results generalize to other settings is unknown. The overall facility-and surgeon-level reliability for Measure 3 was excellent. Limiting the measure to surgeons with some minimum number of cases (e.g., 5 or 30) in the measurement period improved the range of reliability.
The underlying rationale for Measure 3 is to avoid routine use of postoperative in-clinic, instead of unsupervised, OT/PT. As with many quality measures, a baseline rate of OT/PT use is expected and will be warranted in some cases (e.g. CTR in a patient with severe finger arthritis or a patient having substantial postoperative stiffness). A measure's usefulness comes from identifying facilities or surgeons whose routine practice is suboptimal given the underlying evidence and the performance of their peers. For example, Fig. 1 reveals two high volume facilities that provided OT/PT to almost half of their CTR patients. In addition to future refinements already mentioned, Measure 3 specifications should address rules for handling patients with CTS in conjunction with other hand conditions, CTS affecting both hands, and surgeons with low measure reliability due to very low CTR volume (e.g., < 5 CTRs per year).
At least in the healthcare systems studied here, Measures 1 and 2 did not identify significant quality gaps that need to be addressed, and may not be useful as quality measures. Its good news that these clinical practice guidelines are already being implemented to a high degree. It is possible that testing in other data sources and health care systems would reveal lower performance and/or more variability. These results also suggest that future quality measure workgroups should focus on processes of care that not only have strong evidence and are feasible to specify, but that also have more undesirable variability in practice. The purpose of quality measures is to drive improvement, which presupposes room to improve.
The most important limitation in our evaluation is the reliance on data from only two healthcare systems, making unknown the generalizability of these results. The main reason we supplemented our VA analysis with data from a private academic institution was to explore whether differences exist between public and privately funded systems. While the results were very similar, Stanford is only one institution. The prior testing of these measures in Truven Marketscan data was unable to examine variability at the facility and surgeon levels, however, the overall estimates in performance for Measures 1-3 were 99, 98, and 86% respectively -consistent with our results [7]. Nonetheless, more testing in different contexts may be warranted before reaching final judgements on Measures 1 and 2, or implementing Measure 3. Also, we were forced to make certain decisions where the written specifications were unclear. Although this was a major purpose of the study, it is also possible that handling the ambiguities differently may produce different results.

Conclusions
We found that Measure 3 (Discouraging the routine use of occupational and/or physical therapy after CTR surgery) displayed adequate facility-and surgeon-level variability and reliability to justify its use for monitoring and quality improvement purposes. Measures 1 and 2 displayed lack of a quality gap, suggesting they should not be implemented unless their importance and measurement characteristics can be demonstrated in other healthcare systems. Opportunities exist to refine the specifications of Measure 3 to ensure different organizations calculate the measure in the same way. The results of this study highlight the importance of testing proposed quality measures in real-world healthcare data before widespread implementation. accountable for the author's own contributions and to ensure that questions related to the accuracy or integrity of any part of the work, even ones in which the author was not personally involved, are appropriately investigated, resolved, and the resolution documented in the literature.

Funding
This work was funded by grants from the VA HSR&D Service (RCS14-232; AHSH) and support from the Stanford University Department of Surgery. The views expressed do not reflect those of the US Department of Veterans Affairs (VA) or other institutions. These institutions had no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.

Availability of data and materials
The datasets generated and/or analyzed during the current study were derived from the VA Corporate Data Warehouse and STAnford medicine Research data Repository, and are not publicly available due to identifying nature of patients and providers. However, how data was collected and managed can be shared via the corresponding author on reasonable request.

Ethics approval and consent to participate
This work was granted a waiver of informed consent and was approved by the institutional review board of Stanford University and the research & development committee of VA Palo Alto.

Consent for publication
Not applicable.