Validation of an instrument for measuring satisfaction of patients undergoing hemodialysis

Background Patients’ satisfaction is an indicator of the quality of healthcare services. Its measurement involves developing and validating complex instruments. The purpose of this study was to validate a scale for measuring hemodialysis patients’ satisfaction with the provided care, the Scale for Evaluation of Hemodialysis Patient’s Satisfaction with Service provided at a Chronic Kidney Disease Unit (or ESUR-HD, its acronym in Spanish). Methods The instrument was applied to 370 patients undergoing hemodialysis in order to perform exploratory and confirmatory factor analyses, internal consistency assessment, and Rasch analysis. To assess test-retest reliability, the instrument was applied again to 54 patients after 2 days. Convergent validity was assessed by estimating correlation coefficients between the results of 2 instruments (ESUR-HD and SDIALOR) applied simultaneously to 70 patients. Sensitivity to change was assessed in 40 patients by comparing the scale scores before and after an intervention consisting of improved care conditions. Results A 9-factor structure was found in the 44 items of the scale (1: Facilities and organization of the service. 2: Care provided by the attending nurses and/or nursing assistants. 3: Attention to psychological and administrative issues. 4: Contact and social work personnel. 5: Medical attention and care. 6: Nutritional attention and care. 7: Medications supply and quality. 8: Features of the admission process. 9: Attention and care provided by head nurses). Cronbach's alpha for the scale was 0.96. Lin’s concordance correlation for the whole scale was 0.85. Although statistically different from 0, correlations with dimensions of another scale measuring the same attribute were low. The scale could detect construct changes through increased scores in specific dimensions following an intervention aimed at enhancing satisfaction. 
Rasch analysis located improperly fit items and suggested reducing items measurement levels. Despite the effect encountered, Rasch analysis showed the scale might not capture variability in upper attribute levels. Conclusion The ESUR-HD scale measures hemodialysis patients’ satisfaction in one dimension with 9 domains. Validity and reliability are adequate. The instrument may detect changes in the construct. Subsequent versions of the scale should include new items allowing improved discrimination amongst high satisfaction levels. Trial registration 10.1186/ISRCTN45318400. April 05, 2017


Background
Satisfaction with a healthcare service has been defined as the quality of an offered service as perceived by the patient, and is a performance indicator of healthcare organizations [1]. Such satisfaction is a top consideration when measuring whether healthcare and services fulfill patients' expectations and values [2]. It has been proposed that any quality evaluation of healthcare services include patient satisfaction, instead of being restricted to conventional indicators such as morbidity and mortality [3]. Satisfaction is a complex concept which depends on an individual patient's characteristics (e.g., lifestyle, previous healthcare experiences, values), as well as on social characteristics, particular disease issues, and healthcare services (follow-up, treatment adherence, health services stability) [4].
Healthcare quality is an increasingly important issue in medicine [5][6][7][8], especially regarding chronic conditions, as is the case of end-stage kidney disease. It has been seen that a patient's satisfaction is associated with adherence to therapy (i.e., increased satisfaction leads to improved adherence) [9].
Concerning kidney disease, quality improvement regards not just dialysis therapy but also related products and services [10]. Amongst such services, those associated with psychosocial issues are particularly important, as it has been shown that outcomes such as mortality are associated with depression, lack of psychosocial support, and patients' perceptions about their disease [11][12][13]. Patient satisfaction with chronic kidney care and caregivers, it has been said, also relates to quality of life as perceived by the patient [9,11,14].
Another remarkable aspect of a disease of this kind is that, as a result of the long-term and technical peculiarities of the therapy, the patients and the treating team build relationships that are usually close and lasting [13]. Only recently have the above-mentioned peculiarities of dialysis therapy been included in the tools for measuring patients' satisfaction with the care provided [15][16][17].
Some research into peritoneal dialysis services has shown that a patient's satisfaction is associated with the depth of information offered by the treating team, the compassion with which the service is provided, the efficiency of the supply of dialysis elements, and the presence of a nurse [16]. It has also been described that patients undergoing peritoneal dialysis show higher degrees of satisfaction than those undergoing hemodialysis and that their satisfaction could be improved by offering them information about potential adverse therapy events [18] and about peritoneal dialysis as an option [19]. One study reported that a negative perception of the treating nephrologist is associated with poor therapy adherence [15].
Despite the relevance of the subject, not many validated instruments are available for evaluating satisfaction among kidney disease patients undergoing dialysis therapy. The Choices for Healthy Outcomes in Caring for End-Stage Renal Disease (CHOICE) [17] is an instrument that has been used for comparing satisfaction with the type of dialysis therapy [19], and as the basis for the development of other instruments. Other instruments for evaluating satisfaction in patients undergoing dialysis are the Satisfaction of Patients in Chronic Dialysis (SEQUS) [18], the SDIALOR (Satisfaction des patients dialysés en Lorraine) [1], the Client Satisfaction Questionnaire (CSQ) [20], the Customer Quality Index (CQ-index), the Renal Treatment Satisfaction Questionnaire (RTSQ) [21] and the Consumer Assessment of Healthcare Providers and Systems In-Center Hemodialysis (CAHPS-ICH) survey [22].
Based on what has been stated so far, it may be argued that measuring a patient's satisfaction (1) is an essential element for evaluating the quality of healthcare services; (2) may be used as an institutional performance indicator; and (3) is related to a patient's quality of life and adherence to therapy. Until recently, it has not been possible to measure this construct in Colombia by means of instruments with known psychometric properties. Thus, validating a questionnaire that allows hemodialysis patients' satisfaction to be assessed in a valid and reliable way in Colombia is deemed important and is the objective of this study.

Methods
The 44-item scale for evaluating satisfaction with the service offered at a hemodialysis unit to chronic kidney disease patients (ESUR-HD) was initially developed by a group of nephrologists, nurses, and patients. Following a review of the literature, potential variables or dimensions associated with chronic kidney patients' satisfaction were identified.
Four focus groups, each with 3 nurses, a nephrologist, and two persons from the administrative area, were held in four different regions of the country. Participants were asked what the main aspects that could influence a patient's satisfaction were. The backbone of the evaluation was the set of processes and procedures of clinical care in hemodialysis of Renal Therapy Services.
Through the focus groups, the following dimensions were defined: overall satisfaction with the services (3 items), personnel at the unit (24 items), medications and supplies (4 items), facilities and processes (13 items), and phone contact (6 items). The instrument was designed as a phone survey. Following a preliminary trial, its initial structure was modified by removing 6 items because of redundancy or poor relevance. This has been the only available version of the scale and is the one used in this validation. Answer options for each of the 44 items are rated 1 to 5 by means of a Likert scale ranging from "Very unsatisfied" to "Very satisfied." A final score is obtained by non-weighted sum of the score given to each item; accordingly and as a result of items structure, higher scores reflect increased patient satisfaction. The time it takes to complete the instrument is 15 min (median time).
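The scoring rule described above (a non-weighted sum of 44 items, each rated 1 to 5) can be sketched in a few lines of Python. This is only an illustration: the function name `esur_hd_total` is hypothetical, and rejecting incomplete or out-of-range answers is an assumption, since the text does not describe how missing responses are handled.

```python
from typing import Sequence

N_ITEMS = 44                      # items in the validated version of the scale
MIN_RATING, MAX_RATING = 1, 5     # "Very unsatisfied" .. "Very satisfied"


def esur_hd_total(ratings: Sequence[int]) -> int:
    """Return the non-weighted sum of the item ratings.

    Assumption (not stated in the paper): a questionnaire with a missing
    or out-of-range answer is rejected rather than imputed.
    """
    if len(ratings) != N_ITEMS:
        raise ValueError(f"expected {N_ITEMS} ratings, got {len(ratings)}")
    for r in ratings:
        if not (MIN_RATING <= r <= MAX_RATING):
            raise ValueError(f"rating {r} outside {MIN_RATING}..{MAX_RATING}")
    return sum(ratings)


if __name__ == "__main__":
    # The total ranges from 44 (all "Very unsatisfied") to 220 (all "Very satisfied").
    print(esur_hd_total([1] * N_ITEMS), esur_hd_total([5] * N_ITEMS))
```

Under this rule the total score lies between 44 and 220, with higher values reflecting higher satisfaction.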
The instrument was applied to a sample of patients (n = 370) undergoing therapy at a hemodialysis program during 2013; each patient was asked by telephone about their willingness to answer the survey, and 6% of them refused to answer. This sample was used to perform an exploratory factorial analysis, a structural-equation confirmatory analysis, an internal consistency assessment, and Rasch analysis. To evaluate convergent validity, the SDIALOR scale [1] was simultaneously applied in a subgroup of patients (n = 70) of the initial sample; this scale consists of 7 domains (organization of medical care, relationship between nephrologists and general practitioner, locational characteristics, accessibility, care provided by the health personnel, information provided by the doctor, problem solving, overall satisfaction) and shows levels of internal consistency above 0.7 in different domains. It was used because it is the only cross-culturally adapted instrument available in Colombia to measure patient satisfaction with care for renal disease [23]. Test-retest reliability was also evaluated by applying the ESUR-HD scale again 2 days after the initial assessment in a subgroup (n = 54) of the 370 patients. This time period was chosen considering the scale length and the recommendation of some authors on applications to assess test-retest reliability [24]. In order to establish the sensitivity to change, the scale was applied to 40 patients before and after an intervention. In other words, the patients' satisfaction was measured in a hemodialysis center and then measured again one (1) month after they were transferred to a new renal clinic within a hospital, with remodeled spaces, waiting rooms, hemodialysis equipment with newer technology, and notably better prepared healthcare personnel who were more familiar with, expert in, and dedicated to the patients' care.
The data from the study and the full instrument may be requested from the principal investigator Mauricio Sanabria: mauricio_sanabria@baxter.com.

Statistical analyses
Considering that the latent dimensions structure of the instrument was purely theoretical, an exploratory factorial analysis was carried out, taking into account the ordinal nature of the variables (each item being rated on a Likert-type scale), using a minimal residuals factorization method on a polychoric correlation matrix. The parallel analysis method [22,25] was applied for determining the number of factors. An orthogonal rotation (Varimax) was used to improve the interpretability of the factors. Structural equations from polychoric correlation matrices and asymptotic covariance matrices were used for the confirmatory factorial analysis (which was done considering the ordinal nature of the items' ratings). Diagonally weighted least squares were used as the estimation method, without assuming normally distributed data. Data fit was assessed for 2 model types: one guided by the exploratory factorial analysis and one suggested by the changes in modification indexes. Criteria for considering a model's fit adequate were as follows [23,26]: ratio of $\chi^2$ to degrees of freedom ($\chi^2/df$) < 3, Tucker-Lewis index (TLI) and comparative fit index (CFI) > 0.9, and root mean square error of approximation (RMSEA) < 0.8. In addition, both the Bayesian information criterion (BIC) and the Akaike information criterion (AIC) were calculated, lower values suggesting better model fit. For estimating the sample size for the factorial analyses with this type of covariance structure, the recommendation of having at least 250 observations was taken into account [24,27].
To evaluate the internal consistency of the scales, factors, and items, Cronbach's alpha was calculated for the whole scale as well as for each domain suggested by the factorial analysis and for the scale after deleting each of the items. A sample of 257 subjects, each answering 44 items, would allow 90% power to detect a difference between an alpha coefficient of 0.6 under the null hypothesis and at least 0.7 under the alternative hypothesis, using a two-tailed test and a 5% significance level [25,28].
For the assessment of test-retest reliability, means of the two (2) measurements were compared using the signed-rank test. In addition, Lin's concordance correlation coefficient was estimated using the values of two (2) repeated measures from each subject. A 54-subject sample allows detecting a difference between coefficients of 0.7 (null hypothesis) and 0.85 (alternative hypothesis) with a 5% significance level and 80% power [26,29].
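As an illustration of the internal consistency calculations described above, the following minimal sketch computes Cronbach's alpha and the alpha obtained after deleting each item. It uses the usual formula alpha = k/(k-1) × (1 − Σ item variances / variance of the total score); the function names and the toy data are hypothetical and are not taken from the study, which used R for these analyses.

```python
from statistics import variance
from typing import List, Sequence


def cronbach_alpha(rows: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for a respondents-by-items table of scores."""
    k = len(rows[0])
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least 2 items and 2 respondents")
    # Sample variance of each item across respondents.
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    # Sample variance of the total (summed) score.
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def alpha_if_item_deleted(rows: Sequence[Sequence[float]]) -> List[float]:
    """Alpha recomputed after dropping each item in turn (needs >= 3 items)."""
    k = len(rows[0])
    return [
        cronbach_alpha([list(row[:j]) + list(row[j + 1:]) for row in rows])
        for j in range(k)
    ]


if __name__ == "__main__":
    # Toy data (hypothetical): 4 respondents, 3 items rated 1..5.
    toy = [[4, 5, 4], [3, 3, 2], [5, 5, 5], [2, 3, 2]]
    print(cronbach_alpha(toy))
    print(alpha_if_item_deleted(toy))
```

The same function applied to the items of a single domain would give the per-domain coefficient. Note that the sketch raises `ZeroDivisionError` if every respondent has the same total score.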
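Lin's concordance correlation coefficient mentioned above can be sketched as follows. The sketch uses the standard definition, 2·s_xy / (s_x² + s_y² + (mean_x − mean_y)²), with variances and covariance divided by n as in Lin's original formulation; the function name `lin_ccc` is hypothetical and the example values are invented for illustration.

```python
from statistics import fmean
from typing import Sequence


def lin_ccc(x: Sequence[float], y: Sequence[float]) -> float:
    """Lin's concordance correlation coefficient for paired measurements."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least 2 values")
    n = len(x)
    mx, my = fmean(x), fmean(y)
    # Variances and covariance with denominator n (Lin's original definition).
    sxx = sum((a - mx) ** 2 for a in x) / n
    syy = sum((b - my) ** 2 for b in y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2 * sxy / (sxx + syy + (mx - my) ** 2)


if __name__ == "__main__":
    first = [180.0, 195.0, 210.0]     # hypothetical first-application totals
    second = [182.0, 197.0, 212.0]    # same ordering, shifted by 2 points
    print(lin_ccc(first, first))      # identical measurements
    print(lin_ccc(first, second))     # a systematic shift lowers concordance
```

Unlike a Pearson correlation, the coefficient is penalized by any systematic shift between the two applications, which is why it is suited to test-retest agreement.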
Convergent validity was evaluated by calculating Spearman correlation coefficients. A 70-subject sample size is adequate, considering values of at least 0.8 with a 95% confidence interval and a ±10 precision around the estimator.
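For readers unfamiliar with the Spearman coefficient used here, it is the Pearson correlation computed on the ranks of the two variables, with tied values receiving the average of their ranks. The following is a minimal sketch under that definition; the helper names are hypothetical and the domain scores are invented.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation (raises ZeroDivisionError if a variable is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation between two paired sets of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least 2 values")
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    esur_domain = [18, 20, 15, 20, 12]      # hypothetical ESUR-HD domain scores
    sdialor_domain = [70, 85, 60, 80, 65]   # hypothetical SDIALOR domain scores
    print(spearman(esur_domain, sdialor_domain))
```

Because only ranks are used, the coefficient does not depend on the two instruments sharing a scoring range.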
To assess sensitivity to change, scores corresponding to repeated measurements were compared by using paired t-tests and a 5% significance level for a two-tailed test. For sample size estimation, a pre- and post-intervention difference of at least 10 points with a 20-point standard deviation, a 5% significance level, and 80% power were assumed; with such assumptions, 40 subjects were required.
Through Rasch analysis, the following aspects were evaluated [27,28,30,31]: reliability indexes for persons and items (values ranging between 0 and 1); separation indexes (values ≥ 2 indicate proper separation); and item-fit statistics (infit and outfit statistics). Items with infit or outfit > 1.4 and corresponding ZSTD values > 2 are considered improperly fit; infit or outfit < 0.6 suggests item redundancy. For the rating scale diagnosis, means, infit and outfit mean squares, and step measures were estimated. The persons-items map distribution was also assessed. For the sample size in Rasch analyses, the recommendation of having at least 250 subjects when using Likert-type scales was followed [29,32].
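The paired comparison described above reduces to a one-sample t statistic on the within-patient differences: t = mean(d) / (sd(d) / √n), with n − 1 degrees of freedom. The sketch below computes only the statistic and its degrees of freedom; the p-value is left to a t table or a statistics package. The function name and the example scores are hypothetical.

```python
from math import sqrt
from statistics import fmean, stdev
from typing import Sequence, Tuple


def paired_t(before: Sequence[float], after: Sequence[float]) -> Tuple[float, int]:
    """Return (t statistic, degrees of freedom) for paired measurements."""
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need at least 2 paired observations")
    # Within-patient change: post-intervention minus pre-intervention score.
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    standard_error = stdev(diffs) / sqrt(n)
    return fmean(diffs) / standard_error, n - 1


if __name__ == "__main__":
    pre = [150.0, 162.0, 171.0, 158.0]    # hypothetical totals before transfer
    post = [161.0, 170.0, 175.0, 172.0]   # hypothetical totals one month later
    t_stat, df = paired_t(pre, post)
    print(t_stat, df)
```

A positive t indicates that scores increased after the intervention; the sketch raises `ZeroDivisionError` if every patient changes by exactly the same amount.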
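The item-fit decision rule stated above can be written as a small classifier. This is a sketch of one reading of the rule: it assumes an item is flagged when either its infit or its outfit mean square exceeds 1.4 together with the corresponding standardized value above 2, and that a mean square below 0.6 on either statistic suggests redundancy. The function name, argument names, and example values are hypothetical.

```python
def classify_item_fit(
    infit_mnsq: float,
    outfit_mnsq: float,
    infit_zstd: float,
    outfit_zstd: float,
    upper: float = 1.4,     # mean-square cutoff for improper fit (from the text)
    lower: float = 0.6,     # mean-square cutoff for redundancy (from the text)
    z_cut: float = 2.0,     # standardized-fit cutoff (from the text)
) -> str:
    """Label an item as 'misfit', 'possible redundancy', or 'acceptable'."""
    # Assumption: infit and outfit are each checked against their own ZSTD.
    infit_bad = infit_mnsq > upper and infit_zstd > z_cut
    outfit_bad = outfit_mnsq > upper and outfit_zstd > z_cut
    if infit_bad or outfit_bad:
        return "misfit"
    if infit_mnsq < lower or outfit_mnsq < lower:
        return "possible redundancy"
    return "acceptable"


if __name__ == "__main__":
    # Hypothetical fit statistics for three items.
    print(classify_item_fit(1.62, 1.10, 2.8, 0.4))
    print(classify_item_fit(0.55, 0.90, -2.3, -0.5))
    print(classify_item_fit(1.02, 0.97, 0.2, -0.1))
```

Keeping the cutoffs as parameters makes it easy to see how sensitive the list of flagged items is to the chosen thresholds.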
Confirmatory factorial analyses were done with the Stata® program; remaining statistical analyses were performed by means of the R program. The trial was carried out according to ethical considerations from the Helsinki Declaration and was approved by an institutional ethics committee. All of the patients gave their informed consent for participating in the trial, in a verbal form.

Exploratory factorial analysis
Two hundred and eight (56.2%) of the 370 surveyed patients were males. Mean age was 57.9 years (SD 16.5). All of the patients were in a hemodialysis program; the sample included patients treated at facilities in all regions of the country: 236 from the central region (63.8%), 55 from the southwest (14.9%), 44 from the northwest (11.9%), 23 from the Caribbean coast (6.2%), seven (7) from the southeast (1.9%) and five (5) from the northeast (1.4%). Patients had a median time spent in renal replacement therapy of 3.4 years (interquartile range = 5.1 years). The main causes of renal disease were diabetes (34.3%, N = 127), hypertension (23.8%, N = 88) and glomerulonephritis (11.1%, N = 41). In 13.2% of patients (N = 49) the cause of kidney disease was unknown. The percentage of patients with Karnofsky scale <50 was 28.8% (N = 77). The median Charlson score was 6 (interquartile range = 6).
According to parallel analysis results, the optimal number of factors to analyze was 9.
The factorial structure showing the best interpretability was that obtained with orthogonal rotation (Table 1).
As may be seen, one of the items ("Quality of the snack supplied at the renal unit") has no adequate loading on any of the domains. Variance ratio for each factor was as follows:

Confirmatory factorial analysis
Goodness-of-fit indicators were calculated for two (2) models: one corresponding to the first-order factorial structure presented in Table 1 and another following removal of the item "Quality of the snack supplied at the renal unit" and incorporation of covariances between some of the items, according to modification indexes outcomes; such indicators are presented in Table 2.
Although the indicators were similar for both models, suggesting an acceptable fit of the 9-domain structure, in model 2 the CFI and TLI values are closer to 0.9, and the RMSEA values and information criteria are lower. The model structure with the best fit (i.e., model 2) is depicted in Fig. 1.

Validity of convergent criteria
Correlation coefficients between the two applied scales (SDIALOR and ESUR-HD) are shown in Table 3, where it may be seen that correlations between the two scales' domains reach low values (the maximum being 0.33). However, all theoretically expected correlations have a positive sign, and one of the highest values corresponds to the domain pair regarding medical care (r = 0.33). The domain with the largest number of correlations significantly different from 0 is the one regarding the admission process (correlated with domains 1, 3, 5, 6, and 7).

Test-retest reliability
Mean time elapsed between the two (2) measurements in 54 patients was two days. Means obtained initially were similar to those obtained in the second measurement (Table 4). There was no significant difference in any mean pair (signed-rank test, p > 0.05). The concordance correlation coefficient for the scale was 0.85. When evaluating reliability within different instrument domains, low values were found for two (2) of them (Table 4): domain 5 (medical personnel) and domain 7 (supplied medications).

Sensitivity to change
Mean scores before and after the intervention (change of renal unit) corresponding to every scale domain, are presented in Table 5. For the 40 patients experiencing such intervention, differences turned out significant in the

Rasch analysis
Information about overall model fit is shown in Table 6. SD values from ZSTD > 2 suggest the presence of improperly fit items. Reliability indexes and those corresponding to persons and items separation for each of the domains are presented in Table 7. Reliability values > 0.57 and > 0.58 were found for items and persons, respectively. Modest separation indexes were found for persons, but indexes were proper for items, which suggests a restricted range of the attribute in this sample of patients. Table 8 shows the fit statistics by weighted (infit) information criterion and extreme values (outfit) criterion for the items of the scale; it may be seen that four (4) of the items show an improper fit ("Snack", "Contact with the administrator", "Full medication delivery", and "Easiness for phone communication").
Average scores (Table 9), which are the mean of the differences between person ability and item difficulty values, show an ascending monotonic trend in each of the domains, except for domains 1 and 9. This suggests that, except for those two domains, patients with the highest levels of satisfaction tend to grant the highest ratings to each item. This is consistent with the finding of fit values by weighted information criterion (infit) and extreme values criterion (outfit) outside the 0.6-1.4 range in the initial categories of the items in those domains. The presence of fit values that are not close to 1, especially for domains 1 and 9, suggests that people with high levels of satisfaction unexpectedly tend to give low ratings to such domains.
Probability curves for each item measuring category are shown in Fig. 2, grouped by domains; it may be seen that in 5 domains category 2 (corresponding to the "unsatisfied" category in the Likert scale) provides no clear discrimination of the underlying feature and might be disregarded.
The higher a patient is on the vertical scale (Fig. 3), the higher the degree of satisfaction. It may be seen that there is a group of patients with high levels of the attribute as well as considerable dispersion in the measurements, especially for patients (range: −0.5 to 6 logits). It is also seen that the means for items and persons (patients) are about 2 logits apart, indicating that the latent feature presented by this group exceeds what may be measured by the scale (the map also reveals a ceiling effect). In addition, there are a couple of items (P3_12_1 and P3_12_3) which do not appear to adequately measure the attribute measured by other items. Other elements highlighted in the map are the strong marker items for the feature (P3_7_2 and P3_8_ in the upper part of the map) and the weak markers (P3_15_4). The distance between items P3_12_1 and P3_12_3 is consistent with their poor fit indicators.

Discussion
Satisfaction with a dialysis service is a multidimensional attribute that in the ESUR-HD scale appears as a structure of 9 factors or domains, adequately reflecting the underlying construct. Instruments designed for other clinical settings or different cultural environments focus on certain aspects or include elements that may not be applicable in every culture. For example, the SDIALOR questionnaire assesses the interaction between the general practitioner and the nephrologist, an element that does not apply in many dialysis services in Colombia. On the other hand, that questionnaire covers in just one domain everything related to the involvement of other healthcare professionals (dietician, social worker and psychologist), while in ESUR-HD five (5) domains refer to this issue. ESUR-HD is an instrument that can be administered by telephone, has a number of items similar to that of other instruments measuring the same construct, takes little time, and can be scored simply (item scores are summed, without resorting to complex transformations or scoring algorithms). Findings resulting from the analyses undertaken to evaluate content validity suggest that the multidimensional structure named "satisfaction" must be measured using instruments that are adequate for cultural particularities and specific service settings. This would render questionable the universal use of a single instrument for measuring satisfaction with dialysis services in different countries.
Although the ESUR-HD scale showed proper internal consistency, which suggests adequate instrument reliability, this finding should be taken with caution, as Cronbach's alpha coefficient tends to increase along with the number of items of an instrument (44 items in ESUR-HD). Consequently, this finding is to be analyzed considering other reliability indicators, such as those resulting from the item response theory approach discussed later.
Regarding convergent validity (measured through the simultaneous application of two (2) instruments aimed at measuring the same construct), it was found that some scores for the scale domains are positively correlated with scores from domains of another instrument measuring the same construct (i.e., SDIALOR). This could support the instrument having adequate concurrent validity; however, overall correlation values were low (the highest one being 0.33), and some apparently equivalent domains were not correlated. For example, although there was a positive correlation significantly different from 0 between the "Medical care" domain in SDIALOR and "Medical attention" in ESUR-HD (r = 0.33), for the domains "Facility and environment" in SDIALOR and "Facilities and service organization" in ESUR-HD the correlation was 0.16 (which is not significantly different from 0). There was also a positive correlation different from 0 between the "Paramedic care" domain in SDIALOR and "Head nurse attention" in ESUR-HD (r = 0.26), but no correlation significantly different from 0 was found between "Paramedical care" in SDIALOR and "Nursing assistants care" or "Nutrition care" in ESUR-HD. These findings correspond to low convergent validity and could simply reflect different latent variable structures in the two questionnaires, or that the two instruments measure the attribute from perspectives which do not precisely coincide (it must be borne in mind that SDIALOR is a more general instrument, as it also includes issues related to peritoneal dialysis). The described findings would support satisfaction being a construct strongly influenced by cultural and local service particularities; another possible explanation for this finding is that the SDIALOR instrument does not possess adequate psychometric properties when used in Colombia. 
Although it is the only instrument that has been cross-culturally adapted to measure satisfaction in renal patients in this country, this does not guarantee that it has proper validity and reliability for measuring a complex attribute such as satisfaction; this means that in future studies on the psychometric properties of ESUR-HD we should consider evaluating the validity of the instrument using other scales with recognized measuring qualities [33].
The stability of scale scores in repeated measurements and the concordance correlation coefficient of 0.85 obtained under construct stability also support adequate reliability of the assessed instrument. These findings suggest that the overall variation of the instrument is mainly explained by the real variability of the construct being evaluated (patients' satisfaction) and not by error. However, it is possible that the time between the two measurements (2 days) was too short and that the patients, rather than answering each item afresh, reported the value they remembered from the first application. Low correlation values for the domains regarding medical care and supply and quality of medications may suggest these are less stable elements in the process of care of hemodialysis patients. The design used for evaluating the instrument's ability to detect changes showed that scale scores increased in most domains. This finding is consistent with the intervention performed: offering a group of patients the service in an enhanced facility with personnel changes, which implied a better service. The finding of differential changes depending on the domain (there were significant differences in total scores and in scores regarding facilities and service organization as well as medical care and contact and social work personnel) suggests that instrument scoring must be done considering the latent variable structure, as this strategy detects more specific change levels.
According to findings related to item response theory (Rasch model), the sample of patients used for validating the scale showed very high levels of the feature (there is a 2-logit difference between the means for item difficulty and patient ability; a ceiling effect of the measurement may be argued). This is consistent with the finding of low person separation indexes as compared to item separation indexes. Rasch analysis findings also suggest that, although the instrument may appropriately measure attribute levels in patients with lower degrees of satisfaction, in patients with attribute levels as high as those found in the sample it might not adequately discriminate between different attribute grades. In order for the instrument to have this property, additional items would have to be included; doing so would demand a qualitative approach involving patients and other people associated with healthcare services. Previous studies have also had difficulties regarding a ceiling effect when measuring satisfaction among this kind of patients [20,34]. In such cases, strategies such as increasing each item's answer options and score normalization have been used [30]; however, using qualitative approaches to evaluate these constructs in patients reporting optimal experiences has also been proposed [20]. As a result of the findings from Rasch analysis in our study, we consider that the most appropriate approach for improving the instrument might be incorporating other items that better cover the range of satisfaction found in the sample of renal replacement therapy patients. Another finding from Rasch analysis is associated with the improperly fit items: the worst fit statistics were those regarding the items "Quality of the snack offered at the renal unit" and "Contact with the unit administrator". 
Although the item regarding quality of the snack provided to patients is relevant in other instruments (SEQUS, SDIALOR), according to results from Although other improperly fit items were detected, removing them from the instrument was not considered since the classical measurement theory approach did not diagnose them as problematic (resorting to Rasch analysis results alone for removing an item from an instrument has not been recommended) [31,35]. The scale has no redundant items (items with a proper fit, measuring the same attribute in a similar way). Items best representing the underlying dimensions (i.e., high scores on them probably reflect high levels of satisfaction) are "Timely medications delivery date" and "Supplies quality and reliability." On the other hand, the item evaluating the "Easiness for phone communication with the renal clinic" is a weak marker of the attribute (even patients with low satisfaction may grant it a high score). Regarding the measurement scale of the items, it was found that it discriminates adequately among different intensity levels of the attribute, but may be reduced by removing the "unsatisfied" option, as in several domains this category does not properly discriminate the attribute intensity. We note the following limitations of our study:
1. The ceiling effect has an impact on the ability of the instrument to differentiate patients with high levels of the attribute. This can be problematic when assessing sensitivity to change, given the weight that this scenario would have on the phenomenon of regression to the mean.
2. The time interval used for evaluating test-retest reliability may have favored the finding of high levels of correlation, which may not necessarily reflect the reliability of the construct.
3. Concurrent validity could be affected by the use of an instrument whose psychometric properties are not clearly known in Colombia.

Conclusion
According to the results of the present study, the ESUR-HD scale measures patients' satisfaction with hemodialysis therapy as a 9-domain construct. The 44-item version includes a measuring scale that must be adjusted by removing the "unsatisfied" category and deleting 2 items showing an improper fit ("Quality of the snack offered at the renal unit" and "Contact with the unit administrator"). The instrument showed acceptable validity and reliability; in addition, it was able to detect changes in the construct following an intervention that improved the patients' satisfaction. Using an item measurement scale with just 4 categories allows adequate detection of different attribute levels. Including new items that improve discrimination between high satisfaction levels in subsequent scale versions is recommended.