The finding of more adverse events in the department with the best safety culture was unexpected, and calls into question the reliability and validity of the tools used for measuring the patient safety culture and the adverse events.
Except for the culture at the hospital level in department 1, the overall safety culture in the departments was satisfactory. In a database with results from 1128 hospitals and 567,703 hospital staff respondents reported by the Agency for Healthcare Research and Quality, the mean positive response rates at the unit and hospital level were 64.7% and 58.3% respectively, and values more than 5% above or below these means were judged as statistically significant [28]. Compared with this database, the positive response rate at the hospital level in department 1 was unsatisfactory (37.1%) and the positive response rate at the unit level in department 2 was very good (71.5%). The low response rate in department 1 might have, for unknown reasons, selected participants who were critical of the culture.
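The benchmark comparison above can be sketched as follows. This is a minimal illustration and not the analysis used in the study: it assumes that the ±5% threshold is measured in percentage points around the database means, and the function name and the category labels are hypothetical.

```python
# Benchmark means (percent positive responses) quoted from the AHRQ database [28].
BENCHMARK = {"unit": 64.7, "hospital": 58.3}

# Assumption: the +/-5% threshold is read as percentage points around the mean.
THRESHOLD = 5.0


def compare_to_benchmark(level: str, observed: float) -> str:
    """Classify an observed positive response rate against the benchmark mean."""
    diff = observed - BENCHMARK[level]
    if diff > THRESHOLD:
        return "above"
    if diff < -THRESHOLD:
        return "below"
    return "within"


# The two values singled out in the text.
print(compare_to_benchmark("hospital", 37.1))  # department 1, hospital level
print(compare_to_benchmark("unit", 71.5))      # department 2, unit level
```

Under this reading, the hospital-level rate in department 1 lies 21.2 percentage points below the benchmark mean and the unit-level rate in department 2 lies 6.8 percentage points above it, so both fall outside the ±5 band.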
The high response rates in department 2 and the participants’ favourable responses to the HSOPSC questionnaire might reflect the participants’ motivation for high-quality work and adherence to procedures and requests. They might unconsciously have given the “correct” answers. They also judged the patient safety as better and reported fewer adverse events than participants in department 1, despite the finding of more adverse events in department 2. The Dunning-Kruger effect, described as “difficulties in recognizing one’s own incompetence lead to inflated self-assessments”, could explain the inverse association between the culture and adverse events [29].
The prevalence of adverse events differed significantly between the departments. The GTT focuses mainly on adverse events in surgical departments and emergency settings. The evaluation for use in medical departments has not been equally good. Compared to the prevalence of adverse events published from other hospitals, which has been on the order of 4-17%, the prevalence in department 2 was unexpectedly high [1-3,30]. The estimates of adverse events with the GTT vary between analysing teams and probably depend on the patient record system [12,13]. In this study, one team analysed all patient records and the departments used the same electronic patient record system. The GTT retrieves only recorded adverse events. A higher awareness of adverse events in department 2 might have resulted in better recording of minor events, which could explain the differences between the departments.
The safety culture is only one of 20 factors mentioned as influencing clinical practice, and the association between the patient safety culture and adverse events seems to be marginal [10,16-18]. Studies and reviews conclude that research problems are related to the definition and observation of adverse events, question the implications and generalizability of the results, and doubt the causal relationship between the culture and adverse events [16-18].
The psychometric properties of the tools for measuring patient safety culture and adverse events are of vital importance for the interpretation of the results, but not all psychometric properties of these questionnaires have been satisfactorily documented [14,31]. In addition, extending their use outside the context (geographical region and healthcare system) in which they were developed requires new validation [15]. Criterion validity (the relation between the measurement and some other variable) and responsiveness (the ability to detect changes within groups) are important properties that have not been satisfactorily studied [14,31-33].
Patient safety (harm), not “culture”, is the most important criterion to be predicted by the patient safety culture surveys. Studies often report self-reported patient safety outcomes such as procedures and behaviour, and not independent measurements of adverse events [17,32,33]. This study demonstrated that the self-reported evaluation of patient safety differed from independently measured adverse events. The department with the highest self-appraised patient safety had the highest prevalence of adverse events. The results indicate poor criterion validity of the measurement of patient safety culture. A review of psychometric properties of health-related questionnaires concluded that criterion validity was rarely reported [31]. Reviews of the psychometric properties of patient safety culture have reported no or only a moderate association between the culture and patient outcomes, and are uncertain about the causal relations and the responsiveness [14,17,18,33]. Studies claiming satisfactory criterion validity have used inappropriate criteria closely associated with measurement of the culture, such as data collected by a questionnaire to the same personnel about working behaviour, involvement in safety activities, micro accidents, minor injuries, near-misses, compliance with safety rules and procedures, safety initiatives, safety compliance, safety participation, risk taking, rule breaking etc. [17,32]. In this study, the recordings of the patient safety culture and of the adverse events were completely independent of each other. The study indicates that comparisons of the patient safety culture across departments do not allow conclusions about differences in the “true” safety in the departments. This study and critical reading of the literature show that the criterion validity of surveys on patient safety culture is insufficiently documented for patient harm [34].
Therefore, surveys on the patient safety culture should not be used as proxies of the “true” patient safety until the criterion validity is better documented.
The GTT aims at measuring the prevalence of harm and changes over time [8]. The method has been judged as both appropriate and inappropriate for the purpose [11-13,30,35]. Most triggers are related to surgical procedures, and most evaluations have been performed in surgical and emergency units. The triggers in the Norwegian version of the GTT have never been evaluated for medical departments. The results will probably depend on the medical record system and the way events are recorded. Since the GTT never detects all adverse events and the proportion detected is unknown, the results do not indicate the true prevalence of adverse events. An important weakness is the large inter-rater variability. Studies have shown Cohen's kappa coefficients ranging from 0.26 to 0.77, a prevalence of adverse events ranging between teams from 27.2 to 99.7 per 1000 hospital days, and that only 31% of adverse events were identified by two different teams [11-13,35]. The random error in these studies was large, and the sensitivity for detection of adverse events for a local team was 49% of the prevalence of an expert team [11,13]. Conclusions about the usability of the GTT vary enormously, from recommendation to avoidance [11-13]. These results reveal major problems related to registration of adverse events, and demonstrate that the GTT is probably inappropriate for comparisons between units, departments, and hospitals and as an indicator of the true prevalence of adverse events. The GTT might be suitable for tracking changes in adverse events over time, provided that the measurements are performed in one single unit, by the same experienced team, with the same patient record system and a stable staff recording the events in the same way. This use of the GTT needs evaluation in studies with a focus on intra-rater reliability and responsiveness. The GTT is, nevertheless, better than self-reported measurements of adverse events [33].
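For readers unfamiliar with the agreement statistic cited above, the following is a minimal sketch of how Cohen's kappa is computed for two review teams classifying the same patient records. The function name and the example ratings are hypothetical and are not data from this study or from the cited studies; 1 denotes "adverse event identified" and 0 denotes "no adverse event identified".

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same set of records."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of records")
    n = len(rater_a)

    # Observed agreement: share of records given the same label by both raters.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n

    # Chance agreement: product of the raters' marginal proportions, summed over labels.
    labels = set(rater_a) | set(rater_b)
    expected = sum(
        (rater_a.count(label) / n) * (rater_b.count(label) / n) for label in labels
    )

    if expected == 1.0:
        return 1.0  # both raters used one and the same label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of ten records by two teams (illustration only).
team_a = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
team_b = [1, 0, 0, 0, 1, 0, 1, 0, 0, 0]
print(round(cohens_kappa(team_a, team_b), 2))
```

In this made-up example the teams agree on 7 of 10 records, but much of that agreement is expected by chance, so kappa is considerably lower than the raw agreement of 0.7. This is why raw agreement alone can overstate the reliability of record review.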
Strengths and limitations
The rather small size of this study and the low response rate in one department reduce the reliability and render new and larger studies necessary. Valid information about associations between patient safety culture and adverse events requires studies with more participants in more than two departments, and the registration of adverse events over longer periods. Nevertheless, the unexpected result in this study calls attention to the lack of knowledge related to the measuring tools. It strengthens the study that the measurement of the culture and registration of adverse events were performed independently of each other, that one trained team performed all the GTT measurements, and that the departments were parts of the same hospital trust with the same patient record system and many common routines.
The number of patient records screened with the GTT was lower than planned in the protocol. Since the difference in adverse events between the departments was larger than presumed, this has probably not significantly influenced the results.