In this study, we have shown that the reproducibility (Kappa) of the presence of any trigger in deceased patients in the hospital was 0.5 (95%CI 0.34–0.66). The average Kappa of individual triggers was 0.42 (range 0–0.78). Our overall Kappa of 0.5 (moderate agreement according to the classification of Landis and Koch) appears to be slightly lower than the results of other studies, which reported a range of 0.49–0.76 [3, 4, 21, 25,26,27,28]. However, our Kappa was in the same range as that of three Dutch reports that included results of screening for triggers in a sample of cases in 21 hospitals [3, 4, 21]. Naessens et al. (2010) and Ock et al. (2015) evaluated the inter-rater reliability of individual triggers selected from the HMPS study, the IHI trigger system, or both. Four of the triggers investigated by Naessens et al. were comparable to ours; half of these showed higher Kappa agreement in our study than in theirs. Two of the three comparable triggers in the study of Ock et al. (2015) performed better in their study than in ours [29, 30]. However, these populations were again sampled from living non-pediatric inpatients.
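For reference, the Kappa statistic discussed here can be computed from a 2 × 2 agreement table of two screening rounds. The sketch below uses hypothetical counts for illustration only, not the data of this study:

```python
def cohens_kappa(table):
    """Cohen's kappa for a 2x2 agreement table.

    table[i][j] = number of records where rater 1 scored i and rater 2
    scored j (0 = no trigger, 1 = any trigger present).
    """
    n = sum(sum(row) for row in table)
    p_o = sum(table[i][i] for i in range(2)) / n  # observed agreement
    row = [sum(table[i]) / n for i in range(2)]   # rater 1 marginals
    col = [sum(table[i][j] for i in range(2)) / n for j in range(2)]  # rater 2 marginals
    p_e = sum(row[k] * col[k] for k in range(2))  # agreement expected by chance
    return (p_o - p_e) / (1 - p_e)

# Hypothetical counts for illustration (not the study's data):
example = [[70, 10],
           [8, 12]]
print(round(cohens_kappa(example), 2))  # 0.46
```

Note how the chance-agreement correction pulls the coefficient well below the raw observed agreement of 82% in this example, the same mechanism that makes Kappa sensitive to trigger prevalence.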
Concerning the average observed agreement for individual triggers, Unbeck et al. (2014) reported an average reproducibility of 46%, compared with 67% in our study. Their total agreement for any trigger present was 65.0%, compared with 90% in our study [31]. Regrettably, Unbeck et al. did not report the performance of the triggers at an individual level, and they studied only living pediatric inpatients. This is therefore the first study to investigate the performance of the individual triggers of the HMPS trigger system solely in deceased patients. Not surprisingly, objective triggers were more reproducible than subjective triggers.
The Kappa coefficient is influenced by the prevalence of the condition and by rater bias. Therefore, we also calculated the prevalence-adjusted bias-adjusted Kappa (PABAK). This improved the reliability score, resulting in moderate to almost perfect inter-rater reliability for the individual triggers. An exception was the trigger concerning other patient complications, which showed almost no improvement, with an end result still well below moderate reliability. These calculations suggest that the low Kappa values were driven mainly by the low prevalence of triggers.
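For a binary rating, the PABAK correction reduces to a simple function of the observed agreement, which is why it is insensitive to prevalence and bias. A minimal sketch, applied here to the study's reported overall observed agreement of 90% for any trigger present:

```python
def pabak(p_observed):
    """Prevalence-adjusted bias-adjusted kappa for a binary rating:
    PABAK = 2 * observed agreement - 1."""
    return 2 * p_observed - 1

# Overall observed agreement for "any trigger present" reported in this study:
print(pabak(0.90))  # 0.8
```

Because PABAK depends only on the observed agreement, comparing it with the conventional Kappa isolates how much of a low Kappa is attributable to skewed prevalence rather than to genuine disagreement.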
Obviously, the performance of the trigger system is important: it should not miss records with serious and potentially preventable AEs, and it should preferably not select records without AEs. Because trigger systems are used to reduce the burden of scrutinizing all records, records without triggers are not reviewed, which means that important AEs could be missed. The fact that new cases with triggers were found in the second round supports this concern.
Because we drew a random sample of records, we believe that the estimated sensitivity and specificity approach the true values. When we apply the values found in this study to the entire population of deceased patients in our hospital, the false negative rate would be 8%.
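The arithmetic behind such an extrapolation can be sketched as follows. The specificity of 58% is the value reported in this study; the sensitivity and AE prevalence below are hypothetical placeholders for illustration only:

```python
def screening_outcomes(sens, spec, prevalence, n):
    """Expected counts when a trigger screen with the given sensitivity and
    specificity is applied to n records with the given AE prevalence."""
    pos = n * prevalence       # records that truly contain an AE
    neg = n - pos              # AE-free records
    tp = sens * pos            # AE records flagged by at least one trigger
    fn = pos - tp              # AE records missed (no trigger found)
    tn = spec * neg            # AE-free records correctly not flagged
    fp = neg - tn              # AE-free records still flagged for review
    return {"tp": tp, "fn": fn, "tn": tn, "fp": fp,
            "false_negative_rate": fn / pos}

# Specificity 58% from this study; sensitivity and prevalence hypothetical:
result = screening_outcomes(sens=0.92, spec=0.58, prevalence=0.10, n=1000)
print(result)
```

The `fp` count makes the practical cost of a low specificity explicit: every false positive is a record that must be scrutinized without an AE being found.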
The high sensitivity of the system in finding cases with an AE was rather comforting. In contrast, the specificity of this trigger system was rather low (58%) compared to most other studies [22, 32,33,34,35]. This results in a substantial number of cases that have to be scrutinized without an AE being found. However, comparable results were presented by Howard et al. (2017), and our results were slightly better than those of Neubert et al. (2006), Eggleton et al. (2014) and Matlow et al. (2011) [36,37,38,39].
The variability in triggered cases, together with the low Kappa, suggests unfavourable characteristics of this system. It possibly results in considerable time wasted by costly specialists scrutinizing records in which no AE is found. Possible ways to improve efficiency include using more reproducible triggers (such as the objective ones), combining triggers with patient characteristics, or fully computerized trigger detection by “data mining” software [23]. Before implementing such adaptations, we suggest thorough research into their exact performance and the costs of finding preventable AEs. At the moment, however, no better systems for case selection are available.
Among the strengths of our study is the fact that the nurses were blinded to the results of the first trigger round. Furthermore, in our system there were no time limitations while searching for triggers. We therefore assume that cases were investigated thoroughly and completely, which keeps the possibility of a missed trigger as low as possible.
A limitation of our study is the small, randomly selected sample of all previously screened cases. However, this sample was large enough to estimate the true proportion of triggers. Another issue could be the selection of deceased patients. Some studies report that a focus on deaths may not be the most efficient approach, or an unsuitable indicator for comparing the quality of hospitals [11, 40]. Yet mortality is the event caretakers and patients want to prevent. Of course, departments with low mortality, or those that treat non-life-threatening diseases, such as ENT and dermatology, will rarely hear about their AEs from this type of medical record review. As several studies show, AEs do not have to result in death [2, 26, 41]; they can also cause temporary or permanent injury. Triggers are indications for all AEs, not only those that cause death. Hence, another chart review system could be more applicable to those departments. Finally, trigger 1 was changed during the period in which we selected cases for this study, which could have influenced the results. However, trigger 1 was found more often in the second screening round, whereas we expected it less often because we had shortened the time window that makes this trigger positive. We therefore do not think this materially influenced our results.
Interestingly, more triggers were found during the second round; trigger 15 in particular was significantly more often present and accounted for most of the difference. In our opinion, supported by a p-value < 0.0001 from McNemar’s chi-square test, this cannot be attributed to chance alone. We suspect that extra attention among the nurses, because the second round of review was part of a study, may have contributed to this. One could also suspect an increase in experience, although our nurses had worked in a stable team for many years and the cases were selected from a recent period. However, some of the cases that were triggered the first time were not found in the second round. Although recalling the results of a specific case could have introduced bias, we found only 12 cases that were checked by the same nurse, and excluding these 12 cases did not significantly change the results.
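McNemar’s test, as used here, depends only on the discordant pairs between the two rounds. A minimal sketch with hypothetical discordant counts (not the study’s data):

```python
import math

def mcnemar(b, c, correction=True):
    """McNemar's chi-square test for paired binary screening results.

    b = cases triggered only in round 1, c = cases triggered only in round 2.
    Returns the chi-square statistic (1 df, with optional continuity
    correction) and its p-value.
    """
    diff = abs(b - c) - (1 if correction else 0)
    chi2 = diff ** 2 / (b + c)
    # Survival function of the chi-square distribution with 1 df:
    # P(X >= chi2) = erfc(sqrt(chi2 / 2))
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Hypothetical discordant counts for illustration only:
chi2, p = mcnemar(b=5, c=40)
print(round(chi2, 1), p < 0.0001)
```

Concordant cases (triggered in both rounds or in neither) drop out of the statistic entirely, which is why the test is appropriate for detecting a systematic shift between the two screening rounds.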
We realise that we analysed only a small part of the complete process of looking back at our proceedings, identifying the essential parts, developing new solutions, and applying them in future care. Moreover, there is no information about the performance of this trigger system in improving health care. Nevertheless, we think it is important to increase knowledge about these components in order to optimise care in the end.