Studies using administrative data rely on the accuracy of the ICD-9 diagnostic codes. This study was conducted to validate the administrative diagnoses of four key behavioural variables often used in mental health and health services research by comparing them to the presence of the corresponding conditions in chart notation.
Overall agreement and specificity were generally high across all behavioural variables. Sensitivity, however, was substantially lower than optimal (< 75%) for all four variables, and was particularly low for suicide attempt. Sensitivity was consistently highest in the suicide sample, followed by the antidepressant new user sample, and lowest in the control sample, likely because patients with greater severity or changes in severity have more visits, more service use and more chart data. We note that, because of the low prevalence of suicide attempts, the sensitivity estimates for suicide attempt are generally less precise (e.g., the one-sided 95% upper confidence limit for sensitivity in the control sample is 63%). Negative predictive values of administrative codes for behavioural variables were generally high, although positive predictive values varied. Positive predictive values of administrative codes were 72-94% for alcohol problems and 89-95% for tobacco use. However, for illicit drug use, PPV was only 48-65%.
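For readers less familiar with these measures, the following minimal Python sketch shows how each is obtained from a 2x2 table that cross-classifies the administrative code against chart notation. The counts are hypothetical and are not taken from this study; the names `Validity` and `validity_measures` are introduced here for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Validity:
    sensitivity: float
    specificity: float
    ppv: float
    npv: float
    agreement: float


def validity_measures(tp: int, fp: int, fn: int, tn: int) -> Validity:
    """Validity of an administrative code, treating chart notation as the reference.

    tp: code present, chart condition present
    fp: code present, chart condition absent
    fn: code absent,  chart condition present
    tn: code absent,  chart condition absent
    """
    total = tp + fp + fn + tn
    return Validity(
        sensitivity=tp / (tp + fn),   # share of chart-positive patients who carry the code
        specificity=tn / (tn + fp),   # share of chart-negative patients without the code
        ppv=tp / (tp + fp),           # share of coded patients who are chart-positive
        npv=tn / (tn + fn),           # share of uncoded patients who are chart-negative
        agreement=(tp + tn) / total,  # overall agreement between code and chart
    )


# Hypothetical counts for a low-prevalence condition (NOT study data).
example = validity_measures(tp=30, fp=10, fn=30, tn=930)
print(example)
```

With these illustrative counts, overall agreement is 96% even though only half of the chart-documented cases carry the code, which is the pattern of high agreement with low sensitivity described above.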
Kashner et al. compared medical charts and administrative records for a random sample of 414 VHA inpatient discharges between July 1 and September 30, 1995 and found 93.7% agreement for alcohol dependence syndrome and 95.2% for drug dependence. Our findings of 90.1% agreement for alcohol dependence and 96.1% agreement for drug dependence are similar. Their study did not report sensitivity and specificity; however, based on data presented in the paper, their sensitivity and specificity for alcohol dependence were 69.4% and 95.5%, respectively, and for illicit drug dependence were 72.1% and 96.6%, respectively. The high specificity is similar to our results, but the sensitivity is higher than that found in our study. This higher sensitivity in Kashner et al. may be due to basing the study on inpatient discharges rather than the more comprehensive data available from chart review. A more recent Canadian study based on 4,008 randomly selected patients admitted from January 1 to June 30, 2003 at four teaching hospitals in Alberta compared ICD-9 based diagnoses against chart diagnoses and reported 53.6% sensitivity and 99.1% specificity for alcohol abuse and 55.3% sensitivity and 99.0% specificity for drug abuse. These findings are similar to ours, except that our specificities were slightly lower (weighted accuracy of 97.2% for alcohol abuse and 98.5% for drug abuse).
The lower-than-desirable coding of these variables, and in particular of suicide attempt, might be anticipated. However, numerous studies have used these variables as covariates or even as primary endpoints. Unfortunately, if misclassification is such that a large proportion of these behavioural variables (e.g., suicide attempts) are missed, it will not only lead to under-estimation of the prevalence of the particular condition, but may also affect the effect size estimates of interest. In addition, when the accuracy of classification differs across subgroups, the resulting systematic bias can mask a true association or create a spurious one, depending on the study design. For example, if suicide attempt is identified more accurately in drug users than in non-drug users, this differential accuracy may lead to a spurious association between drug use and suicide attempt. Increasing the sample size will not eliminate such biases.
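The drug use example can be made concrete with a small calculation. The sketch below assumes hypothetical values (a true attempt risk of 5% in both groups, perfect specificity, and sensitivities of 60% and 30%); none of these figures come from this study, and `observed_risk` is an illustrative helper.

```python
def observed_risk(true_risk: float, sensitivity: float, specificity: float) -> float:
    """Expected proportion coded as positive, given the true risk and code accuracy."""
    return sensitivity * true_risk + (1.0 - specificity) * (1.0 - true_risk)


# Hypothetical scenario: the true risk of suicide attempt is identical in both groups.
TRUE_RISK = 0.05

# Assumed differential accuracy: attempts are coded more completely in drug users.
obs_drug_users = observed_risk(TRUE_RISK, sensitivity=0.60, specificity=1.0)
obs_non_users = observed_risk(TRUE_RISK, sensitivity=0.30, specificity=1.0)

true_rr = TRUE_RISK / TRUE_RISK              # 1.0: no real association
observed_rr = obs_drug_users / obs_non_users  # about 2.0: spurious association

print(f"true RR = {true_rr:.2f}, observed RR = {observed_rr:.2f}")
```

Because the calculation works with expected proportions rather than sampled counts, the distortion does not depend on the number of patients, which is why a larger sample narrows the confidence interval around the biased estimate without removing the bias.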
Assuming that chart diagnosis is the gold standard, the generally high specificity means that false positives are unlikely to cause administrative data to over-estimate prevalence. On the other hand, the low sensitivity indicates that administrative data-based diagnoses are likely to under-estimate prevalence, and this was seen across all four behavioural diagnoses.
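This reasoning follows from the standard relation between true and apparent prevalence. In the notation below (introduced here for illustration, not used elsewhere in the paper), $p$ is the prevalence according to chart notation, $p^{*}$ is the prevalence according to administrative codes, and $Se$ and $Sp$ are the sensitivity and specificity of the codes.

```latex
\begin{align}
  p^{*} &= Se \cdot p + (1 - Sp)(1 - p) \\
  \intertext{When specificity is close to 1, the false-positive term vanishes:}
  p^{*} &\approx Se \cdot p \;\le\; p
\end{align}
```

As a purely hypothetical illustration, with $Se = 0.50$, $Sp = 0.99$ and $p = 0.10$, the apparent prevalence is $p^{*} = 0.05 + 0.009 = 0.059$, a little over half of the chart-based value.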
Although neither low sensitivity nor low specificity is desirable, conclusions based on variables with low sensitivity and high specificity are likely to be less affected than conclusions based on variables with low specificity and high sensitivity. In studies where variables with low specificity are used, false positives will likely bias the estimates of the effects of interest, whether the variables are used as endpoints or as primary predictors. However, in studies where variables with low sensitivity are used as primary endpoints, the main consequence is reduced statistical power due to under-identified events. Similarly, in studies where these variables are used as predictors or covariates, their predictive power will be compromised, and thus any adjustments (for selection bias, for example) will not be as effective.
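The contrast between the two situations can be checked numerically for a binary endpoint. The sketch below assumes non-differential misclassification (the same accuracy in both comparison groups) and uses hypothetical risks of 4% and 2%; the helper names are illustrative and the figures are not study results.

```python
def observed_risk(true_risk: float, sensitivity: float, specificity: float) -> float:
    """Expected proportion coded as positive, given the true risk and code accuracy."""
    return sensitivity * true_risk + (1.0 - specificity) * (1.0 - true_risk)


def observed_risk_ratio(risk_exposed: float, risk_unexposed: float,
                        sensitivity: float, specificity: float) -> float:
    """Risk ratio expected when the endpoint is measured with the same error in both groups."""
    return (observed_risk(risk_exposed, sensitivity, specificity)
            / observed_risk(risk_unexposed, sensitivity, specificity))


# Hypothetical true risks: the true risk ratio is 2.0.
RISK_EXPOSED, RISK_UNEXPOSED = 0.04, 0.02

# Low sensitivity, perfect specificity: half the events are missed in each group.
rr_low_sens = observed_risk_ratio(RISK_EXPOSED, RISK_UNEXPOSED,
                                  sensitivity=0.5, specificity=1.0)

# Perfect sensitivity, modestly imperfect specificity: false positives dilute both groups.
rr_low_spec = observed_risk_ratio(RISK_EXPOSED, RISK_UNEXPOSED,
                                  sensitivity=1.0, specificity=0.9)

print(f"low sensitivity: RR = {rr_low_sens:.2f}")
print(f"low specificity: RR = {rr_low_spec:.2f}")
```

Under these assumptions the risk ratio is preserved when only sensitivity is low, although fewer events are observed and power falls, whereas even a 10% false-positive rate pulls the risk ratio from 2.0 toward the null. The result applies to the ratio measure; a risk difference would be attenuated in both cases.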
There are limitations to this study. Our study used data from the 12 months prior to the index date, and a greater number of visits or longer inpatient stays within those 12 months are likely to yield more information in both charts and administrative databases. Thus our results do not necessarily generalize to the level of agreement for a single visit or a single inpatient stay. Our results may also not be fully generalizable to patients without a depression diagnosis, to care delivered outside of the VHA, or to care delivered during other time periods within the VA. We also note that the time period of this study precedes multiple clinical initiatives the VHA has taken to increase the detection of suicidal behaviour and reduce suicide risk. Clinical reminders requiring screening for tobacco use are in the developmental stage in the VHA, and reminders for alcohol abuse/dependence (based on the AUDIT questionnaire) started nationally in 2008. The VHA system potentially has fewer financial incentives to promote full diagnostic coding than many private sector settings, although the VHA allows up to 10 diagnostic fields for each encounter and has an electronic medical record that makes recording of conditions simple for busy clinicians, potentially enhancing the completeness of coding at each visit.
Another limitation is the lack of a true gold standard for these conditions. Both the chart notations and the administrative diagnostic codes are limited to events that come to the attention of VHA providers; thus medical records are a gold standard only in terms of recognized and diagnosed disorders that a clinician recorded. For substance use disorders, actual prevalence would likely be much higher if validated diagnostic instruments were used. Many persons with such disorders are not identified and not treated. Similarly, if a patient presents to an outside ER after a suicide attempt, this would not be captured within the VHA record unless the patient subsequently reported the event to a medical or mental health provider. The goal of the study, however, is not to validate the administrative ICD-9 codes for suicide attempts and three substance use diagnoses against the true diagnosis, but to validate them against chart notation data, which are a more accurate but more expensive source (though less expensive than surveys) of behavioural disorder diagnoses in typical health services research studies.
Despite these limitations, a strength of our study is that it is based on samples drawn from complete nationwide records for all VHA patients, where all billing for patient care, even for specialists, occurs through the computer. We also note that our sampling was designed to represent patients across regions, years and gender, and thus carefully represents the depression cohort at the VHA across regions over 5 years. Most importantly, to our knowledge this is the first study in which agreement between suicide attempts determined by chart notation and by E-codes was evaluated.