Association between Radiologists' Experience and Accuracy in Interpreting Screening Mammograms

Background Radiologists have been observed to differ, sometimes substantially, both in their interpretations of mammograms and in their recommendations for follow-up. The aim of this study was to determine how factors related to radiologists' experience affect the accuracy of mammogram readings. Methods We selected a random sample of screening mammograms from a population-based breast cancer screening program. The sample was composed of 30 women with histopathologically confirmed breast cancer and 170 women without breast cancer after a 2-year follow-up (cancers were oversampled). These 200 mammograms were read by 21 radiologists who routinely interpreted mammograms, with different amounts of experience, and by seven readers who did not routinely interpret mammograms. All readers were blinded to the results of the screening. An assessment was considered positive when BI-RADS III, 0, IV or V was reported (additional evaluation required). Diagnostic accuracy was measured by sensitivity and specificity. Results Average specificity was higher in radiologists who routinely interpreted mammograms than in those who did not (66% vs 56%; p < .001). Multivariate analysis based on routine readers alone showed that specificity was higher among radiologists who followed up cases for which they recommended further workup (feedback) (OR 1.37; 95% CI 1.03 to 1.85), those spending less than 25% of the working day on breast radiology (OR 1.49; 95% CI 1.18 to 1.89), and those aged more than 45 years (OR 1.33; 95% CI 1.12 to 1.59); the average annual volume of mammograms interpreted, classified as more or less than 5,000 mammograms per year, was not statistically significant (OR 1.06; 95% CI 0.90 to 1.25). Conclusion Among radiologists who read routinely, volume is not associated with better performance when interpreting screening mammograms, although specificity was lower in radiologists not routinely reading mammograms. Follow-up of cases for which further workup is recommended might reduce variability in mammogram readings and improve the quality of breast cancer screening programs.


Background
Breast cancer screening shows wide interobserver (radiologist) variability in the interpretation of screening mammograms [1][2][3]. This variability depends, among other factors, on the protocols for mammogram reading, the specific characteristics of each patient and breast and, to a large extent, on the radiologist's experience. Radiologists have been observed to differ, sometimes substantially, both in their interpretations of mammograms and in their recommendations for follow-up. Therefore, variability in mammogram reading may adversely affect the quality of screening programs by affecting recall rates, which may be low (undetected tumors or diagnostic delay) or high (provoking anxiety in women, false positives, and increased costs) [4][5][6].
Attempts have been made to explain variability among radiologists by experience-related factors, such as annual reading volume [7][8][9][10][11]. However, few studies have analyzed in depth and integrated into a single analysis several possible predictive factors related to radiologists' experience that could determine probable causes of the variability observed in mammogram interpretation (beyond annual reading volume) and that could help to improve the quality of screening programs.
The aim of this study was to determine the extent to which a series of experience-related factors affects the accuracy of mammogram readings.

Ethical issue
This study followed the national and international guidelines set out in the Declaration of Helsinki and complied with the legal requirements on data confidentiality (Law 15/1999 of 13 December on Personal Data Protection [LOPD]).

Mammogram selection
A random sample of 200 mammograms from asymptomatic women aged 50 to 64 years who had participated in the first and second rounds of a population-based breast cancer screening program in Barcelona City (Catalonia, Spain) was selected. The program, which began in 1995, was based on the European Guidelines for Quality Assurance in Mammographic Screening [12] and its results met the Europe Against Cancer standards. All mammograms were located at the same radiology unit and readings were performed by the same team of radiologists. All mammograms were read by two radiologists and, when double readings led to different assessments, a third radiologist served as a tie breaker.
A total of 33,435 mammograms were stratified so that the sample included the four possible results of screening: true negatives, true positives, false negatives, and false positives. These results were validated by comparing the original interpretation obtained in the screening program with the result of the mammogram performed in the following round (2 years later). Histological confirmation was available in all women with a final diagnosis of cancer (both carcinoma in situ and invasive carcinoma).
Of the 200 mammograms selected, 30 (15%) corresponded to women with a definitive diagnosis of cancer (14% true positives, 1% false negatives). The remaining 170 mammograms (85%) corresponded to women with a definitive result of absence of cancer (55% true negatives, 30% false positives by recall). For each participant, double-view mammograms were taken (craniocaudal and mediolateral oblique), with a total of four films per participant. All mammograms complied with the following minimum quality criteria: breast situated centrally with the nipple in profile, visualization of all the breast tissue, pectoral muscle shadow reaching the nipple level, and visualization of the inframammary angle. We excluded a small number of mammograms not meeting these criteria, as well as women requiring more than one film in one of the views, those who had undergone plastic surgery, those with breast implants, and women with radiopaque skin markers on the breast.
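The sample composition can be checked with a short calculation. The sketch below converts the percentages reported above into film counts. The variable names are illustrative and are not taken from the study.

```python
# Percentages of the 200-film sample, as reported in the text.
SAMPLE_SIZE = 200
composition_pct = {
    "true_positive": 14,
    "false_negative": 1,
    "true_negative": 55,
    "false_positive": 30,
}

# Convert percentages to film counts (integer arithmetic, no rounding error).
counts = {name: SAMPLE_SIZE * pct // 100 for name, pct in composition_pct.items()}

# Women with and without a definitive diagnosis of cancer.
cancer = counts["true_positive"] + counts["false_negative"]
no_cancer = counts["true_negative"] + counts["false_positive"]

print(counts)
print("cancer:", cancer, "no cancer:", no_cancer)
```

The counts add up to the 30 cancer and 170 non-cancer mammograms stated in the text.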
Original films (not copies) were always used. All the mammograms were obtained with a standard film-screen technique (Toshiba SSH 140 A and Bennett Trex Medical) using Agfa Mamoray-HT film.

Radiologists
A random sample of 28 radiologists from the radiology services of distinct health centers in Spain (general hospitals, district hospitals and primary care centers) was selected.
Before beginning data collection, the radiologists were asked if they routinely interpreted mammograms. Depending on their responses, the radiologists were then divided into two groups. The first group included 21 radiologists routinely reading mammograms but with different amounts of experience while the second group included seven radiologists who read mammograms infrequently or who were medical residents in radiology (radiologists not routinely interpreting mammograms).

Experience-related variables
To determine radiologists' experience in mammogram interpretation, the 21 routine readers were administered a questionnaire designed after a literature review of the possible factors related to radiologists' self-reported experience [11]. Telephone interviews were performed by one of the project's researchers, with prior agreement from participating radiologists. The items referred to routine practice in mammogram interpretation during the year prior to participation in the study. The following experience-related factors were taken into account:

Annual reading volume
This variable included both screening and diagnostic mammograms. Annual volume was calculated on the basis of the number of readings made per week, bearing in mind holiday periods and rotations.
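As an illustration of how a weekly figure might be scaled to an annual volume, the sketch below multiplies readings per week by the number of weeks actually worked. The exact formula used in the questionnaire is not given in the text, so the function names, the example inputs and the handling of a volume of exactly 5,000 are all assumptions.

```python
def annual_volume(readings_per_week: float, weeks_worked: float) -> float:
    """Estimate annual reading volume from a weekly figure.

    weeks_worked is a hypothetical input: the weeks per year spent reading
    mammograms once holiday periods and rotations have been subtracted.
    """
    return readings_per_week * weeks_worked


def volume_level(volume: float, cutoff: float = 5000) -> str:
    """Dichotomize volume at the 5,000-per-year cut-off.

    Assigning a volume of exactly 5,000 to the lower level is an assumption.
    """
    return "more than 5,000" if volume > cutoff else "5,000 or fewer"


# Hypothetical radiologist: 120 readings per week over 44 working weeks.
example = annual_volume(readings_per_week=120, weeks_worked=44)
print(example, volume_level(example))
```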

Consultations
Radiologists were asked whether they routinely (frequently) consulted with other radiologists when interpreting mammograms. This variable is an indicator of whether the mammogram reading was performed individually or as a team.

Years of experience in reading mammograms
The number of years of experience reading both diagnostic and screening mammograms was evaluated without taking into account years of specialist practice.

Radiologists' age
Age at interview (as a proxy variable of experience).

Focus on breast radiology
This variable was the percentage of radiologists' working hours devoted to breast radiology, including both mammograms and other diagnostic techniques.

Feedback
Radiologists were considered to obtain feedback when they worked in a team with a protocol for the follow-up of all women in whom they recommended further workup after the screening test (imaging tests or invasive procedures).

Reading procedure
Given that the aim was to reproduce as far as possible normal mammogram reading practice, the 28 radiologists independently read the set of 200 mammograms at their workplace. For each breast, the radiologists provided information on the following variables: result, breast density (from less dense to more dense), lesional pattern (nodular, distorting fibrous, mixed, calcified, and parenchymatous asymmetry) and location of the lesion. The results of readings were reported according to the Breast Imaging Reporting and Data System (BI-RADS) [13,14]. In the case of more than one lesion in the same breast, only the most severe lesion was reported. At no time were previous mammograms available to radiologists for comparison while interpreting films.
At the beginning of the study, a session was held with all the radiologists to unify the criteria for mammogram data. The radiologists indicated the results in a standard data collection form that included the norms for completion explained in the initial session.
The participating radiologists were blind to both the study design and the proportion of cancers in the sample, although they were informed that cancer cases were oversampled.

Statistical Analysis
To calculate sensitivity and specificity, mammograms were considered positive (women were recalled for additional investigations) when classified as BI-RADS III, 0, IV or V. Readings were considered negative when they were classified as BI-RADS I or II. A single BI-RADS category was determined for each woman, based on the most malignant of the two breasts.
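The dichotomization described above can be sketched as follows. The code maps each woman's BI-RADS category to a positive or negative reading and computes sensitivity and specificity against the final diagnosis. The function names and the toy readings are illustrative only and do not come from the study data.

```python
# BI-RADS categories counted as positive (recall) and negative, per the text.
POSITIVE = {"III", "0", "IV", "V"}
NEGATIVE = {"I", "II"}


def is_positive(birads: str) -> bool:
    """Return True when the category implies recall for additional investigation."""
    if birads in POSITIVE:
        return True
    if birads in NEGATIVE:
        return False
    raise ValueError(f"unknown BI-RADS category: {birads!r}")


def sensitivity_specificity(readings):
    """readings: iterable of (birads_category, has_cancer) pairs, one per woman.

    Assumes at least one woman with cancer and one without.
    """
    tp = fn = tn = fp = 0
    for birads, has_cancer in readings:
        positive = is_positive(birads)
        if has_cancer:
            tp += positive
            fn += not positive
        else:
            fp += positive
            tn += not positive
    return tp / (tp + fn), tn / (tn + fp)


# Toy readings by one hypothetical radiologist (not study data).
toy = [("V", True), ("IV", True), ("II", True),
       ("I", False), ("II", False), ("0", False), ("III", False), ("I", False)]
sens, spec = sensitivity_specificity(toy)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```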
The area under the ROC curve (AUC) was evaluated to compare the 21 radiologists routinely interpreting mammograms and the seven radiologists who did not routinely interpret mammograms.
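The text does not state how the AUC was estimated. One common empirical estimator, shown here only as a sketch, is the proportion of cancer/non-cancer pairs in which the cancer film receives the more suspicious reading, with ties counted as one half. The ordering of BI-RADS categories used below (in particular the position of category 0) is an assumption, and the readings are toy values.

```python
# Assumed ordering of suspicion; the placement of category 0 is hypothetical.
SUSPICION_RANK = {"I": 1, "II": 2, "III": 3, "0": 4, "IV": 5, "V": 6}


def empirical_auc(cancer_scores, non_cancer_scores):
    """Empirical AUC: P(cancer score > non-cancer score) + 0.5 * P(tie)."""
    wins = 0.0
    for c in cancer_scores:
        for n in non_cancer_scores:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(cancer_scores) * len(non_cancer_scores))


# Toy readings (not study data).
cancer_scores = [SUSPICION_RANK[c] for c in ["V", "IV", "II"]]
non_cancer_scores = [SUSPICION_RANK[c] for c in ["I", "II", "0", "III", "I"]]
auc = empirical_auc(cancer_scores, non_cancer_scores)
print(f"AUC = {100 * auc:.1f}")  # reported on a 0-100 scale, as in the Results
```

This pairwise estimator equals the area under the empirical ROC curve computed with the trapezoidal rule.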
For the univariate analysis, sensitivity and specificity were evaluated in the 21 routine readers according to each experience-related variable, stratified into two levels with a cut-off indicating presumably less and presumably more experience in mammogram reading. The statistical significance of differences in sensitivity, specificity and accuracy between the two levels of routine readers was determined by generalized score tests through marginal models (logit link).
Sensitivity, specificity and accuracy were then modeled by use of multivariate logistic regression estimated through the method described in detail by Smith-Bindman et al [15]. Accuracy was a global measure: the radiologist was assumed to be accurate when mammograms from women with cancer were classified as positive and those from women without breast cancer as negative. These models were adjusted by all the experience-related variables. Because of their characteristics, the seven radiologists not routinely interpreting mammograms were excluded from this regression. Given that 15% of the women in the sample had cancer and 85% were cancer-free, weights were used when estimating accuracy to assign equal importance to interpretation of mammograms from women with and without cancer. The analysis took into account the correlation due to each radiologist interpreting all 200 films. Therefore, marginal models were estimated based on generalized estimating equations (GEE), with a logit link and an exchangeable working correlation matrix. The GENMOD procedure of SAS 9.1 was used.
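The exact weights are not reported. One simple scheme that gives the cancer and non-cancer groups equal importance, shown here as an assumption rather than as the authors' method, weights each film by the inverse of its group's share of the sample. With such weights, accuracy reduces to the mean of sensitivity and specificity. The counts in the example are hypothetical.

```python
# Group shares of the sample, as reported in the text.
SHARE_CANCER = 0.15
SHARE_NO_CANCER = 0.85

# Assumed weights: each group contributes half of the total weight.
W_CANCER = 0.5 / SHARE_CANCER
W_NO_CANCER = 0.5 / SHARE_NO_CANCER


def weighted_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Accuracy with cancer and non-cancer films given equal total weight."""
    correct = W_CANCER * tp + W_NO_CANCER * tn
    total = W_CANCER * (tp + fn) + W_NO_CANCER * (tn + fp)
    return correct / total


# Hypothetical reader: 24 of 30 cancers and 119 of 170 non-cancers correct.
tp, fn, tn, fp = 24, 6, 119, 51
unweighted = (tp + tn) / (tp + fn + tn + fp)
weighted = weighted_accuracy(tp, fn, tn, fp)
print(f"unweighted={unweighted:.3f} weighted={weighted:.3f}")
```

In this example the unweighted accuracy is dominated by the 170 non-cancer films, whereas the weighted value sits midway between sensitivity (0.80) and specificity (0.70).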

Results
The 21 routine readers had a mean age of 47 years (range, 40-60 years), had a mean of 12 years' experience of reading mammograms (range, 4-22 years), had read an average of 5773 mammograms in the year prior to participating in the study (range, 1890-13230 mammograms), and spent an average of 56% of their working hours on breast disease (range, 15-100%). Eighty-one percent (17 of 21) of the radiologists routinely consulted colleagues and 86% (18 of 21) routinely obtained feedback on cases for which they recommended further workup (data not shown).
Given the characteristics defining the group of radiologists not routinely interpreting mammograms, experience-related variables were not evaluated in these seven radiologists.
Higher sensitivity was often associated with a higher false-positive rate for the 28 radiologists (Figure 1). Given the limited number of cancer and non-cancer cases in the sample, these rates were susceptible to small variations in classification during mammogram reading, hence the wide variability. The AUC was 75.2 (95% CI 73.2 to 77.1) for routine readers and 70.3 (95% CI 66.9 to 73.8) for radiologists not routinely reading mammograms, but the greatest difference seems to be observed in the fraction of false-positives in the interval 0%-20% (Figure 2).
The variable of annual reading volume showed no significant differences between radiologists reading more than 5,000 mammograms annually and those reading less than 5,000 mammograms annually. No differences were found in sensitivity (p = 0.193) or in specificity (p = 0.170). No patterns were observed when we compared this variable through the representation of sensitivity versus the fraction of false-positives (Figure 3).
The multivariate model was used to evaluate the 21 radiologists routinely interpreting mammograms. The only measure showing statistically significant differences was specificity (Table 3).

Discussion
In the present study, wide variability in radiologists' interpretations of the sample of mammograms was observed. The group of radiologists not routinely interpreting mammograms showed no differences in average sensitivity in mammogram interpretation compared with routine readers but showed significantly less specificity and accuracy. Of the various experience-related factors used to evaluate this variability, annual reader volume was only important when radiologists not routinely interpreting mammograms were compared with routine readers, the latter showing greater specificity and accuracy. In contrast, no significant differences in sensitivity, specificity or accuracy were found among routine readers between those reading less than 5,000 mammograms per year compared with those reading more than 5,000 films. When the remaining experience-related variables were incorporated into a multivariate model, obtaining feedback on cases for which further workup was recommended increased specificity.
Figure 1. True-positive rate (sensitivity) of the 28 radiologists versus the false-positive rate (1-specificity). Rates not adjusted for patient variables. Legend: 21 radiologists routinely interpreting mammograms; seven readers who did not routinely interpret mammograms.
Figure 2. Empirical ROC curves of the seven radiologists not routinely interpreting mammograms and of the 21 routine readers.
To guarantee the quality of population screening programs, substantial efforts have been made during the last decade to understand the role played by radiologists' experience in the variability of screening mammogram interpretations, as well as to identify the radiologist-associated factors determining accuracy. One of the factors considered most important is annual reading volume. In 1998, Elmore et al [11] observed that the annual volume did not significantly influence the recommendation for workup but concluded that radiologists interpreting relatively few mammograms each year, even over many years, may not be sufficiently experienced to obtain high levels of sensitivity and specificity. Kan et al [10] demonstrated that a minimum of 2,500 annual readings guaranteed a better cancer detection rate.
Since 1998, two distinct lines of argument can be discerned: Esserman et al [16] and Smith-Bindman et al [15] concluded that the quality of mammogram readings could be improved by increasing annual reading volume, while Beam et al [17] and Barlow et al [8] reported that reading volume was not an important variable and that radiologists' interpretative performance is a multifactorial process in which a large number of factors play a role. A recent report by the Institute of Medicine containing an exhaustive review of the literature was not able to demonstrate a clear relationship between volume alone and accuracy [18]. Our opinion is that, unfortunately, in the attempt to guarantee the quality of population screening programs, the study of variability in mammogram readings has been excessively simplified by evaluating the role played by the variable of annual reading volume, with fairly arbitrary cut-off values, beyond which greater accuracy would be achieved.
Figure 3. True-positive rate (sensitivity) of the 21 routine readers versus the false-positive rate (1-specificity), distinguishing between radiologists who read less than 5,000 mammograms per year and those who read more. Rates not adjusted for patient variables. Legend: radiologists reading more than 5,000 mammograms; radiologists reading less than 5,000 mammograms.
Our results, like those of other studies, cast doubt on the major role that has been assigned to the variable of annual reading volume as an indicator of radiologists' experience. As in other variables, we observed a positive association between reader volume and accuracy when comparing the group of radiologists not routinely interpreting mammograms with the group of routine readers (established on the basis of the recommended number of 5,000 mammogram readings annually by the European Guidelines [12], the National Health Service in the United Kingdom [19] or Esserman et al [16]). However, we found no significant differences between the two levels of routine readers in either the univariate or the multivariate analyses. Therefore, in addition to questioning the importance of volume, we highlight the role played by other experience-related variables.
According to the results of the multivariate analysis in the present study, one of the most important factors determining experience is feedback (OR 1.37; 95% CI 1.03 to 1.85), since it allows radiologists to perform a self-evaluation and become aware of the accuracy of previous readings. Moreover, we believe that the design of screening programs should take this factor into account. The other two significant variables found in this study, focus on breast radiology and radiologists' age, should be interpreted jointly because they could show a certain degree of collinearity. Thus, a radiologist aged more than 45 years and spending less than 25% of the working day on breast radiology could correspond to the profile of a highly accurate reader.
Since we found no significant differences in sensitivity between the group of radiologists not routinely interpreting mammograms and the group of routine readers, we believe that sensitivity could present a certain ceiling effect, inherent to the experience-related factors studied to date. This result had previously been discussed in an article explaining how mammography sensitivity has not changed for decades [20,21]. Therefore, we believe that sensitivity is not an appropriate measure to evaluate accuracy, at least not in studies based on mammography samples. We also used the area under the ROC curve, at a specificity of 90%, which allowed us to rank the 28 radiologists according to performance (data not shown). However, for the multivariate analysis, because the data were correlated and our objective was to evaluate average accuracy (rather than the individual effect of each radiologist on accuracy), we believe the optimal analysis was one using marginal models estimated by generalized estimating equations.
We emphasize our study design because, in a sample of screening mammograms, in which selection of thousands of mammograms and hundreds of radiologists is not feasible and in which cancer cases are necessarily oversampled (bearing in mind that the incidence of breast cancer is approximately 3-8‰ in an incident screening round), the composition of the sample is a key factor for understanding the results obtained. These results depend basically on the proportion of true positives, true negatives, false positives, and false negatives chosen from the program to compose the sample. This composition was chosen according to criteria published by Kerlikowske et al [22]. In this sense, given that the percentage of mammograms with uncertain diagnosis in our study was high, we found a large number of false positives and false negatives and consequently the average sensitivity and specificity were only 84% and 64% respectively, which is substantially lower than the sensitivity and specificity expected in a screening program.
Precisely because we chose a sample not representative of the population, a possible limitation of our study can be attributed to contextual bias. To evaluate the extent to which sensitivity and specificity were influenced by the sample, we performed an ad hoc analysis using only the 138 mammograms with a true positive and true negative result, and found that sensitivity did not vary, but that specificity was increased from 64% to 77%. Thus, we justify the study design based on a sample of mammograms by the difficulty of performing a prospective study in which recruitment of a sufficiently large number of radiologists to guarantee adequate statistical power would be difficult.
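The 138 mammograms used in the ad hoc analysis correspond to the 28 true positives plus the 110 true negatives of the original screening. As a rough back-of-envelope check, which is not a result reported by the study, the two specificity figures also imply an approximate specificity for the 60 mammograms that were false positives by recall in the original program. The reported percentages are rounded averages across readers, so the derived value is indicative only.

```python
# Film counts derived from the sample composition reported in the Methods.
n_true_positive = 28
n_true_negative = 110
n_non_cancer = 170
n_false_positive_by_recall = n_non_cancer - n_true_negative  # 60 films

# Size of the ad hoc subsample (true positives and true negatives only).
n_ad_hoc = n_true_positive + n_true_negative

# Average specificity reported for all non-cancer films and for true negatives.
spec_all = 0.64
spec_true_negative = 0.77

# Overall specificity is a count-weighted mix of the two subgroups; solve for
# the subgroup that is not reported. Illustrative derivation only.
implied_spec_fp = (
    spec_all * n_non_cancer - spec_true_negative * n_true_negative
) / n_false_positive_by_recall

print(n_ad_hoc, round(implied_spec_fp, 2))
```

Under these assumptions, readers classified well under half of the originally recalled non-cancer films as negative, which is consistent with the statement that films with an uncertain diagnosis lowered the average specificity.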
In addition to experience-related factors, variability is also explained by differences in organization and protocols for reading mammograms, which are not homogeneous in all countries [23,24]. In Europe, screening programs are population-based, publicly financed and adhere to European Guidelines that guarantee the quality of the process [12,25], while in the USA, financing and organization are managed basically by private insurance. The characteristics of the protocol should also be taken into account: beyond mammography quality, there are differences in the system of double reading and the method of tie break, in the number of views, in the percentage of clinical investigations and in the adaptation of BI-RADS, all of which greatly hamper comparisons among studies.

Conclusion
In conclusion, the results obtained in the present study are in line with those of the most recent publications, in which radiologists' experience depends on multiple factors; therefore experience-related variables should not be interpreted in isolation.
There may be an optimal combination of experience and volume that is required to achieve reasonable performance, but greater experience and volume may not contribute to greater improvement. The danger of the volume argument is that even if an association is found between better performance and higher volumes, this association may not be causal. Higher volume radiologists may have better equipment, better feedback loops, etc. that could make them appear to be better readers. What is needed is a demonstration that performance actually improves over time with each mammogram read within the practice of individual radiologists.