The effect of the look-back period for estimating incidence using administrative data

Background
A look-back period is needed to define the baseline population when estimating incidence. However, a short look-back period is known to overestimate disease incidence by misclassifying prevalent cases as incident cases. The purpose of this study is to evaluate the impact of look-back periods of various lengths on the observed incidences of uterine leiomyoma, endometriosis and adenomyosis, and to estimate the true incidences while accounting for misclassification errors in longitudinal administrative data in Korea.
Methods
A total of 319,608 women aged 15 to 54 years in 2002 were selected from the Korea National Health Insurance Services (KNHIS) cohort database. To minimize the misclassification bias incurred when look-back periods of varying length are applied, we used 11 years of claims data and a prediction model to estimate the incidence as if an equal 11-year look-back period had been applied to every year. The association of the year of diagnosis and the number of prevalent cases with the misclassification rates for each look-back period was investigated. Based on the findings, prediction models for the proportion of misclassified incident cases were developed using multiple linear regression.
Results
The proportions of misclassified incident cases of uterine leiomyoma, endometriosis and adenomyosis were 32.8, 10.4 and 13.6%, respectively, for the one-year look-back period in 2003. These numbers decreased to 6.3% for uterine leiomyoma and −0.8% for both endometriosis and adenomyosis when all available look-back years (11 years) were used in 2013.
Conclusion
This study demonstrates approaches for estimating incidences that account for the different proportions of misclassified cases under look-back periods of various lengths. Although the prediction model used for estimation showed strong R-squared values, follow-up studies are required to validate the study results.


Health insurance claims as big data
Administrative data in healthcare primarily refer to the vast medical information available in the form of electronic health records through administrative or health claims data [1]. As the availability of digitized administrative records increases, health researchers are able to use these large longitudinal cohort datasets to estimate epidemiologic indicators, such as the incidence and prevalence of various conditions [2][3][4][5][6][7][8][9][10][11][12][13][14][15]. The strengths of this type of large population study include a large sample size and the avoidance of selection or participation bias [16].
The Korean National Health Insurance Service (KNHIS) covers the majority of the population as a single payer that reimburses both public and private institutions. All clinics and hospitals submit health insurance claims each month to the Health Insurance Review and Assessment Service (HIRA) for claims review. The insurance claims include diagnoses (as defined by the International Classification of Diseases 10th revision, ICD-10), demographic information, and medical charges. KNHIS and HIRA share the claims database, which represents the entire Korean population; this is a major strength that supports its applicability to epidemiologic and disease research.

Estimation of incidence rates from administrative data
Estimating incidence provides a foundation for epidemiologic research, data for resource allocation in health care services, and valuable information for disease prevention. The incidence rate is defined as the ratio of new cases to the total population at risk of the disease. However, identifying new cases from administrative data is difficult because information on a patient's disease status before the observation window of the data is limited.
A common procedure for determining incident cases is to exclude cases that carry the respective diagnosis during the look-back period. A long look-back period identifies incident cases more accurately than a short one, but valuable data are lost for analysis. A short look-back period, on the other hand, carries the risk of misclassifying prevalent and recurrent cases as incident cases [17,18].
Studies have used look-back periods of various lengths [19][20][21][22]. Typically, studies have used look-back periods of 3 to 10 years [19][20][21][22] because a look-back period of less than 3 years can lead to severely overestimated incidences [23]. However, because of limited data, numerous studies have either not considered a look-back period or have reported a diagnosis-free interval of only 1, 2, or 3 years [24][25][26][27]. Additionally, most studies have focused on estimating a one-year incidence by applying different look-back periods [28][29][30], and few studies have investigated incidence trends in longitudinal data. In this study, we investigated the incidence trend while accounting for the look-back period, which lengthens with each successive year of longitudinal administrative data.
The purposes of this study are to evaluate the impact of look-back periods of various lengths on the observed incidences of uterine leiomyoma, endometriosis and adenomyosis, which are the most common gynecologic diseases in women of reproductive age and are associated with infertility and adverse pregnancy outcomes [31], and to estimate the true incidences and their trends, accounting for misclassification error rates, using longitudinal administrative health data in South Korea. Although a sufficiently long look-back period is advisable when calculating incidence from administrative data, we sought a way to minimize data loss.
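The look-back rule described above can be sketched in a few lines. This is a minimal illustration, not the study's actual procedure: the function name `classify_case`, the yearly granularity, and the example patient are all assumptions introduced here.

```python
def classify_case(diagnosis_years, obs_year, look_back):
    """Classify one patient for one observation year (hypothetical helper).

    diagnosis_years: years in which the patient has a claim with the target diagnosis
    obs_year:        the year for which incidence is being estimated
    look_back:       length of the look-back period in years
    Returns "incident", "prevalent", or None if there is no diagnosis in obs_year.
    """
    years = set(diagnosis_years)
    if obs_year not in years:
        return None
    # Any diagnosis inside the look-back window marks the case as prevalent.
    window = range(obs_year - look_back, obs_year)
    if any(y in years for y in window):
        return "prevalent"
    return "incident"


# Made-up patient: diagnosed in 2003 and seen again in 2008.
print(classify_case([2003, 2008], 2008, look_back=1))  # window 2007 only
print(classify_case([2003, 2008], 2008, look_back=5))  # window 2003-2007
```

With a one-year window the 2008 visit looks like a new case, while a five-year window reveals the earlier diagnosis; this is the misclassification that a short look-back period introduces.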

Data source
We conducted a retrospective population-based cohort study using the National Health Insurance Service-National Sample Cohort (NHIS-NSC) 2002-2013. The data were produced by the KNHIS using a systematic sampling method to generate a representative sample from the target population of 46,605,433 individuals in 2002. The database comprises 1,025,340 subjects, approximately 2.2% of the total eligible Korean population in 2002, who were followed up for 11 years until 2013. The representativeness of the data has been presented elsewhere [32].
It is a semi-dynamically constructed cohort database: individuals are followed up until death, emigration, or the end of the study period, and newborn infants are added to the database annually [32]. The database includes all medical claims filed from January 2002 to December 2013. More details of the cohort are described elsewhere [32].

Selection of subjects
Patients in Korea tend to visit several healthcare institutions, as they can access clinics, specialists, and hospitals without restriction. Thus, a patient may visit several clinics or hospitals in one day, have multiple diagnostic codes at a time, have multiple claims on the same day in the same clinic or hospital, or have both outpatient treatment and a hospital admission on the same day. Therefore, one claim had to be selected to define incidence, taking all of these situations into account. We set priorities in the following order.
First, priority is given to the claim with the earliest hospital visit date. If a patient has several claims with the same visit date, an inpatient claim takes priority over an outpatient claim. Among several outpatient claims, the claim in which the diagnosis code ranks highest is selected, taking ranks in ascending order. If the rank of the diagnostic code is the same, priority is given to the claim with the higher medical cost. Finally, priority is given to the claim with the earlier billing number. Although some individuals had gaps of a few years in their records between 2002 and 2013, we considered them continuously insured and included them as subjects.
A flow chart indicating the number of patients with one of the three gynecological diseases is shown in Fig. 1. The population denominator was a total of 319,608 women aged 15-54 who were eligible for the National Health Insurance in 2002, among 512,082 female individuals in the KNHIS cohort database. Those women were followed up for 11 years until 2012. The incident cases were defined using the standardized codes from the Korean version of the International Classification of Diseases 10th Edition (ICD-10). Cases with diagnostic codes of the target diseases coded in the health insurance claims between 2002 and 2013 were identified regardless of service type. The target diseases of interest were uterine leiomyoma (ICD-10: D25, D25.0, D25.1, D25.2, D25.9), adenomyosis (ICD-10: N80.0), and endometriosis. To identify patients with a prior history of the disease, a one-year look-back period as of 2003 was applied, based on the judgment of obstetricians and gynecologists that patients would visit a gynecologist within one year of disease onset. There were 43,814 patients after excluding patients with the target diseases in 2002. Patients who had concurrent diagnoses of uterine leiomyoma, adenomyosis, or endometriosis were counted under each of the target diseases. Therefore, there were 37,431 patients with a diagnosis of uterine leiomyoma, 8897 with adenomyosis, and 5908 with endometriosis.
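The priority order above maps naturally onto a tuple sort key. The sketch below is an illustration under stated assumptions: the `Claim` fields and the function names are hypothetical, not the NHIS-NSC schema, and for simplicity the diagnosis-rank rule is applied to all claims rather than only among outpatient claims.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Claim:
    # Hypothetical fields; the real claims tables use different names.
    visit_date: date
    inpatient: bool
    diagnosis_rank: int   # 1 = primary diagnosis, 2 = first secondary, ...
    cost: float
    billing_number: int


def priority_key(claim):
    """Smaller tuples win: earlier date, inpatient first, better diagnosis rank,
    higher cost, then earlier billing number."""
    return (
        claim.visit_date,
        not claim.inpatient,      # False sorts before True, so inpatient first
        claim.diagnosis_rank,
        -claim.cost,              # negate so the higher cost sorts first
        claim.billing_number,
    )


def select_index_claim(claims):
    """Pick the single claim used to define incidence for one patient."""
    return min(claims, key=priority_key)


claims = [
    Claim(date(2005, 3, 2), False, 1, 50.0, 101),
    Claim(date(2005, 3, 2), True, 2, 400.0, 102),
    Claim(date(2005, 4, 9), True, 1, 900.0, 103),
]
print(select_index_claim(claims).billing_number)
```

Because Python compares tuples element by element, each later rule is consulted only when all earlier ones tie, which mirrors the stepwise wording of the selection rules.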

Estimated incidence
To assess the relationship between the look-back period and the number of misclassified cases, the annual number of patients diagnosed with uterine leiomyoma, adenomyosis, or endometriosis (prevalent cases) from 2003 to 2013 was determined, and the number of prevalent cases misclassified as incident cases was identified for each observation year as the look-back period increased (Additional File 1).
The association of the year of diagnosis and the number of prevalent cases with the misclassification rates for each look-back period was investigated. Based on the findings, prediction models for the proportion of misclassified incident cases were developed using multiple linear regression. The model of best fit was selected as the one with the lowest root mean square error (RMSE) or the largest adjusted R-squared value, both of which are common measures of a prediction model's accuracy. Estimated incidences were calculated using the best prediction model and compared with the observed incidences.
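The model-selection step can be sketched with ordinary least squares in NumPy. Everything below is illustrative: the response values are synthetic (made up, not the study's data), the helper `fit_ols` and the candidate names are assumptions, and RMSE is computed here as the root of the mean squared residual, which may differ from the degrees-of-freedom-corrected version some statistical packages report.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares fit with intercept; returns (coefficients, RMSE, adjusted R^2)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    n, p = len(y), A.shape[1] - 1          # p = number of predictors
    rmse = float(np.sqrt(np.mean(resid ** 2)))
    r2 = 1.0 - float(resid @ resid) / float(np.sum((y - y.mean()) ** 2))
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return beta, rmse, adj_r2


# Design: in year Y only look-back periods of 1..(Y - 2002) years are available.
rows = [(year - 2003, lb) for year in range(2003, 2014)
        for lb in range(1, year - 2002 + 1)]
t = np.array([r[0] for r in rows], dtype=float)    # years since 2003
lb = np.array([r[1] for r in rows], dtype=float)   # look-back length in years

# Synthetic response (made-up coefficients), linear in year and in log(look-back).
y = 30.0 + 0.5 * t - 10.0 * np.log(lb)

candidates = {
    "year + log(look-back)": np.column_stack([t, np.log(lb)]),
    "year + look-back": np.column_stack([t, lb]),
}
results = {name: fit_ols(X, y) for name, X in candidates.items()}
best = min(results, key=lambda name: results[name][1])   # lowest RMSE
print(best)
```

Selecting by lowest RMSE and by highest adjusted R-squared can disagree when candidates have different numbers of predictors; here both candidates use two predictors, so the two criteria give the same ranking.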

Misclassification rates for each year by length of look-back period
The grey cells in the last column of each observation year in Table 1 show the number of prevalent cases misclassified as incident cases, which were discovered by applying the look-back period. In 2003, with a one-year look-back period, 785 (20.1%) of a total of 3902 patients with uterine leiomyoma were misclassified as incident cases. In 2013, however, with an 11-year look-back period, the number of such cases among 8348 patients increased to 4522 (54.2%). Tables 2 and 3 show the proportions of patients diagnosed with adenomyosis and endometriosis who were misclassified as incident cases for each look-back period. With a look-back period of 11 years, 733 (41.6%) patients with adenomyosis and 494 (50.3%) patients with endometriosis were estimated to have a prior history of the disease.
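As a quick arithmetic check, the two uterine leiomyoma percentages quoted above follow directly from the reported counts. The helper name `detected_pct` is an assumption introduced for this sketch.

```python
def detected_pct(detected, total):
    """Share of diagnosed patients found to have a prior diagnosis, in percent."""
    return round(100 * detected / total, 1)


print(detected_pct(785, 3902))    # 2003, one-year look-back
print(detected_pct(4522, 8348))   # 2013, 11-year look-back
```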

Prediction of the proportions of misclassification
The year of diagnosis and the number of patients were linearly related to the proportion of misclassification for uterine leiomyoma, adenomyosis and endometriosis, and the look-back period was logarithmically related to the proportion of misclassification (Supplementary Figs. 1, 2 and 3). Using these findings, four prediction models were developed (Table 4). Model A was selected as the model of best fit because it had the smallest RMSE and highest estimated R-squared value. The independent variables were the year of diagnosis and the log-transformed look-back period. Table 5
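Given the two independent variables named for Model A, its functional form can be written as below. The symbols are assumptions introduced here for illustration: p is the proportion of misclassified incident cases, t the year of diagnosis, L the look-back period in years, and the β terms unreported coefficients; the base of the logarithm is not stated in the text, so the natural logarithm is assumed.

```latex
p_{t,L} = \beta_0 + \beta_1\, t + \beta_2 \ln(L) + \varepsilon_{t,L}
```

Under this form, the misclassification proportion for an equal 11-year look-back in any year t would be obtained by setting L = 11.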

Discussion
An administrative health claims database was used to calculate the annual incident cases of uterine leiomyoma, adenomyosis and endometriosis in South Korea (2003-2013). The proportion of prevalent cases misclassified as incident cases was estimated for look-back periods of various lengths in years. As the look-back period increased, the proportion of misclassified incident cases decreased. Shorter look-back periods yielded incidences with a greater proportion of misclassification.
It is difficult to accurately identify new cases among the patients diagnosed each year because a misclassification bias exists, in which prevalent cases are counted as incident cases to a degree that depends on the look-back period, which changes every year during the research period. Thus, to minimize this systematic error, we used 11 years of claims data and a prediction model to estimate the incidence as if an equal 11-year look-back period had been applied to each year.
Table footnotes: 1 Prevalent cases detected by look-back period divided by prevalent cases. 2 Estimated misclassified cases (n) and the misclassification rate (%) for the 11-year look-back period, calculated using the prediction model.

Optimal look-back period for annual incidence
As noted in Abbas's study, the optimal look-back period for estimating annual incidence while minimizing the rate of misclassification depends on the nature and the stage of the respective disease [23]. In uterine leiomyoma and adenomyosis, the proportion of misclassified cases decreased by about 50% when the look-back period increased from 6 years to 7 years, and in endometriosis, it decreased by about 10% when the look-back period increased from 7 years to 8 years. The proportion of misclassified cases of endometriosis in 2007 was −0.6%, which is considerably smaller than the 7.6% of the previous year. Therefore, the disease-specific look-back period required was at least 7 years for uterine leiomyoma and adenomyosis, and 8 years for endometriosis. The extent of misclassification varied by disease even when the same length of look-back period was applied. In 2003, with a one-year look-back period, the proportion of misclassification for uterine leiomyoma was 32.8%, while for adenomyosis and endometriosis it was 10.4 and 13.6%, respectively. Similarly, with the 11-year look-back period in 2013, the proportion of misclassification was 6.3% for uterine leiomyoma and −0.8% for adenomyosis and endometriosis, which is negligible.
Incidences can be affected by external factors. The number of endometriosis patients decreased significantly in 2007 and did not increase thereafter. One possible reason is that the HIRA strengthened its coding requirements in 2006 to require full-digit detail codes. Subsequently, endometriosis patients coded as N80 might have been redistributed to N80.0 for adenomyosis and N80.1 to N80.9 for endometriosis. The estimated number of incident cases of the disease in 2013 should be interpreted with caution. When the estimated incidence is lower than the observed incidence, the observed incidence should be used instead of the estimated incidence in practice.
According to the Organization for Economic Cooperation and Development (OECD) statistics in 2018, the annual number of outpatient visits per capita in Korea in 2016 was 17.0, which is the highest among OECD countries and 2.5 times the OECD average (6.9) [33]. As such, a look-back period of the same duration applied to administrative health data in Korea is expected to yield a higher proportion of misclassification than in other OECD countries.

Strengths and limitations
The strengths of this study include a large sample size and a long observation period of 12 years. These increase the accuracy of the calculated incidences and proportions of misclassification. However, the study has several limitations.
In the regression model for estimating the number of incident cases, a linear function of the observation year and a log function of the look-back period were used. There were 11 data points for the one-year look-back period, but only one point for the 11-year look-back period. Although the prediction model had a good RMSE and R-squared, the uneven distribution of the observed data points on which the model was based may have adversely affected its fit.
The study has inherent limitations because it was based on secondary data analyses of the NHIS cohort database. We could not definitively confirm the diagnosis codes for every patient in the database, since the diagnostic code in the claims data alone cannot guarantee the accuracy of the diagnosis [34]. According to Park et al. [35], about 70% of primary diagnosis codes concurred with medical records. Issues concerning studies involving administrative data are well described in Mazzali and Duca's study [36]. If the cases had been confirmed by prescription codes and procedure codes in addition to the diagnostic codes, the incidences would have been lower than those found in this study. Lastly, asymptomatic and/or undiagnosed patients cannot be detected using health claims data. This would decrease the proportion of the true incident cases of the diseases.
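The uneven spread of data points follows directly from the study window: with claims starting in 2002, an observation year can only support look-back periods that reach back to 2002. A short sketch under that assumption (the variable names are introduced here for illustration) reproduces the counts quoted above.

```python
from collections import Counter

FIRST_DATA_YEAR = 2002          # earliest claims year in the cohort
OBS_YEARS = range(2003, 2014)   # observation years 2003-2013

# For each observation year, every look-back length from 1 up to the
# number of earlier data years yields one (year, look-back) data point.
points_per_look_back = Counter(
    lb
    for year in OBS_YEARS
    for lb in range(1, year - FIRST_DATA_YEAR + 1)
)

for lb in sorted(points_per_look_back):
    print(lb, points_per_look_back[lb])
```

Each one-year increase in look-back length removes one usable observation year, so the regression is driven mostly by short look-back periods.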

Conclusion
Using the NHIS administrative health database, look-back periods of various lengths were applied to estimate the incidences of uterine leiomyoma, adenomyosis, and endometriosis and to determine the proportion of misclassification errors for each look-back period. The prediction model was used to adjust for the misclassification errors that occur when calculating incidence trends derived from longitudinal administrative data. Although the prediction model used for estimation showed strong R-squared values, follow-up studies are required to validate the study results.
In the longitudinal data, the look-back period applied for incidence estimation generated a different misclassification error for each length of look-back period. We proposed a method to adjust for the misclassification errors when calculating incidence using administrative data. Even though we focused on three gynecological diseases in this study, the approaches presented here are applicable to other diseases as well.