Cross-validation of comorbidity items in two national databases in a sample of patients with end-stage kidney disease

Background The use of national medico-administrative databases for epidemiological studies has increased in the last decades. In France, the Healthcare Expenditures and Conditions Mapping (HECM) algorithm has been developed to analyse and monitor the morbidity and economic burden of 58 diseases. We aimed to assess the performance of the HECM in identifying different conditions in patients with end-stage kidney disease (ESKD) using data from the REIN registry (the French National Registry for patients with ESKD). Methods We included all patients over 18 years of age who started renal replacement therapy in France in 2018. Five conditions with a similar definition in both databases were included (ESKD, diabetes, human immunodeficiency virus [HIV], coronary insufficiency, and cancer). The performance of each SNDS algorithm was assessed using sensitivity, specificity, positive predictive values (PPVs), negative predictive values (NPVs), and Cohen’s kappa coefficient. Results In total 5,971 patients were included. Among them, 81% were identified as having ESKD in both databases. Diabetes was the condition with the best performance, with a sensitivity, specificity, PPV, NPV, and Kappa coefficient all over 80%. Cancer had the lowest level of agreement with a Kappa coefficient of 51% and a high specificity and high NPV (94% and 95%). The conditions for which the definition in the HECM included disease-specific medications performed better in our study. Conclusion The HECM showed good to very good concordance with the REIN database information overall, with the exception of cancer. Further validation of the HECM tool in other populations should be performed. Supplementary Information The online version contains supplementary material available at 10.1186/s12913-023-10145-y.

exhaustive. In addition, it is less costly, as the data are collected systematically and are relatively easy to access, and it eliminates recall bias, as it does not rely on patient reporting with potential recollection mistakes. National databases are helpful for longitudinal studies, as they make it possible to include extended follow-up times and large sample sizes, as well as rare events and epidemiological surveillance or surveys. Such databases are, however, not exempt from information bias [2,3], as the information tends to be essentially administrative. For example, pharmaceutical information is limited to the dispensation of prescribed and reimbursed medications that are registered in the insurance records [1]. Over-the-counter medications can easily be missed.
The French population benefits from universal public healthcare coverage. All information concerning the use of the healthcare system is recorded in the National Health Data System ("Système National des Données de Santé, SNDS") [4]. Since 2012, the French National Health Insurance has developed a tool based on the SNDS to analyse and monitor the morbidity and economic burden of 58 treated diseases, chronic treatments, and episodes of care through healthcare utilization [5]. Healthcare Expenditures and Conditions Mapping (HECM) allows the identification of diseases by means of medical algorithms based on the diagnoses for hospitalization, long-term disease diagnoses, and reimbursement of specific treatments for certain diseases for a given year and a period up to four years before. This algorithm is repeated for each year, providing a cross-sectional study repeated over time [6]. The HECM has provided information to improve healthcare policies in France (preparing the French Social Security Funding Act and the Public Health Act). The findings of the HECM on disease prevalence and expenditures are similar to those of studies conducted in other countries [6].
A previous study in France compared the performance of various SNDS-based algorithms to identify treated diabetes against clinical data from CONSTANCES (a national French cohort of professionally active or retired salaried workers and their families), showing excellent performance for the three algorithms, including HECM's current algorithm concerning diabetes [7]. However, such algorithm validity assessments are still scarce. Data from registries offer this opportunity because they provide gold-standard data: they are exhaustive for a given territory, registered manually, and controlled by experienced research assistants.
We aimed to assess the performance of five HECM algorithms (ESKD, diabetes, HIV infection, cancer, and coronary disease) in patients with ESKD against information from the French Renal Epidemiology and Information Network (REIN). The REIN database provides national quality-controlled data on patients with ESKD. It relies on a network of nephrologists, epidemiologists, patients, and public health representatives who are coordinated regionally and nationally by the French biomedical agency, collecting exhaustive information on patients with ESKD (treatment and its changes, demographics, comorbid conditions, treatment center location, etc.) [8,9].

The REIN registry
The REIN registry was started in 2002 and covered all of France by 2012. It includes all patients receiving renal replacement therapy (RRT) in mainland France and its overseas territories. The REIN database collects information on patient characteristics (body mass index [BMI], age, sex, RRT modality, date of RRT start) and conditions (e.g., diabetes, coronary artery disease, cancer) based on medical records. Nephrologists, health managers, nurses, medical secretaries, and research assistants collect the data. Continuous controls are ensured during the year (with a strict focus on inclusion criteria, which excludes patients with acute renal disease). Yearly updates are performed to allow the inclusion of new information on patient treatment status, as well as comorbid condition updates. Detailed information on the definitions of comorbidities and coding in REIN can be found in Caillet et al. [9]. Quality controls and data collection procedures are detailed in Couchoud et al. [8].

The SNDS
The SNDS (a medico-administrative database) collects individual data from various French health insurance schemes. This database contains exhaustive expense and reimbursement information on hospitalizations, ambulatory care, medications, laboratory analyses, and consultations for both public and private healthcare facilities, as well as transportation, compensatory daily allowances, and third-party compensatory indemnity, regardless of the payer of the services (state, complementary insurance, or out of pocket). It does not record primary care consultation diagnoses or clinical results. For reimbursement, the SNDS includes information on long-term chronic diseases (LTD, a status that guarantees 100% reimbursement for healthcare expenses related to the disease when reported, given the fact that the patient could already have been considered for LTD due to another medical condition) [4].
The HECM applied to the SNDS database uses discharge diagnoses, as well as the chronic diseases registered for healthcare reimbursement and/or specific medical acts/drugs, to identify patient conditions (different algorithms for each condition, see details in Supplementary Table 1). These algorithms are applied to all beneficiaries of the health insurance schemes in France (66.3 million inhabitants) who have used the healthcare system at least once during the year of interest. The pathologies, chronic treatments, and use of healthcare identified are, for the most part, non-exclusive, as the same person can be affected by several pathologies [5].

Study population
We included patients over 18 years of age who started RRT (either dialysis or renal transplant) in France in 2018, were identified through the REIN registry, and could be linked to the SNDS database.
Independently of the present study, all REIN patients were matched with SNDS patients over the available extraction period (2006-2020) by the national coordination of REIN, using an indirect deterministic linkage based on a combination of six items: sex, age, location of residence, date and facility of kidney transplant or start of dialysis treatment, and date of death, if available, with varying granularity (age ± 1, location at municipality or district, date ± 2 months, exact facility or facility in the same district). Further details on the linkage procedure applied yearly can be found in Raffray et al. [10]. For the purpose of this study, we selected only subjects from our incident population considered to have "good linkage". Good linkage was defined as an exact match on sex plus either (1) an exact match on date of death, whatever the granularity of the other four items, or (2) an exact match on two or more of the following items: age, location of residence, date of RRT, and facility of RRT. Other combinations were not included in the present study.
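The "good linkage" rule described above can be sketched in code. This is a minimal illustration only: the `LinkageMatch` structure, its field names, and the `is_good_linkage` function are hypothetical and are not taken from the REIN-SNDS linkage procedure itself, which is detailed in Raffray et al. [10].

```python
from dataclasses import dataclass


@dataclass
class LinkageMatch:
    """Whether each linkage item matched exactly (hypothetical structure)."""
    sex: bool
    age: bool
    residence: bool
    rrt_date: bool
    rrt_facility: bool
    death_date: bool  # False when no date of death is available


def is_good_linkage(match: LinkageMatch) -> bool:
    """Exact match on sex plus either an exact date of death or at least
    two exact matches among age, residence, RRT date, and RRT facility."""
    if not match.sex:
        return False
    if match.death_date:
        return True
    exact_other_items = sum(
        [match.age, match.residence, match.rrt_date, match.rrt_facility]
    )
    return exact_other_items >= 2
```

Under this reading, a record with an exact match on sex, age, and RRT date would be kept, whereas a record matching exactly on sex and age alone would be excluded.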

Health conditions compared
For the purposes of this study, the conditions identified in the REIN registry were considered as the reference.
The following conditions identified in both the SNDS and REIN registry were included in this study: ESKD, diabetes, HIV, coronary disease, and cancer (see definition for each in Supplementary Table 1). These conditions were selected because their identification methods in both databases were comparable. In addition, the conditions studied presented an opportunity to explore the performance of algorithms with different characteristics. Diabetes and HIV infection have disease-specific treatments and are likely to be well identified in pharmaceutical records, the former being very frequent and the latter much less so. Coronary disease identification relies mainly on clinical criteria, and cancer represents a combination of both cases. The definitions of other conditions identified with the HECM were too different from those in the REIN registry.

Statistical analysis
A descriptive analysis was performed comparing subjects with and without good REIN-SNDS linkage (patients included vs those excluded from the study). The characteristics compared included survival after 2018 (recruitment year), first RRT, sex, comorbid conditions, age, and region of residence in France.
The performance of each algorithm was evaluated using sensitivity, specificity, the positive predictive value (PPV), the negative predictive value (NPV), and Cohen's kappa coefficient, together with the 95% confidence interval (CI). The level of agreement was assessed as poor (K-coefficient ≤ 0.20), fair (0.20 < K-coefficient ≤ 0.40), moderate (0.40 < K-coefficient ≤ 0.60), good (0.60 < K-coefficient ≤ 0.80), or very good (K-coefficient > 0.80) [11]. All patients included in the REIN registry had ESKD by definition. Therefore, only true positives and false negatives could be calculated for the item ESKD.
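As an illustration of these measures, the following sketch computes them from the four cells of a 2x2 table, taking REIN as the reference and the HECM as the test. It is not the SAS code used in the study; the function names and the example counts are hypothetical, and confidence intervals are not computed.

```python
from typing import Optional


def _ratio(numerator: float, denominator: float) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def agreement_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy measures and Cohen's kappa from a 2x2 table (REIN = reference)."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Agreement expected by chance, from the marginal totals of both sources
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
        "kappa": _ratio(observed - expected, 1 - expected),
    }


def agreement_level(kappa: float) -> str:
    """Label a kappa value using the thresholds given in the text [11]."""
    if kappa <= 0.20:
        return "poor"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "good"
    return "very good"


# Hypothetical counts, not study data
example = agreement_metrics(tp=40, fp=5, fn=10, tn=45)
print(example, agreement_level(example["kappa"]))
```

When every patient has the condition in the reference, as for ESKD, the cells `fp` and `tn` are zero, so the specificity is returned as `None` and only the sensitivity is informative.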
To account for the fact that HECM algorithms were designed for medico-economic purposes and that individuals may not be taken into account when they are treated at the beginning or end of the year, a secondary analysis was performed for subjects whose comorbidity data did not match for the year 2018. In these cases (unmatched conditions for 2018), the comorbidity information from the HECM for the years 2017 and 2019 was taken into consideration and new comparisons were performed. As an example, if a patient's diabetes status did not match for the year 2018, we considered their HECM diabetes status for the year 2017 and repeated the comparison for the whole population. This secondary analysis was carried out for all conditions.
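One possible reading of this secondary analysis is sketched below: the HECM 2018 status is kept when it agrees with REIN and is replaced by the status from an adjacent year (2017 or 2019) only when it does not. The function name and the example records are hypothetical, and the exact substitution rule applied in the study may differ.

```python
def hecm_status_for_secondary_analysis(
    rein_status: bool, hecm_2018: bool, hecm_adjacent_year: bool
) -> bool:
    """HECM status used in the secondary comparison (assumed rule)."""
    if hecm_2018 == rein_status:
        return hecm_2018
    # Mismatch in 2018: fall back on the HECM status of the adjacent year
    return hecm_adjacent_year


# Hypothetical records: (REIN status, HECM 2018, HECM 2017)
patients = [
    (True, True, True),
    (True, False, True),   # 2018 mismatch resolved by the 2017 status
    (False, True, True),   # mismatch that persists
    (False, False, True),  # already concordant, so 2017 is ignored
]

concordant_primary = sum(rein == h18 for rein, h18, _ in patients)
concordant_secondary = sum(
    rein == hecm_status_for_secondary_analysis(rein, h18, h17)
    for rein, h18, h17 in patients
)
print(concordant_primary, concordant_secondary)
```

Because only discordant records are revisited, this rule can increase the agreement but never decrease it, which should be kept in mind when interpreting the secondary results.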
A comparison of certain characteristics was conducted (survival after 2018, first renal replacement treatment, sex, age, region of residence, nephropathy at recruitment, acute kidney disease diagnosis) to better understand the populations whose conditions did and did not match for the year 2018.
All analyses were performed using SAS Enterprise Guide software (version 8.3; SAS Institute Inc., Cary, NC, USA).

Ethical approval
The creation of the REIN registry was approved by the relevant French committees: the Comité consultatif sur le traitement de l'information en matière de recherche (CCTIRS N°03-149) and the Commission nationale de l'informatique et des libertés (CNIL N° 903,188).
The French national health insurance (CNAM) in charge of the SNDS (Système National des Données de Santé) has permanent access to the pseudonymized reimbursement data in application of the provisions of articles R. 1461-12 et seq. of the French Public Health Code, with rules and criteria similar to the Helsinki declaration and permanent full access to the SNDS by decree (Décret n° 2016-1871 du 26 décembre 2016 relatif au traitement de données à caractère personnel dénommé « système national des données de santé »). The CNAM has authorization to perform studies based on SNDS data from the CNIL (National independent Commission for Computing and Freedom, the French data protection agency for sensitive information). All methods were carried out in accordance with relevant guidelines and regulations.

Results
In total, 8,309 individuals were identified as incident patients in the REIN registry for the year 2018 (present in both databases). Among them, 5,971 patients were included in our study because of good linkage between the REIN and SNDS databases. The excluded population (those without good linkage) was more likely to include patients who died in the year they started RRT, started RRT with dialysis, were older, or were residents of Ile-de-France (Paris region) (Table 1).

ESKD status
With the HECM 2018 algorithm, 81% of the subjects with ESKD were true positives. In a secondary analysis that included information on the ESKD status from the HECM for 2019, the percentage of patients correctly identified by the SNDS database increased to 93% (Table 2). The 1,126 false-negative ESKD patients (HECM 2018) were more likely to have died in the year they started treatment, to have started treatment with dialysis, to be older, to be residents of Ile-de-France, and to be classified in the SNDS database as having acute renal disease (Table 3).

Diabetes
Forty-two percent of the population identified in the REIN database were registered as having diabetes. With the HECM 2018, diabetes status differed between the two databases for 8% of the population, with the discrepancies distributed equally between false positives and false negatives (Table 2). The 530 patients with differing diabetes status included a higher proportion of patients who had transplantation as their first RRT, were over 75 years of age, or were residents of Ile-de-France (Table 3). The Kappa coefficient of agreement was found to be very good (82%), as were the specificity, sensitivity, NPV, and PPV (over 89%). No great improvement was observed when including the patients' diabetes status in the HECM for 2017 or 2019.

HIV infection
Only 1% of patients identified in the REIN database were HIV positive. HIV/AIDS status differed between the databases for approximately 0.4% of the population (Table 2). Among the 21 patients with discordant HIV status, none were transplant patients, a higher percentage were aged between 45 and 64, and most were identified as residents of Ile-de-France (Table 3). This comparison showed a good Kappa coefficient of agreement. The sensitivity and PPV were the lowest of the parameters measured, at 83% and 66%, respectively. An improvement to 0.2% was observed for the false positives when including information from the HECM for the year before and the year after recruitment.

Coronary disease
Twenty-four percent of the patients identified in the REIN database were recorded as having coronary disease. Coronary disease status differed for 15% of the population, and two-thirds of these discordant patients were false positives (Table 2). The 872 unmatched patients based on coronary disease status were more likely to be patients who died early or started treatment with dialysis (Table 3). The sensitivity was 79% and the specificity 87%. The Kappa coefficient of agreement between the REIN and SNDS databases on coronary disease was 62%. The level of agreement improved to 75% and 69% when considering the information from 2017 and 2019 from the HECM, respectively.

Discussion
In this study, we compared the information on patient conditions between the REIN registry, which is based on clinical data, and the HECM algorithm, which is based on healthcare reimbursement data. The agreement between diagnoses as identified by the REIN and the SNDS varied between conditions, with the highest agreement for diabetes and the lowest for cancer. Specificity was above 85% and the PPV over 95% for all conditions, suggesting overall good performance of the HECM algorithms in identifying the conditions of interest in this study.

Ease of diagnosis
Pathologies with tracer drugs or tracer medical acts are better identified in medico-administrative reimbursement databases [12,13]. The Kappa coefficients for the status of diabetes and HIV were higher than those for coronary disease and cancer. These comorbidities identified in the SNDS database are treated with medications that are specifically used for the disease, allowing us to identify patients whose LTD registration or hospitalization diagnoses are not reported. Coronary disease as a medical diagnosis is slightly more difficult to identify in the SNDS database, as it relies on discharge records for patients hospitalized during the given period or an LTD reported in the four years before the year of interest. There are no specific drugs or medical procedures integrated into the HECM that can help identify patients who do not meet the specified conditions. The REIN database benefits from direct patient interviews and medical records to record information on these conditions. The definition of active cancer in the HECM is based on patients with a reported LTD and hospital diagnosis during the year. These definitions could lead to an underestimation of patients who either did not receive treatment or whose treatment was received in ambulatory care (whose LTD is not reported for the year of interest). As an example, a patient receiving antiestrogen therapy for breast cancer in an ambulatory setting, without hospitalization associated with the reported disease and with no LTD reported, could be missed by the HECM tool [14]. By contrast, the REIN database reports active cancer regardless of the patients' current treatment status. These differences in definition could explain some of the false-negative patients.
The sensitivity and specificity were high (> 80%) for most of the assessed diseases, except the sensitivity for coronary disease. This high level of sensitivity suggests that the HECM tool is able to identify patients with a disease (i.e., it is unlikely to produce false negatives). High specificity was seen for all comorbidities assessed, suggesting that few patients were categorized as having the condition when they did not (false positives).

Timelines
In comparing these databases, we should consider that the REIN database collects information prospectively and that the HECM categorizes diseases retrospectively for a given year. The identification of patients with ESKD improved when adding information from the year 2019. A large number of patients with mismatched ESKD status were coded in the SNDS as having acute kidney disease in the REIN incident year. These may have been patients with chronic kidney disease who started chronic dialysis after an acute episode and did not fulfill the required time under treatment to be classified as ESKD by the end of the year of interest. In addition, despite the work of the REIN registry's research assistants, whose mission is to check the completeness of the cases and compliance with the protocol, we cannot rule out a few marginal errors.
Concerning false negatives for diabetes, a patient identified in the REIN database in December as being diabetic who did not fulfill the requirements to be identified by the HECM for that year (e.g., three antidiabetic drug deliveries are needed for identification through medication) would have resulted in a mismatch. Similarly, a patient identified in the REIN database in January as not having cancer might have developed the disease later in the same year, and the HECM tool would register them as positive for cancer in that same year, resulting in a false positive. For coronary disease, we observed better performance when data for the year 2017 were added. This could be a result of the HECM considering data for the four years prior to the year of interest to classify patients, thereby adding information for patients from 2015.

Patient characteristics
We explored the characteristics of the unmatched and matched populations for each condition. We found a higher proportion of early deaths, first RRT with dialysis, males, and residents of Ile-de-France among the unmatched population. Patients with short survival would not have had the opportunity to have their record corrected in the REIN database and, in the SNDS, they may not have had sufficient healthcare consumption to be identified. First RRT with dialysis and residency in Ile-de-France were the largest subgroups for which linkage was more likely to be less precise. The Ile-de-France region is a densely populated region where patients can easily move between different facilities [8]. Patients could start their treatment in an ICU (recorded in the SNDS) in one postal code and later be transferred to a less medicalized center elsewhere (recorded in the REIN database). The prevalence of a disease in a population can influence the PPV and NPV. When prevalence increases, the PPV increases but the NPV decreases [15,16]. In this population, the prevalence of diabetes, coronary disease, and cancer was higher than in the general population (prevalence estimated to be 5.88%, 3.11%, and 4.98% in 2018, respectively, for the general French population [17]). These accuracy parameters (PPV and NPV) may, therefore, not be replicable in the general population.
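The dependence of the PPV and NPV on prevalence can be illustrated with a short calculation. The sketch below assumes, for illustration only, a sensitivity and specificity of 90% each (round values, not the study estimates) and compares the diabetes prevalence observed in this population (42%) with that estimated for the general French population (5.88%). The function name is hypothetical.

```python
def predictive_values(sensitivity: float, specificity: float, prevalence: float):
    """Return (PPV, NPV) for given accuracy measures and disease prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    return true_pos / (true_pos + false_pos), true_neg / (true_neg + false_neg)


# Illustrative sensitivity and specificity (assumed, not study estimates)
ppv_eskd, npv_eskd = predictive_values(0.90, 0.90, prevalence=0.42)
ppv_general, npv_general = predictive_values(0.90, 0.90, prevalence=0.0588)
print(round(ppv_eskd, 3), round(npv_eskd, 3))
print(round(ppv_general, 3), round(npv_general, 3))
```

With identical sensitivity and specificity, the PPV is markedly lower and the NPV higher at the general-population prevalence, which is why the predictive values reported here should not be transposed directly to other populations.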

Strengths and limitations
A strength of this study is that it used two national databases in which comorbidities are identified by two different methods. However, this study also had several limitations. First, even though the parameters used to categorize a patient as having or not having a condition are comparable between the two databases, they are not identical. Therefore, certain patients' conditions could be discordant even when correctly categorized in both databases. Unfortunately, among the 58 conditions of the HECM, only five had an identification method similar to that of REIN. Many medical conditions explored by the HECM are not collected in REIN, such as precise cancer location, psychiatric disorders, or neurodegenerative diseases. We recognise that the results observed for these five diseases would have been significantly poorer if we had used diseases whose identification methods differed from the outset.
Second, for legal reasons, the databases used do not have a shared unique identifier for patients, and we therefore relied on an indirect deterministic algorithm to link patient information between them. Even when only including patients with good linkage, there might have been certain patients who were imperfectly linked. The choice to keep only patients with a good match led to the exclusion of 2,338 patients. Given our objective, we consider that this did not constitute a bias, but it may have reduced the scope for extrapolating our comparison.
Third, HECM algorithms were designed for medico-economic rather than epidemiological purposes. As such, they do not aim to collect the exhaustive number of incident cases over one year, as economists are generally more interested in the longitudinal evolution of healthcare expenditure and consumption, observed on specific samples. The pathologies categorized by the algorithm are based on short periods, with individuals not taken into account when they are treated at the beginning or end of the year. This may explain the improvement in performance when the search was extended to the years 2017 and 2019. On the other hand, despite the fact that completeness and accuracy are ascertained by REIN research assistants during regular visits to every dialysis centre and updated at each annual visit, we cannot exclude coding errors in transcription from medical records.
The REIN database included only patients with ESKD, representing only a small proportion of the French population. Therefore, the generalisability of the results to other populations should be explored. Other French registries have successfully linked most of their patients (all over 85%) to the SNDS database: CONSTANCES, FRESH HR, ACIRA, France-TAVI, CANARI [18-21]. These linkages have been used to enrich the databases of the registries and could potentially be used as a starting point to further validate the HECM tool.

Conclusion
The development of tools that allow the use of medico-administrative databases for epidemiological research is of great importance, as they provide information at the national level, limiting the costs and time required for more traditional data collection methods. The HECM algorithm matched the information provided by the REIN database with that of the SNDS database relatively well. However, further validation of the HECM tool in other populations should be performed.

Table 1
Description and comparison of the included and excluded populations according to the linkage between the SNDS and REIN databases

Table 2
Comparison between patient comorbidities in the two databases

Table 3
Characteristics of the matched and unmatched populations by disease