Evaluating the relationship between clinical and demographic characteristics of insulin-using people with diabetes and their health outcomes: a cluster analysis application

Background: The aim of this study was to determine how clusters or subgroups of insulin-treated people with diabetes, based upon healthcare resource utilization, select social demographic and clinical characteristics, and diabetes management parameters, are related to health outcomes including acute care visits and hospital admissions.
Methods: This was a non-experimental, retrospective cluster analysis. We utilized Aetna administrative claims data to identify insulin-using people with diabetes with service dates from 01 January 2015 to 30 June 2018. The study included adults over the age of 18 years who had a diagnosis of type 1 (T1DM) or type 2 diabetes mellitus (T2DM) on insulin therapy and had Aetna medical and pharmacy coverage for at least 18 months (6 months prior and 12 months after their index date, defined as either their first insulin prescription fill date or their earliest date allowing for 6 months’ prior coverage). We used K-means clustering methods to identify relevant subgroups of people with diabetes based on 13 primary outcome variables.
Results: A total of 100,650 insulin-using people with diabetes were identified in the Aetna administrative claims database and met study criteria, including 11,826 (11.7%) with T1DM and 88,824 (88.3%) with T2DM. Of these, 79,053 (78.5%) people were existing insulin users. Seven distinct clusters were identified with different characteristics and potential risks of diabetes complications. Overall, clusters were significantly associated with differences in healthcare utilization (emergency room visits, inpatient admissions, and total inpatient days) after multivariable adjustment.
Conclusions: This analysis of healthcare claims data using clustering methodologies identified meaningful subgroups of patients with diabetes using insulin. The subgroups differed in comorbidity burden, healthcare utilization, and demographic factors, which could be used to identify higher risk patients and/or guide the management and treatment of diabetes.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12913-021-06603-0.


Introduction/background
Diabetes is a complex, chronic illness that affects about 30 million people or 9.4% of the population in the United States [1,2]. The population with diabetes continues to grow and the percentage of adults with diabetes increases with age, where approximately 25% of all adults over 65 years of age have diabetes [3]. Diabetes remains the seventh leading cause of death in the United States with approximately 80,000 death certificates in 2015 listing diabetes as the underlying cause of death [3], although the actual number of diabetes-associated deaths may be higher as diabetes is often underreported as a cause of death [3].
The majority of people with diabetes are classified as having type 2 diabetes mellitus (T2DM), occurring in about 90 to 95% of diagnosed cases [4,5], with approximately 5% of diagnosed cases classified as type 1 diabetes mellitus (T1DM) [4]. T1DM is characterized by the lack of insulin production, and as such patients require insulin to survive [5]. T2DM is characterized by the body's resistance to insulin action in addition to a relative insulinopenia [6], and is usually diagnosed in adults. Genetic factors and lifestyle play an important role in disease progression for people with either T1DM or T2DM [4].
People with diabetes visit physician offices and emergency rooms (ER) more frequently than people without diabetes, are more likely to be admitted to the hospital, and more commonly receive home health care [7]. The financial burden of diabetes to individuals and society is estimated to be a total of $327 billion, including $237 billion in direct costs and $90 billion in reduced productivity in 2018 [3,7]. Healthcare expenditures are about 2.3 times higher for people with diabetes than for people without diabetes, with average diabetes-related medical expenditures of approximately $9600 per year in the United States [7].
Although all people with T1DM use insulin, insulin use varies among people with T2DM. Most are initially managed with oral hypoglycemic agents but over time, most will ultimately utilize insulin. The majority of the diabetic population uses injection-based administration with insulin pens or vials and syringes while the remainder uses insulin pumps (continuous subcutaneous insulin infusion) [8,9]. Since 2005, there has been a significant increase in the use of insulin pens while the use of vials and syringes to deliver insulin has decreased over time [10]. Insulin pens have simplified the administration of insulin, resulting in more accurate and easier delivery of insulin relative to vials and syringes [11]. Pumps deliver rapid-acting insulin throughout the day. Diabetes-related technologies are available to monitor blood glucose and include the standard blood glucose monitor (single reading) and continuous blood glucose monitoring systems (real time and intermittently scanned) [8].
In 2005-2012, among patients who had any insulin use, only 31.4% had a hemoglobin A1c (HbA1c) < 7%; therefore, as glycemic control is still not attained by most people with diabetes, there is a need for new approaches to identify subgroups of people with diabetes (T1DM or T2DM) who might have risk factors that could impact treatment decisions and targeted disease management [12]. Given the differences across the US population, especially in disease severity and utilization of healthcare, we sought insight from a large, real-world database utilizing Aetna's administrative claims. In this study, we used clustering techniques to identify subgroups of people treated with insulin based upon healthcare resource utilization, select social demographic and clinical characteristics, and diabetes management parameters. We then related these subgroups to health outcomes, including ER visits, hospital admissions, and total inpatient days.

Study design and data sources
We used a non-experimental, retrospective design in this study utilizing Aetna's administrative claims data containing membership, eligibility, medical claims, pharmacy claims, laboratory test results, and data derived for Aetna's care management processes (Health Profile Database [13]) for Aetna fully insured Commercial and Medicare Advantage members, with service dates from 01 January 2015 to 30 June 2018. All of the data used in this study were fully de-identified and Health Insurance Portability and Accountability Act compliant.

Sample selection and patient population
Using the International Classification of Diseases Ninth and Tenth Revision (ICD9-CM v1 and ICD10-CM) diagnostic codes, the study included adults 18 years of age and older who had a diagnosis of T1DM or T2DM, utilized insulin therapy, and had Aetna medical and pharmacy coverage for at least 18 continuous months (6 months of coverage prior and 12 months after their index date, defined as either their first insulin prescription fill date or their earliest insulin fill date allowing for 6 months' prior coverage; Fig. 1). People with diabetes were excluded from the analyses if any of the following criteria were met during the entire study period: had ≥1 inpatient or outpatient medical claim with a diagnosis in any position of gestational diabetes, steroid-induced diabetes, or metastatic cancer; had indications of hospice use; or were enrolled in Aetna's Compassionate Care Program. The study was approved by the Sterling Institutional Review Board (Atlanta, Georgia, USA).
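The index-date rule above can be illustrated with a short Python sketch. This is not the study's code: the function names, the single continuous coverage span, and the day-of-month clamping in `add_months` are all assumptions made for illustration.

```python
from datetime import date


def add_months(d, months):
    """Shift a date by whole months (day clamped to 28 so the result is always valid)."""
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, min(d.day, 28))


def find_index_date(fill_dates, coverage_start, coverage_end,
                    pre_months=6, post_months=12):
    """Return the earliest insulin fill with at least `pre_months` of coverage
    before it and `post_months` after it, or None if no fill qualifies.

    Assumption: coverage is one continuous span [coverage_start, coverage_end].
    """
    for fill in sorted(fill_dates):
        has_pre = add_months(fill, -pre_months) >= coverage_start
        has_post = add_months(fill, post_months) <= coverage_end
        if has_pre and has_post:
            return fill
    return None
```

For a hypothetical member covered from 01 January 2015 to 30 June 2018 with fills on 10 March 2015 and 15 August 2015, the first fill lacks 6 months of prior coverage, so the August fill would become the index date.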

Statistical analysis
Descriptive analyses were performed initially on the overall study population and for T1DM and T2DM separately. After clustering, descriptive analyses were completed for each cluster. Descriptive statistics were generated for all demographic characteristics and utilization measures as applicable to the type of variable. Continuous variables were described using means with standard deviations (SD) or medians with first and third quartiles if data were highly skewed. Categorical and binary variables were described using counts and percentages. Healthcare utilization was reported as means and SD and number of patients with ≥1 visit.
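A minimal Python sketch of these summaries follows, using only the standard library. The function names are hypothetical, and the caller decides whether a variable counts as highly skewed, since no skewness threshold is stated here. The quartiles use the default (exclusive) method of `statistics.quantiles`, which may differ slightly from SAS or R defaults.

```python
import statistics


def describe_continuous(values, skewed=False):
    """Mean (SD) by default; median with first and third quartiles if skewed."""
    if skewed:
        q1, median, q3 = statistics.quantiles(values, n=4)
        return f"median {median:.1f} (Q1 {q1:.1f}, Q3 {q3:.1f})"
    return f"mean {statistics.mean(values):.1f} (SD {statistics.stdev(values):.1f})"


def describe_binary(flags):
    """Count and percentage of true values, e.g. patients with >=1 ER visit."""
    n = sum(1 for flag in flags if flag)
    return f"{n} ({100 * n / len(flags):.1f}%)"
```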

Cluster analysis
A cluster analysis was performed on 13 pre-period variables that were hypothesized to be related both to outcomes and to the use of diabetes technology (blood glucose monitor [BGM], continuous glucose monitoring [CGM], or insulin pumps). Variables were identified by the study team based on clinical opinion and were limited to variables found in the administrative claims database. They included age, number of endocrinologist visits, diabetes complications severity index (DCSI) [14], Charlson comorbidity index (CCI) [15], number of HbA1c tests, number of months on insulin, number of ER visits, total number of inpatient days, number of discordant comorbidities, number of concordant comorbidities, number of medical claims, proportion of diabetes-related claims, and estimated income. Concordant comorbidities were defined as those that share the same pathophysiologic risk profile and management plan as diabetes, whereas discordant conditions are those that are not directly related to diabetes in either their pathophysiology or management [16]. As the underlying pathophysiology differs for T1DM and T2DM, concordant and discordant comorbidities were defined separately for each type of diabetes (see Supplemental Materials Table S1). Clinically dominant comorbidities, those that are so complex or serious that they tend to eclipse the management of other conditions [16] such as end stage renal failure or dementia, were considered but not found in the study population. Claims were considered diabetes related if they contained a diagnosis for diabetes in any position. All 13 cluster variables (mentioned above) had significant bivariate associations with utilization of diabetes technology and were standardized according to the type of variable. We opted to use variables associated with the utilization of diabetes technology in the cluster analysis to identify unexpected patterns in the characteristics of patients who utilized different types of technology.
After clustering, utilization of diabetes technology along with other relevant variables such as diabetes type, insurer, and HbA1c levels (when available, for 31.2% of the population) were reported to help characterize the clusters.
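The exact standardization applied to each variable type is not specified in the text. As one common possibility, the sketch below applies a z-score transformation so that variables on very different scales (for example, age in years and number of ER visits) contribute comparably to the distances used in clustering. The function name and the choice of z-scores are assumptions.

```python
import statistics


def zscore(values):
    """Center to mean 0 and scale to unit (population) standard deviation.

    Assumption: a plain z-score; the study may have used other transformations
    depending on variable type.
    """
    mu = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        # A constant column carries no information for clustering.
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]
```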
We utilized K-means methodology to identify the clusters based on the following 13 variables: age, number of endocrinologist visits, DCSI, CCI, number of HbA1c tests, number of months on insulin, number of ER visits, total number of inpatient days, number of discordant comorbidities, number of concordant comorbidities, number of medical claims, proportion of diabetes-related claims, and estimated income. To determine the optimal number of clusters, we used the Jump method [17], a non-parametric method for choosing the number of clusters based on "distortion," which is a measure of within-cluster dispersion. We chose the Jump method as it is a simple and readily available method that has performed well against other competing methods in simulated data analyses [17].
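To make the procedure concrete, here is a small, self-contained Python sketch of K-means combined with the Jump method. It is an illustration under stated assumptions, not the study's implementation: distortion is computed as the average squared Euclidean distance to the nearest center per dimension (identity covariance), the transformation power is p/2 for p variables, and the deterministic farthest-point seeding is our choice to keep the example reproducible. For each K, the transformed distortion d_K^(-p/2) is computed, and the K with the largest jump from K-1 is selected.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(points, k):
    """Deterministic seeding (an assumption): start at points[0], then keep
    adding the point farthest from all centers chosen so far."""
    centers = [tuple(points[0])]
    while len(centers) < k:
        nxt = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(tuple(nxt))
    return centers


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]


def kmeans(points, k, max_iter=100):
    """Plain Lloyd iterations; returns (centers, labels)."""
    centers = farthest_point_init(points, k)
    for _ in range(max_iter):
        labels = assign(points, centers)
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old center if a cluster ends up empty.
            updated.append(tuple(sum(col) / len(members) for col in zip(*members))
                           if members else centers[j])
        if updated == centers:
            break
        centers = updated
    return centers, assign(points, centers)


def distortion(points, centers):
    """Average squared distance to the nearest center, per dimension."""
    p = len(points[0])
    total = sum(min(sq_dist(pt, c) for c in centers) for pt in points)
    return total / (len(points) * p)


def jump_method(points, max_k):
    """Return (best_k, jumps) using the transformed distortion d_K ** (-p / 2).

    max_k must be smaller than the number of distinct points so that the
    distortion never reaches zero.
    """
    power = len(points[0]) / 2
    transformed = [0.0]  # convention: the transformed distortion for K = 0 is 0
    for k in range(1, max_k + 1):
        centers, _ = kmeans(points, k)
        transformed.append(distortion(points, centers) ** (-power))
    jumps = {k: transformed[k] - transformed[k - 1] for k in range(1, max_k + 1)}
    return max(jumps, key=jumps.get), jumps
```

On well-separated synthetic groups, the largest jump occurs at the true number of groups. On real claims data the jumps are usually less clear-cut, so it is worth inspecting the full set of jumps rather than only the maximum.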
To identify which variables played a key role in the formation of clusters and to obtain simple descriptive rules for clusters, classification and regression trees (CART) were used with cluster assignment as the "outcome" variable and all variables used in clustering as the "predictors." This analysis was performed using the ctree function in the R package partykit.
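The idea of using a tree to describe clusters can be sketched with a single CART-style split search in Python. This is only an illustration: it uses Gini impurity, whereas the `ctree` function in partykit selects splits with permutation-based significance tests, and a full tree would apply the search recursively. The function and feature names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of cluster labels (0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels, feature_names):
    """Find the rule 'feature <= threshold' that best separates the clusters.

    Returns (feature_name, threshold, weighted_gini), or None if no feature
    has at least two distinct values.
    """
    n = len(rows)
    best = None
    for j, name in enumerate(feature_names):
        values = sorted({row[j] for row in rows})
        # Candidate thresholds are midpoints between consecutive observed values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [lab for row, lab in zip(rows, labels) if row[j] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[j] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (name, threshold, score)
    return best
```

A variable that yields a low weighted impurity at the top of the tree, such as a hypothetical `endo_visits` column, is one that plays a key role in separating the clusters.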
Outcomes of interest, measured in the 12-month follow-up period, included: all-cause ER utilization, all-cause inpatient hospitalization, and total inpatient days.
To determine if the clusters were associated with the outcomes measured during the study, multivariable generalized linear models were applied with clusters as the independent variables and other covariates or interactions to account for confounding. The generalized linear models included binomial distribution and logit link function (ER and inpatient hospitalization outcomes) and negative binomial distribution with log link function (total inpatient days outcome). For the multivariable regression analyses, we used a stepwise selection methodology (or backward elimination) with variable removal when p ≥ 0.05. Akaike Information Criterion was also used to determine the best fit model. All statistical analyses were conducted using SAS Enterprise Guide Version 6.1 (SAS Institute Inc., Cary, NC, USA) or R version 3.2.5 (R Foundation for Statistical Computing, Vienna, Austria).
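The model forms described above can be written out as follows. The notation is introduced here for illustration and is not taken from the study: C_i is the cluster of person i, x_i the retained covariates (interaction terms omitted for brevity), Y_i a binary ER or inpatient outcome, D_i the total inpatient days with mean \mu_i, \phi the negative binomial dispersion parameter (assuming the common quadratic-variance parameterization), and p the number of estimated parameters. Cluster 7 is taken as the reference category, consistent with the comparisons reported in the results.

```latex
\begin{align}
\operatorname{logit}\Pr(Y_i = 1) &= \beta_0 + \sum_{k=1}^{6} \beta_k\,\mathbf{1}[C_i = k] + \boldsymbol{\gamma}^{\top}\mathbf{x}_i,
  & \mathrm{OR}_k &= e^{\beta_k}, \\
\log \mu_i = \log \mathbb{E}[D_i] &= \alpha_0 + \sum_{k=1}^{6} \alpha_k\,\mathbf{1}[C_i = k] + \boldsymbol{\delta}^{\top}\mathbf{x}_i,
  & \operatorname{Var}(D_i) &= \mu_i + \phi\,\mu_i^{2}, \\
\mathrm{AIC} &= 2p - 2\ln\hat{L}.
\end{align}
```

Under this formulation, an odds ratio above 1 for cluster k indicates higher odds of the outcome than in Cluster 7 after adjustment, and the model with the lower AIC is preferred among candidates.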

Study population
A total of 100,650 insulin-using people with diabetes met the study criteria, including 11,826 (11.7%) with T1DM and 88,824 (88.3%) with T2DM. Table 1 shows the pre-period demographic characteristics for the total population and differences between people with T1DM and T2DM. The majority of study patients (78.5%) were existing insulin users at the index date, whereas the remaining 21.5% were new insulin initiators. The mean age (SD) was 62.2 (14.2) years with a mean age (SD) of 45.3 [...]. Overall, the mean (SD) CCI score was 1.9 (1.6) and the mean DCSI was 1.1 (1.5), both of which were lower for those with T1DM (Table 1). A total of 60,668 people (60.3%) had one or more HbA1c test performed and the average HbA1c value was 8.8% among those individuals (31.2%) with values in the database. Approximately 6.4% of the population used an insulin pump (37.9% T1DM and 2.2% T2DM) and 67.5% used a pen (38.7% T1DM and 71.4% T2DM). Blood glucose meters were utilized by approximately one-third of the population (33.3%). On the other hand, CGMs were only used by 3.0% overall, where T1DM patients had greater use compared to those with T2DM (19.2 and < 1%, respectively).
The healthcare utilization over the 6-month pre-period is shown in Table 2. The majority of the population had primary care physician (PCP) visits (72.3%), but the proportion with endocrinologist and cardiologist visits was lower at 21.3 and 17.8%, respectively. Inpatient hospitalizations occurred for 14.0% of the population and 18.8% had ER visits. Overall, people with T1DM had lower utilization of all-cause and diabetes-related healthcare services than people with T2DM, with the exception of endocrinologist visits.

Cluster formation and analysis
We identified seven clusters of people with diabetes, which had distinguishable characteristics and risk factors. Characteristics of the seven identified clusters are shown in Table 3. Seven of the 13 clustering variables studied using the CART analysis were identified as being the most important factors for cluster formation. These included number of endocrinology visits, total inpatient days, concordant comorbidities, number of ER visits, comorbidity burden as measured by CCI and DCSI scores, and percentage of diabetes-related medical claims. Figure 2 illustrates the CART analysis after the clusters were formed.
The seven clusters fell into three main groupings in hierarchical order (Fig. 3): those with endocrinology visits (Clusters 2 and 6) with differing concordant burden profiles; those with high acute care utilization (Clusters 4 and 5) with differences in inpatient days and ER visits; and those with low ER utilization (Clusters 1, 3, and 7) with varying comorbidity burden, diabetes utilization, and complication profiles.

Endocrinology utilizers clusters
All patients in Cluster 2 had at least one endocrinologist visit (mean = 1.8 visits/patient), and the average number of concordant comorbidities was 4.6. Cluster 6 had the youngest mean age (38.3 years) and the second highest utilization of endocrinologists (mean = 0.6 visits/patient).
Table notes. Abbreviations: BGM = blood glucose monitoring; CGM = continuous glucose monitoring; HbA1c = hemoglobin A1c; n, N = number of people with diabetes; Q1 = first quartile; Q3 = third quartile; SD = standard deviation; T1DM = type 1 diabetes mellitus; T2DM = type 2 diabetes mellitus; US = United States. (1) Pre-period was defined as 6 months prior to index date. (2) Proportion is set to zero for those members who had zero medical claims overall during the baseline period. (3) Diabetes-related determined by diagnosis in any diagnostic position on a medical claim.
[...] young mean age, nearly all patients in Cluster 6 had commercial insurance. The highest proportion of Medicare Advantage members (and oldest mean age) was in Clusters 1 and 4 (> 80%). The highest mean HbA1c value (among the subset of the population with available values) was in Cluster 3 (9.4%). On the other hand, the lowest mean HbA1c of 8.3% was observed in Cluster 4, which also had the highest proportion without prior insulin use in the pre-period and the lowest mean number of months where insulin was filled.

Multivariable modeling analysis
For the multivariable modeling analysis, three key outcomes were chosen: all-cause ER visit (Fig. 4), all-cause inpatient hospitalization (Fig. 5), and total inpatient days (Fig. 6) during the follow-up period. Across all models, clusters were independently associated with the outcomes of interest after controlling for covariates that were potentially related to cluster assignment and/or outcomes of interest. Clusters 1, 2, 4, and 5 had higher odds of an ER visit, whereas Cluster 6 had lower odds compared to Cluster 7. Clusters 2, 4, and 5 also had higher odds of having a hospital admission. If hospitalized, Cluster 4 had significantly increased total inpatient days compared to Cluster 7.

Discussion and conclusions
The aim of this study was to determine how clusters or subgroups of insulin-treated people with diabetes, based upon healthcare resource utilization, select social demographic/clinical characteristics, and diabetes management parameters, are related to health outcomes including acute care (ER and hospital inpatient) visits and total inpatient days. We did this study to help identify groups of patients that may be amenable to emerging diabetes management technologies. In this study, we identified seven clusters of insulin-treated people with diabetes, which have different patterns of healthcare utilization and diagnosed comorbidities in a large healthcare claims database. The most important factors in defining the clusters were the number of endocrinology visits, total inpatient days, concordant comorbidities, number of ER visits, comorbidity burden as measured by CCI and DCSI scores, and percentage of diabetes-related medical claims. Multivariable modeling showed that these clusters are significantly associated with ER visits, inpatient hospitalizations, and total inpatient days, suggesting that this approach may help identify patients at greater need for targeted disease management efforts at the population level. The clusters also offer providers clinically relevant information regarding treatment decisions for a patient population with diabetes. Cluster analyses can reveal how variables, in our case administrative claims from people with diabetes, are related in complex datasets. The use of cluster analyses in healthcare decision making is still relatively uncommon but appears to be gaining acceptance [18-23].
Our work builds upon a few previously published cluster analyses in diabetes, which focused on readiness of CGM and other diabetes-related devices, self-management patterns in a pediatric population, factors influencing people with diabetes who have poorly controlled conditions, and identifying diabetes phenotypes [23-27]. These previous studies involved smaller numbers of participants from relatively homogeneous populations (e.g., T1DM registry) and/or more controlled conditions (e.g., clinical trial). In contrast, our study used a large healthcare claims database to evaluate whether routinely available data could identify relevant subgroups of insulin users.
Not only did clusters differ with respect to the specific variables used to form them (by design), but also on important other characteristics. Clusters 2 and 6 were formed primarily based on the use of endocrinologists. Not surprisingly, these were the clusters with the highest proportion of people with T1DM as well as utilizers of diabetes technology (pump, BGM, or CGM). Those in Cluster 2, however, had higher comorbidity burden and mean number of HbA1c tests than those in Cluster 6, but there was little difference in mean HbA1c values for these two clusters.
Two clusters were identified with high levels of acute care utilization. Those in Cluster 4 had the highest total inpatient days and everyone in Cluster 5 had an ER visit. These clusters differed, however, in their comorbidity burdens and glycemic control. Interestingly, the lowest mean observed HbA1c value was for Cluster 4, with the highest levels of overall medical utilization (median of 43.0 claims), acute care utilization (99.1% had an inpatient hospitalization), and highest CCI and DCSI scores. These results could suggest that a high burden of comorbidities or diabetes complications and increased interactions with hospitals facilitated more intensive diabetes management. However, because HbA1c values were only available on a subset of the study population (approximately 30%), additional analyses on datasets with more complete HbA1c data are needed to confirm this finding.
Conversely, higher mean HbA1c values were observed among Clusters 3, 5, and 7 (in order of highest to lowest values). Clusters 3 and 7 differ from Cluster 5 in that they fell into the low utilization grouping (both acute care and overall utilization via number of medical claims) and had among the lowest CCI and DCSI scores. They differed from each other in one key aspect: the proportion of medical claims that were diabetes-related. Approximately three-fourths of the claims for Cluster 3 were related to diabetes, compared to less than 20% in Cluster 7. Because the CCI and DCSI scores are derived from the presence of diagnosis codes in claims data, on one hand it is not surprising that these clusters, which have the lowest overall number of medical claims, have the lowest scores, as there are fewer opportunities to derive those diagnoses. On the other hand, the lack of diagnoses of comorbidities in the observed claims or lack of encounters altogether could suggest a healthier underlying population. Either way, the relatively high observed HbA1c values along with the low rates of interactions with healthcare providers suggested suboptimal diabetes self-management.
The current study demonstrated that even after adjusting for other covariates, cluster assignment was significantly predictive of future outcomes. Specifically, cluster assignment was associated with the likelihood of experiencing an ER or hospital inpatient visit and the total number of inpatient days for those with an admission. These results suggest that the specific combination of variables used in the cluster formations sheds additional light on the risk of untoward outcomes above and beyond traditional risk stratification, for example, based upon parameters including diabetes type, age, and HbA1c. Furthermore, as these clusters were derived from variables routinely found in healthcare claims data where detailed clinical data are often missing, this approach could aid healthcare payers with population management efforts. We found that some clusters utilizing fewer healthcare resources had higher observed mean HbA1c levels. This finding could suggest population management efforts in diabetes that are targeted at some of the lower healthcare utilizers in efforts to improve glycemic control, which could yield better long-term health outcomes for patients and improved quality metric ratings for providers and payers.
This study has limitations that should be considered. The cluster analysis was based on administrative claims data that rely on diagnostic codes entered by the healthcare provider for billing purposes. As such, they are proxies for clinical outcomes and may be prone to data coding errors or inaccuracies in patient records. Within this particular population, patients may have diagnostic codes for both type 1 and type 2 diabetes during the course of the study period, and while our algorithm to classify patients considered the preponderance of diagnostic codes among all available, in conjunction with treatment patterns common to each type of diabetes, it did not preclude the possibility of misclassification. Additionally, as of October 2015, all claims switched from ICD-9 to ICD-10 (study period January 2015 to June 2018) [28]. This switch possibly could have led to inaccuracies in coding due to unfamiliarity with the new system or mistakes in cross walking codes from ICD-9 to ICD-10 by providers. This potential for error should have had limited impact since condition identification was based on families of codes with multiple codes within a family and not on any single code. The comparisons of HbA1c values were incomplete as only a subset of patients (~30%) had values in the database. A number of relevant risk factors, including insulin dosing, diet, and exercise, were not available in the database.
Despite the limitations, this study was based on a well-developed study design and included data from a large number of insulin-using people with either T1DM or T2DM. Evaluating the impact of patient-reported outcomes and more socioeconomic data on cluster formations would be of interest to study in the future.

Conclusion
In conclusion, we demonstrated that clustering analyses of healthcare claims data identified meaningful subgroups of patients that differed in comorbidity burden, healthcare utilization, and demographic factors. These clusters were found to be significantly associated with future outcomes indicating that providers and population health managers may be able to better estimate risk, based upon combinations of specific variables, and modify and/or personalize treatment accordingly.