
COVID-19 severity scale for claims data research



To create and validate a methodology to assign a severity level to an episode of COVID-19 for retrospective analysis in claims data.

Data Source

Secondary data obtained by license agreement from Optum provided national claims records for 19,761,754 persons, of whom 692,094 had COVID-19 in 2020.

Study Design

The World Health Organization (WHO) COVID-19 Progression Scale was used as a model to identify endpoints as measures of episode severity within claims data. Endpoints used included symptoms, respiratory status, progression to levels of treatment and mortality.

Data Collection/Extraction methods

The strategy for identification of cases relied upon the February 2020 guidance from the Centers for Disease Control and Prevention (CDC).

Principal Findings

A total of 709,846 persons (3.6%) met the criteria for one of the nine severity levels based on diagnosis codes with 692,094 having confirmatory diagnoses. The rates for each level varied considerably by age groups, with the older age groups reaching higher severity levels at a higher rate. Mean and median costs increased as severity level increased. Statistical validation of the severity scales revealed that the rates for each level varied considerably by age group, with the older ages reaching higher severity levels (p < 0.001). Other demographic factors such as race and ethnicity, geographic region, and comorbidity count had statistically significant associations with severity level of COVID-19.


A standardized severity scale for use with claims data will allow researchers to evaluate episodes so that analyses can be conducted on the processes of intervention, effectiveness, efficiencies, costs and outcomes related to COVID-19.


A novel COVID-19 severity scale is developed for researchers who use claims data.

Costs increase with severity of COVID-19 episodes.

Demographic factors such as age, race, geographic region, and comorbidity count had statistically significant associations with severity level of COVID-19.



COVID-19, the illness caused by Severe Acute Respiratory Syndrome (SARS) coronavirus 2, arose as a global pandemic in 2020, continuing into 2021. Symptoms of the disease range widely from none to death, and result in high demand on health care providers, hospitals and resources, such as emergency care services and inpatient beds, intensive care beds and complex life-saving equipment. Correspondingly, the median costs associated with COVID-19 treated cases were estimated by Bartsch in 2020 to range from $3,994 for mild symptomatology to $18,579 for hospitalized cases [1]. A later study by Tsai et al. used Medicare claims revealing higher costs for hospitalization, at a mean of $21,752 up to $49,441 if mechanical ventilation was required [2]. Several studies have shown that patient characteristics such as age, gender, and comorbidities have impacted both the risk for infection as well as the severity of illness [3, 4, 5].

The World Health Organization (WHO), working with the International Forum for Acute Care Trialists and the International Severe Acute Respiratory and Emerging Infections Consortium, developed a WHO Clinical Progression Scale to measure the viral burden of COVID-19 and to assess the patient trajectory and resources used over the course of COVID-19 [6]. Other scoring instruments have been developed for emergency department triage or prediction of mortality [7]. These tools are primarily used to identify patient risk in order to optimize inpatient management.

To date no known tool exists to assign a severity level to an episode of COVID-19 for purposes other than patient management. We propose a reliable and reproducible method to allow researchers to evaluate episodes as opposed to acute hospitalization so that analyses can be conducted on the processes of treatment intervention, treatment effectiveness, treatment efficiencies, and health care costs and outcomes related to COVID-19.

Healthcare claims data are available through payor sources such as health insurance carriers, Medicare and Medicaid, as well as from companies that aggregate, de-identify and license use of large administrative claims databases. Health claims data provide valuable information on insured persons across time regardless of provider or provider group. Specific diagnoses and procedures are documented for services and exist in the data both historically and linked to treatment events by date. Thus, claims data allow for identification of specific individuals or cohorts who meet designated medical or demographic criteria, allowing the researcher to evaluate co-morbidities, utilization patterns, and costs of services, to define episodes of care, and to measure outcomes at both the individual and the population level. Claims data research has been used effectively for policy analyses and to inform population health initiatives.

Claims data are, however, subject to a time lag, whereby the provider submission of a claim for reimbursement follows the actual service, and the adjudication of the claim by the payor is a process that also delays data. Typically, claims data are added to the database when the claim is processed and paid by the carrier, so often are incomplete until 90 days after submission. As the COVID-19 pandemic began in the United States in early 2020 and continues to date, the full year of 2020 claims were determined to be available in April 2021 for analysis.


The database used for the development of this scale was Optum’s Clinformatics® Data Mart (CDM), which is derived from administrative health claims for members of large commercial and Medicare Advantage health plans (Optum® de-identified COVID-19 Electronic Health Record dataset, 2007–2020). The database includes approximately 19 million annual covered lives, for a total of over 65 million unique lives over a 14-year period (1/2007 through 12/2020). Clinformatics® Data Mart is statistically de-identified under the Expert Determination method consistent with HIPAA and managed according to Optum® customer data use agreements. CDM administrative claims submitted for payment by providers and pharmacies are verified, adjudicated and de-identified prior to inclusion. These data, including patient-level enrollment information, are derived from claims submitted for all medical and pharmacy health care services with information related to health care costs and resource utilization. The population is geographically diverse, spanning all 50 states.


The COVID-19 scale was applied to the national claims data in the Optum CDM for all medical claims in 2020. Of the 19,761,754 total unique persons with enrollment information in the database in 2020, 692,094 (3.5%) met the criteria for one of the severity levels based on diagnosis codes. The age distribution was as expected with infection rates generally rising with increasing age (Table 1).

Table 1 Percentage of persons in each age group with a COVID-19 diagnosis

As shown in Table 2, over half of all patients (60%) fell into Severity Level 2: a confirmed diagnosis of COVID-19 but asymptomatic and ambulatory. Another 14% remained ambulatory (Level 3), resulting in 72% of diagnosed cases that did not require a higher level of care. A further 12% utilized the emergency department (Level 4) but did not require admission, and the remaining 12% (Levels 5–9) were hospitalized at various levels of severity. Slightly more than 2% died during hospitalization (Level 9). The rates for each level varied considerably by age group, with the older ages reaching higher severity levels (p < 0.001). Other demographic factors such as race and ethnicity, geographic region, and comorbidity count had statistically significant associations with severity level of COVID-19 (Table 2).

Table 2 Distribution of severity by age, gender, race, region, comorbidities and cost

Costs also varied significantly by severity level, which was an expected finding because the levels were related to intensity of resource use, which drives costs. Table 3 presents the mean and median costs per person in relation to severity level. Patients who reached severity level 8 and survived incurred the highest costs related to COVID-19, at a median of $197,007. The highest severity level, level 9, had lower median costs than the level below, which is likely explained by death in hospital resulting in shorter lengths of stay. In Table 4, the gamma regression analysis shows how cost changes at each level of a predictor compared to the reference level while holding all other variables constant. The regression shows a significant relationship between cost and severity, in that more severe cases predicted higher costs while controlling for the other factors. The most pronounced difference can be found between severity levels 3 and 4: the cost at severity level 4 is exp(1.86) = 6.42 times the cost at level 3. Similar trends were observed at higher levels, such as 6.05 times the cost at level 5 compared to level 4, and 2.53 times the cost at level 8 compared to level 7.

Table 3 Mean and Median cost of care by severity level for patients with COVID-19
Table 4 Generalized Gamma Regression of Cost and Severity
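
The conversion from a regression coefficient to a cost multiple quoted above can be reproduced with a few lines of arithmetic. The sketch below assumes the gamma regression used a log link (the text's use of exp(1.86) suggests this, but the link function is not stated), and the function name `cost_ratio` is illustrative only. Under that assumption, and with each level compared to the one directly below it, adjacent ratios would multiply to give the ratio across non-adjacent levels.

```python
import math


def cost_ratio(beta: float) -> float:
    """Convert a log-link coefficient into a multiplicative cost ratio."""
    return math.exp(beta)


# Coefficient reported in the text for level 4 versus level 3.
ratio_4_vs_3 = cost_ratio(1.86)
print(f"Level 4 vs. level 3: {ratio_4_vs_3:.2f} times the cost")

# Ratio reported in the text for level 5 versus level 4.
ratio_5_vs_4 = 6.05

# Assumption: with a log link and adjacent-level contrasts, the ratio
# between non-adjacent levels is the product of the adjacent ratios.
ratio_5_vs_3 = ratio_4_vs_3 * ratio_5_vs_4
print(f"Implied level 5 vs. level 3: {ratio_5_vs_3:.1f} times the cost")
```

Reading the coefficients as ratios rather than dollar differences is what makes the steep rise in cost at the upper levels visible.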


This study relied on claims data to identify persons with confirmed COVID-19. Confirmation of COVID-19 was determined by a diagnosis code on a claim submitted by a provider for medical or pharmacy services. The prevalence of confirmed COVID-19 cases, at 3.5%, was lower in the claims data than generally reported for the population. Sen et al. reported that 33% of the US population had been infected by the end of 2020, yet only 11.8% of infections were documented, providing a comparative estimate of 3.9% of the population with documented infection [8]. The lower rate may be due to several factors, including (1) testing not recorded with a health care claim, (2) diagnosis not assigned on the testing claim, (3) cases confirmed through non-billed sources, such as public health agencies, resulting in undercounting due to care received without a related bill for service [9], and (4) selection bias in studying only persons with commercial health insurance [9].

The COVID-19 pandemic impacted the health care system through excessive demands on resources such as intensive care facilities, over-taxed hospital capacity, and a shortage of health care providers, all of which may have influenced the progression of an individual’s disease severity. These factors could not be controlled for in the model for evaluation. An additional limitation related to high-cost claimants whose claims were allocated to a stop-loss account once an annual limit was reached, after which subsequent charges were not reported. This may have resulted in lower observed costs than actual costs incurred. However, the number of affected individuals was small, and any resulting bias in cost estimates was likely minor.

The recorded average costs are consistent with the studies by Bartsch and Tsai, with ambulatory patients incurring costs of less than $3,000 and costs increasing for hospitalized cases, with wide variation in the overall mean of $21,752 to $49,441 for Medicare patients [1, 2]. What has not been demonstrated previously is the extent to which health care costs are driven by the most severely affected patients. The present index was created based on the intensity of medical interventions, which are directly related to costs. What was not appreciated prior to this evaluation was the exponential shape of the cost curve, with the small number of patients receiving the most intensive care driving overall costs.

We believe that the COVID-19 scale would be useful for further research on both clinical and financial impacts of the disease. Severity Level 1 has limited utility for a cost analysis because it represented a small number of patients with limited information. However, we believe that this level may have potential value in other contexts, such as the exploration of long-term outcomes (i.e., “post-acute COVID”) and their impact on comorbid conditions.

The authors present this scale for application by researchers who use claims data to evaluate the impact of the COVID-19 pandemic on individuals, populations, and policy. Standardization of a measure of severity would allow easier comparison of results across studies and facilitate a determination of reproducibility of findings in various settings and populations. The scale can be implemented in any claims-based dataset, such as those maintained by health plans, researchers, and health systems with claims-based records. It is relevant across age groups, sex, and payor groups (e.g., Medicare, Medicaid, and commercial). Future COVID-19 research will likely include analyses of the severity of COVID-19 events and the impact on continuing symptoms or complications. Additionally, over time, the value of Level 1 may increase as COVID-19 cases are less frequently documented by a provider and self-report increases. Finally, future use of the severity scale may benefit from the addition of information on vaccination history, for which various codes have been created.

To build the logic for the claims-based COVID-19 Severity Scale (referred to as the COVID-19 Scale), we modeled it upon the design of the WHO Progression Scale. For hospitalized patients, the WHO scale relied on clinical values documented in medical records, with a special focus on oxygen levels (FiO2 and pO2) and use of mechanical ventilation, renal dialysis and extracorporeal membrane oxygenation (ECMO) as key measures for patients at the highest levels of severity [6]. From data measured over the course of treatment, the WHO scale scored patient progression from 0 to 10 as follows: Score 0: uninfected; Scores 1, 2, and 3: ambulatory, mild disease; Scores 4 or 5: hospitalized, moderate disease; Scores 6, 7, 8, or 9: hospitalized, severe disease; and Score 10: dead.

Since documentation of oxygen levels (FiO2 and pO2) is not available in claims data, we modeled endpoints similar to those used in the WHO scale, using information that is routinely available within retrospective claims data. The modified measure is intended to be used as an index of episode severity as opposed to treatment progression. Endpoints used included symptoms, respiratory status, progression to levels of treatment (ambulatory, emergency department, inpatient admission), and mortality. For patients who required oxygen therapy, we relied upon documentation of the use of various levels of respiratory treatments, with specific focus on mechanical ventilation. Renal dialysis and ECMO are well documented in claims data and were incorporated for patients who required these additional treatments. Death is not always well documented in claims data unless it occurs in an inpatient setting, for which discharge status is coded as “expired.”

The strategy for identification of cases between January 1, 2020 and April 2020, before an official ICD-10 diagnosis code was issued for COVID-19, relied upon the February 2020 guidance from the Centers for Disease Control and Prevention (CDC) that combined codes for respiratory infections with code B97.29 (other coronavirus) [10]. Another challenge was that many persons during the pandemic developed a presumptive COVID-19 infection without a confirming diagnosis or a medical claim submitted by a provider for detection or treatment. Additionally, individuals began to experience sequelae of COVID-19 without an earlier diagnosis in the claims data. For these cases, the CDC published additional guidance with ICD-10 codes for “personal history of COVID-19” or “sequelae of COVID-19” [11]. The publication of the ICD-10 code for COVID-19 in April 2020 allowed confirmed cases to be documented in claims data. The coding logic is detailed in Appendix A and includes the ICD-10 codes used in the COVID-19 Scale.
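
The two-period identification strategy described above could be sketched roughly as follows. This is not the authors' Appendix A logic: the dedicated COVID-19 code U07.1 and its April 1, 2020 effective date are not named in the text, the respiratory-infection list is an illustrative assumption, and the function name `is_covid_claim` is hypothetical. The sketch also assumes that the B97.29 pairing is only applied before the dedicated code became available.

```python
from datetime import date
from typing import Iterable

# Assumptions for illustration; the study's actual code lists are in Appendix A.
COVID_CODE = "U07.1"            # dedicated ICD-10-CM COVID-19 code (assumed)
OTHER_CORONAVIRUS = "B97.29"    # "other coronavirus", named in the text
RESPIRATORY_CODES = {"J12.89", "J20.8", "J22", "J40", "J80"}  # illustrative list
STUDY_START = date(2020, 1, 1)
COVID_CODE_EFFECTIVE = date(2020, 4, 1)


def is_covid_claim(service_date: date, diagnosis_codes: Iterable[str]) -> bool:
    """Flag a claim as COVID-19 using the interim or the dedicated coding rule."""
    codes = {code.strip().upper() for code in diagnosis_codes}
    if service_date < STUDY_START:
        return False
    if service_date >= COVID_CODE_EFFECTIVE:
        # Dedicated code available: require it directly.
        return COVID_CODE in codes
    # Interim period: respiratory infection paired with "other coronavirus".
    return OTHER_CORONAVIRUS in codes and bool(codes & RESPIRATORY_CODES)


print(is_covid_claim(date(2020, 3, 10), ["J12.89", "B97.29"]))
print(is_covid_claim(date(2020, 6, 1), ["U07.1"]))
```

Keeping the date cut-off and the code sets as named constants makes it straightforward to rerun the identification under a different assumption, which is similar in spirit to restricting a sensitivity analysis to cases identified after April 2020.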

The original intent of the scale was to identify persons who had COVID-19 at any time during 2020, and to assign the highest severity level experienced by each person. If a person had more than one documented episode of COVID-19, the highest severity level was assigned to that person so that person-based research would capture the most debilitating state attained.

Like the WHO Progression Scale, we tiered the COVID-19 Scale by ordinal levels according to symptomatology, resource use, and mortality. The individual ranks are clearly defined, mutually exclusive, and ordered in a hierarchical progression reflecting clinical deterioration [11]. Levels 1–3 had no documentation of presentation to or treatment at an emergency department or an inpatient facility, yet are differentiated by documentation of diagnosis and symptoms (Level 1: no documented diagnosis but personal report of COVID-19; Level 2: diagnosis code but no symptom code; Level 3: diagnosis code and symptom code(s)). Because the scale was used initially to assess “subacute COVID” and other individual health impacts, we defined Level 1 to represent a personal history of COVID as documented in the claims without confirmatory diagnostic evidence. Because fewer than 1% met criteria for Level 1 (no confirmatory diagnosis yet personal report of COVID-19) and a diagnosis did not exist in the data, this level was excluded from further analysis.

The other ambulatory-only levels (Levels 2 and 3) were delineated by the existence of symptoms. Level 4 reflected treatment for COVID-19 at an emergency department without inpatient admission. Levels 5–9 all required an inpatient admission with progressively increasing levels of resource use or procedures reflecting respiratory distress or organ failure (Level 5: hospital admission with no oxygen; Level 6: hospital admission with non-invasive oxygen; Level 7: hospital admission with mechanical ventilation; Level 8: hospital admission with mechanical ventilation and renal dialysis or ECMO). Level 9 indicated death during the hospital treatment. See Appendix B for detailed Level definitions.
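
As a minimal sketch of how the hierarchy above might be applied, the function below assigns a level from a set of yes/no flags and then keeps the highest level per person. The `Episode` fields and function names are hypothetical, the flags are assumed to have already been derived from COVID-19-related claims, and the detailed definitions in Appendix B take precedence over this simplification.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Episode:
    """Flags summarizing one COVID-19 episode (hypothetical field names)."""
    confirmed_diagnosis: bool = False
    personal_history: bool = False
    symptoms: bool = False
    emergency_department: bool = False
    inpatient: bool = False
    noninvasive_oxygen: bool = False
    mechanical_ventilation: bool = False
    dialysis_or_ecmo: bool = False
    died_in_hospital: bool = False


def assign_level(ep: Episode) -> Optional[int]:
    """Return severity level 1-9, checking the most severe criteria first."""
    if ep.inpatient:
        if ep.died_in_hospital:
            return 9
        if ep.mechanical_ventilation and ep.dialysis_or_ecmo:
            return 8
        if ep.mechanical_ventilation:
            return 7
        if ep.noninvasive_oxygen:
            return 6
        return 5
    if ep.emergency_department:
        return 4
    if ep.confirmed_diagnosis:
        return 3 if ep.symptoms else 2
    if ep.personal_history:
        return 1
    return None  # no COVID-19 evidence in the claims


def person_level(episodes: Iterable[Episode]) -> Optional[int]:
    """Assign the highest level reached across a person's episodes."""
    levels = [lvl for lvl in map(assign_level, episodes) if lvl is not None]
    return max(levels) if levels else None
```

Checking criteria from most to least severe keeps the levels mutually exclusive, since each episode returns at the first rule it satisfies.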

Costs were computed for each patient with COVID-19 in 2020, considering only the costs associated with claims that included a COVID-19 diagnosis. Total costs were based upon both hospital/facility and professional charges. It was noted that approximately 1% of the claims in the database had the charges and paid amounts recorded as “$0” or “$0.01”, and it was determined that these were claims that exceeded an annual individual stop-loss amount for that member. In these cases, the commercial health plan reallocated the claim to a stop-loss policy and the excess amount is shown as “0”, but all other claim details remain in the data. These cases were excluded from the cost analysis because the total cost could not be computed.
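
One way to express this exclusion is sketched below. It assumes that a person with any claim recorded at $0 or $0.01 is dropped from the cost analysis entirely; the text does not say whether the exclusion was applied per claim or per person, so this is an interpretation. The function name and the input layout of (person identifier, amount) pairs are hypothetical.

```python
from collections import defaultdict
from decimal import Decimal
from typing import Dict, Iterable, Tuple

# Amounts that the text describes as markers of stop-loss reallocation.
STOP_LOSS_MARKERS = {Decimal("0"), Decimal("0.01")}


def person_costs(claims: Iterable[Tuple[str, str]]) -> Dict[str, Decimal]:
    """Sum COVID-19 claim amounts per person, dropping stop-loss cases."""
    totals: Dict[str, Decimal] = defaultdict(Decimal)
    excluded = set()
    for person_id, amount in claims:
        value = Decimal(str(amount))
        if value in STOP_LOSS_MARKERS:
            # Total cost cannot be computed for this person (assumption:
            # exclusion is applied at the person level).
            excluded.add(person_id)
        else:
            totals[person_id] += value
    return {pid: total for pid, total in totals.items() if pid not in excluded}


example = [("a", "100.00"), ("a", "50.25"), ("b", "200.00"), ("b", "0.01")]
print(person_costs(example))
```

Using `Decimal` rather than floating point avoids rounding surprises when comparing amounts against the $0.01 marker.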

A generalized gamma regression analysis was used because the univariate relationship between severity level and cost was non-linear. Severity levels were backward difference coded to compare each level to the level directly prior in the regression. A sensitivity analysis was done on only those identified as having COVID-19 after April 2020, when the COVID-19 ICD-10 code was developed. Results were nearly identical to the original regression with the complete population, verifying the measures (Appendix C). This analysis was performed using SAS software, Version 9.4 of the SAS System [12]. This study was reviewed and approved by the Institutional Review Board of the University of Texas Health Science Center at Houston, and all methods were performed in accordance with the relevant guidelines and regulations.
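
For readers unfamiliar with backward difference coding, the sketch below builds the standard contrast matrix for an ordinal factor with k levels. It illustrates the coding scheme in general and is not the authors' SAS code; the function name is hypothetical. With this coding, the coefficient for column j estimates the difference between level j + 1 and level j on the scale of the linear predictor.

```python
from fractions import Fraction
from typing import List


def backward_difference_matrix(k: int) -> List[List[Fraction]]:
    """Return a k x (k - 1) backward difference contrast matrix.

    Row i holds the codes for level i + 1; column j compares level j + 2
    with level j + 1 (zero-based indices).
    """
    rows = []
    for level in range(1, k + 1):
        row = []
        for j in range(1, k):
            if level > j:
                row.append(Fraction(j, k))
            else:
                row.append(Fraction(-(k - j), k))
        rows.append(row)
    return rows


# Small example with four ordered levels.
for row in backward_difference_matrix(4):
    print([str(value) for value in row])
```

Moving from one level to the next changes exactly one column by 1 and leaves the others unchanged, which is why each coefficient isolates a single adjacent-level comparison.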

Data Availability

The database used for the development of this scale was Optum’s Clinformatics® Data Mart (CDM), which is derived from administrative health claims for members of large commercial and Medicare Advantage health plans. The data that support the findings of this study are available from Optum, but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Aggregated data findings may, however, be available from the authors upon reasonable request and with permission of Optum.


  1. Bartsch SM, Ferguson MC, McKinnell JA, O’Shea KJ, Wedlock PT, Siegmund SS, Lee BY. The Potential Health Care Costs And Resource Use Associated With COVID-19 In The United States. Health Aff. 2020;39(6):927–935.

  2. Tsai Y, Vogt TM, Zhou F. Patient Characteristics and Costs Associated With COVID-19–Related Medical Care Among Medicare Fee-for-Service Beneficiaries. Ann Intern Med.

  3. Miethke-Morais A, Cassenote A, Piva H, et al. COVID-19-related hospital cost-outcome analysis: the impact of clinical and demographic factors. Braz J Infect Dis. 2021;25(4):101609.


  4. Huespe I, Carboni Bisso I, Di Stefano S, Terrasa S, Gemelli NA, Las Heras M. COVID-19 Severity Index: A predictive score for hospitalized patients [published online ahead of print, 2020 Dec 29]. Med Intensiva (Engl Ed). 2020;S0210-5691(20)30396-X.

  5. Altschul DJ, Unda SR, Benton J, et al. A novel severity score to predict inpatient mortality in COVID-19 patients. Sci Rep. 2020;10:16726.


  6. WHO Working Group on the Clinical Characterisation and Management of COVID-19 infection. A minimal common outcome measure set for COVID-19 clinical research. Lancet Infect Dis. 2020 Aug;20(8):e192-e197. doi: 10.1016/S1473-3099(20)30483-7. Epub 2020 Jun 12. Erratum in: Lancet Infect Dis. 2020 Oct;20(10):e250. PMID: 32539990; PMCID: PMC7292605.

  7. Haimovich AD, Ravindra NG, Stoytchev S, van Dijk D, Schulz WL, Taylor RA. Development and Validation of the Quick COVID-19 Severity Index: A Prognostic Tool for Early Clinical Decompensation. Ann Emerg Med. 2020;76(4):442–453.

  8. Sen P, Yamuna TK, Candela S, et al. Burden and characteristics of COVID-19 in the United States during 2020. Nature. 2021;598:338–341. Accessed October 26, 2021.

  9. Majumder MS, Rose S. Health Care Claims Data May Be Useful For COVID-19 Research Despite Significant Limitations. Health Affairs Blog. October 6, 2020.

  10. Centers for Disease Control and Prevention (CDC). ICD-10-CM Official Coding Guidelines – Supplement: Coding encounters related to COVID-19 Coronavirus Outbreak. Effective February 20, 2020.

  11. Centers for Disease Control and Prevention (CDC). ICD-10-CM Official Guidelines for Coding and Reporting, FY 2021 (October 1, 2020 – September 30, 2021). Updated January 1, 2021.

  12. SAS Institute Inc. SAS/ACCESS® 9.4 interface to ADABAS: reference. Cary, NC: SAS Institute Inc; 2013.





No funding was received for this study.

Author information

Authors and Affiliations



Trudy Millard Krause and Raymond Greenberg, along with Lopita Ghosh, conceptualized the scale and wrote the main manuscript text. Caroline Schaeffer, Gina Hansen, and Joseph Wozny conducted data analyses, contributed to the data interpretation and methods section, and prepared the tables. All authors reviewed the manuscript.

Corresponding author

Correspondence to Trudy Millard Krause.

Ethics declarations

Ethics approval and consent to participate

Not applicable; the data were statistically de-identified and are considered administrative data. Research permission was granted by the IRB committee of The University of Texas Health Science Center at Houston, and all methods were performed in accordance with the relevant guidelines and regulations.

Consent for publication

Not Applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article


Cite this article

Krause, T.M., Greenberg, R., Ghosh, L. et al. COVID-19 severity scale for claims data research. BMC Health Serv Res 23, 402 (2023).
