Databases
Three administrative health databases were used. We restricted our study population to residents of the Calgary Health Region (CHR), Alberta, Canada, during fiscal years 1998/99 through 2001/02. As of March 2002, the CHR's population was approximately 1.1 million. Infants younger than one year were excluded because of missing or inaccurate variables such as name. For assessment of the linkage between Vital Statistics and the population registry, individuals in Vital Statistics were defined as CHR residents based on the Standard Geographical Classification (SGC), which classifies residential areas based on address or community [14]. Of the 21679 deaths that occurred in the CHR during 1998/99 to 2001/02, 99 were excluded because of an unknown SGC and 1358 because the decedent was not a CHR resident, leaving 20222 for analysis. For assessment of the linkage between Vital Statistics and hospital discharge data, CHR residents were defined using the residential postal codes recorded in the hospital discharge data.
The first database was the Alberta Health Care Insurance Plan Registry (also called the population registry) for the CHR for fiscal years 1998/99 to 2001/02. This registry contained demographic information on health care recipients. Canada has a government-financed universal health insurance system. All permanent residents are covered by the provincial health insurance plan except for Registered First Nations, prison inmates, and members of the military and the Royal Canadian Mounted Police, all of whom are the responsibility of the federal government. All eligible Alberta residents are assigned a unique lifetime Personal Health Number (PHN), which makes the PHN an ideal variable for record linkage. The Alberta insurance registry is nearly complete and consistent, and is used as a proxy for the population of Alberta.
The second database was the Alberta Vital Statistics registry for fiscal years 1998/99 to 2001/02. Information in the death registry was derived from the Death Registration form, the medical certificate of death, and the medical examiner's certificate of death (where appropriate). The registry captured deaths that occurred within Alberta but missed Alberta residents who died outside Alberta. Provincial and territorial Vital Statistics registry data on all deaths were submitted to Statistics Canada annually.
The third database was the hospital discharge abstracts for fiscal years 1998/99 to 2001/02. Abstracts were filed for all inpatient discharges from all hospitals in the CHR. Professional coders reviewed inpatient charts and extracted data on PHN, demographics, diagnoses, procedures, physician specialty, and vital status at discharge (alive or dead).
Common identifiers among three databases
We selected surname, first name, sex, and date of birth as our common identifiers because they were less likely to change over time than other identifiers such as address. We assessed three different combinations of these four identifiers: (1) surname (i.e. surname at birth or marital surname as recorded in these three databases), sex, and date of birth (i.e. month, day, and year of birth); (2) first name, sex, and date of birth; (3) surname, first name, sex, and date of birth.
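The three combinations can be sketched as match keys built from the same record. This is a minimal illustration only: the field names (`surname`, `first_name`, `sex`, `dob`) and the two example records are hypothetical, since the source databases' actual layouts are not described here.

```python
from datetime import date

# Hypothetical field names; each approach compares a different subset of identifiers.
APPROACHES = {
    1: ("surname", "sex", "dob"),
    2: ("first_name", "sex", "dob"),
    3: ("surname", "first_name", "sex", "dob"),
}


def match_key(record, approach):
    """Return the tuple of identifier values compared under one approach."""
    return tuple(record[field] for field in APPROACHES[approach])


# Two made-up records that differ only in surname (e.g. birth vs. marital surname).
a = {"surname": "SMITH", "first_name": "ANNE", "sex": "F", "dob": date(1930, 5, 17)}
b = {"surname": "JONES", "first_name": "ANNE", "sex": "F", "dob": date(1930, 5, 17)}

# For each approach, do the two records agree on every identifier in the key?
agree = {k: match_key(a, k) == match_key(b, k) for k in APPROACHES}
```

In this toy case only approach two treats the records as a match, which illustrates why keys that include surname are stricter than the key that omits it.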
To perform record linkage, the common identifiers were formatted in the same way across the three databases by capitalizing letters and removing blanks and dashes. One common problem with using name as an identifier was that an individual's name could be represented in many different ways; alternate spellings, initials, abbreviations, and shortened forms of names made the linkage difficult. To deal with this issue, a Soundex coding method was employed to link names that failed to match exactly because of variant spellings in the two databases [13]. The Soundex algorithm associates numbers with different groups of consonants, producing a code consisting of the initial letter followed by digits that is robust to variations among names that sound alike.
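The formatting and phonetic coding steps can be sketched as follows. The text does not say which Soundex variant was used, so this example assumes the standard American Soundex (one letter plus three digits, with H and W not separating repeated codes); the function names `normalize_name` and `soundex` are illustrative.

```python
# Consonant groups and their digits under standard American Soundex.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def normalize_name(name):
    """Capitalize letters and remove blanks and dashes."""
    return name.upper().replace(" ", "").replace("-", "")


def soundex(name):
    """Return the 4-character Soundex code of a name ('' if it has no letters)."""
    letters = [c for c in normalize_name(name) if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit:
            if digit != prev:      # adjacent letters with the same code count once
                code += digit
            prev = digit
        elif ch not in "HW":       # vowels and Y separate repeated codes; H and W do not
            prev = ""
        if len(code) == 4:
            break
    return code.ljust(4, "0")      # pad short codes with zeros


same_sound = soundex("Smith") == soundex("Smyth")
```

Variant spellings such as "Smith" and "Smyth" receive the same code, so a comparison on the Soundex code can succeed where an exact string comparison fails.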
Correct linkage among the linked records was assessed by checking whether the unique PHN from the different sources was identical within the matched record. We excluded matched records without a PHN in both files. Although this identifier was complete in both the hospital discharge data and the population registry, only 70% of the records in the Vital Statistics registry had a valid PHN, because it was not a mandatory variable.
Deterministic record linkage
Record linkage and correct linkage evaluation between vital statistics and population registry
We linked the Vital Statistics registry with the population registry three times, once with each of the three identifier combinations (approaches one to three; see Figure 1). In the process of linkage, the Vital Statistics registry was used as the master file. The linkage rate between the population registry and Vital Statistics files was calculated using the number of records in the Vital Statistics registry file as the denominator (N in Figure 1) and the number of linked records as the numerator (n in Figure 1). About 2% of Vital Statistics records were matched with more than one population registry record (i.e., duplicate records) for approach one, about 6% for approach two, and less than 0.1% for approach three.
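A minimal sketch of this kind of deterministic linkage is given below, using made-up records. The function name `link`, the field names, and the decision to count a master record with several matches as linked (while also counting it as a duplicate) are assumptions of the example, not details confirmed by the text.

```python
from collections import defaultdict


def link(master, registry, key):
    """Exact-match each master record to registry records sharing the same key.

    Returns (linked, linkage_rate, duplicate_share), where `linked` maps the
    index of a master record to the list of registry records it matched.
    """
    index = defaultdict(list)
    for rec in registry:
        index[key(rec)].append(rec)

    linked, duplicates = {}, 0
    for i, rec in enumerate(master):
        hits = index.get(key(rec), [])
        if hits:
            linked[i] = hits
            if len(hits) > 1:          # matched more than one registry record
                duplicates += 1
    n_master = len(master)             # denominator: records in the master file
    return linked, len(linked) / n_master, duplicates / n_master


def approach_one(rec):
    """Hypothetical key: surname, sex, and date of birth."""
    return (rec["surname"], rec["sex"], rec["dob"])


deaths = [
    {"surname": "SMITH", "sex": "F", "dob": "1930-05-17"},
    {"surname": "LEE", "sex": "M", "dob": "1941-01-02"},
    {"surname": "BROWN", "sex": "M", "dob": "1925-11-30"},
    {"surname": "KHAN", "sex": "F", "dob": "1950-07-09"},
]
registry = [
    {"surname": "SMITH", "sex": "F", "dob": "1930-05-17", "phn": "100"},
    {"surname": "LEE", "sex": "M", "dob": "1941-01-02", "phn": "200"},
    {"surname": "LEE", "sex": "M", "dob": "1941-01-02", "phn": "201"},
    {"surname": "KHAN", "sex": "F", "dob": "1950-07-09", "phn": "400"},
    {"surname": "WONG", "sex": "M", "dob": "1960-03-03", "phn": "500"},
]

linked, linkage_rate, duplicate_share = link(deaths, registry, approach_one)
```

With these toy data, three of the four death records are linked and one of the four matches two registry records.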
To assess correct linkage, we checked whether the PHN obtained from the population registry was identical to the PHN recorded in the Vital Statistics registry among the linked records. Specifically, we restricted the linked records to those with a valid PHN and calculated the proportion of these records in which the Vital-PHN matched the population-PHN. Vital Statistics records matched to duplicate population registry records were defined as incorrect links, regardless of whether PHNs were present. The correct linkage rate was calculated as the ratio of linked records with matched PHNs to linked records with a valid Vital-PHN (i.e. the ratio of nb to na in Figure 1).
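The calculation can be sketched as below. The input layout (one pair of Vital-PHN and matched registry PHNs per linked record) and the function name are hypothetical. The sketch also assumes that duplicate matches with a valid Vital-PHN remain in the denominator and count as incorrect; how duplicates without a PHN enter the counts in Figure 1 is not stated, so they are simply left out here.

```python
def correct_linkage_rate(linked):
    """Compute (n_b, n_a, n_b / n_a) from linked records.

    `linked` holds one (vital_phn, [registry_phn, ...]) pair per linked
    Vital Statistics record; vital_phn is None when no valid PHN was recorded.
    """
    n_a = n_b = 0
    for vital_phn, registry_phns in linked:
        if not vital_phn:
            continue                   # no valid Vital-PHN: link cannot be verified
        n_a += 1
        # Correct only if there is exactly one match and its PHN is identical.
        if len(registry_phns) == 1 and registry_phns[0] == vital_phn:
            n_b += 1
    rate = n_b / n_a if n_a else float("nan")
    return n_b, n_a, rate


# Made-up linked records for illustration.
linked_pairs = [
    ("100", ["100"]),          # PHNs agree: correct link
    ("200", ["999"]),          # PHNs differ: incorrect link
    (None, ["300"]),           # no Vital-PHN: excluded from the assessment
    ("400", ["400", "401"]),   # duplicate match: defined as incorrect
    ("500", ["500"]),          # PHNs agree: correct link
]

n_b, n_a, rate = correct_linkage_rate(linked_pairs)
```

Here four linked records have a valid Vital-PHN and two of them are correct links, giving a correct linkage rate of 0.5.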
Record linkage and correct linkage evaluation between in-hospital death records and vital statistics files
We first selected in-hospital death records in the hospital discharge data. Because our version of the hospital discharge data did not contain names, we linked these records with the population registry using the PHN, which was present for all records in both files, to retrieve the surname and first name from the population registry. In the matching process, the in-hospital death records were used as the master file. We matched the in-hospital deaths to the Vital Statistics deaths using each of the three approaches (see Figure 2). The number of death records in the hospital discharge data (N in Figure 2) was used as the denominator to calculate the linkage rate. Across the three approaches, less than 0.1% of in-hospital death records had more than one match in the Vital Statistics files. As before, those duplicate matches were defined as incorrect links. Correct linkage was assessed by checking whether the in-hospital PHN matched the Vital-PHN within the linked record. The correct linkage rate was calculated as nb divided by na (see Figure 2).