Record linkage is a set of methodologies designed to bring together information relating to the same person from within or across datasets [1]. This technique is widely used for conducting longitudinal observational health research [2]. The process of record linkage typically involves the comparison of personally identifying information such as name, address and date of birth contained in these records.
Studies using linked data will generally contain linkage error. There are two types of record linkage error: false positives, where two records are designated as belonging to the same individual when in truth they do not; and false negatives, where two records are designated as belonging to different individuals when in truth they belong to the same individual. Linkage error can occur either due to legitimate changes in a person’s particulars (e.g. a change of address) or due to data fields being missing or in error (e.g. poor recording practices in the data collection in question). Researchers using linked data will generally not have the personally identifying information made available to them for privacy reasons [3]. This means they are unable to evaluate the accuracy or quality of the linkage directly and instead must rely on the quality information provided by the organisation that performs the linkage.
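The two error types can be made concrete with a minimal sketch that compares the record pairs implied by a linkage result against the pairs implied by a known truth. The record identifiers, cluster contents and function names below are hypothetical and chosen only for illustration; they are not taken from any dataset or software described in this paper.

```python
from itertools import combinations


def pairs_from_clusters(clusters):
    """Return the set of unordered record pairs implied by a grouping of record IDs."""
    pairs = set()
    for members in clusters:
        # Sorting gives each pair a canonical (smaller, larger) ordering.
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs


def linkage_errors(linked_clusters, true_clusters):
    """Return (false_positives, false_negatives) as sets of record pairs."""
    linked = pairs_from_clusters(linked_clusters)
    truth = pairs_from_clusters(true_clusters)
    false_positives = linked - truth  # linked together, but different people
    false_negatives = truth - linked  # same person, but not linked together
    return false_positives, false_negatives


# Hypothetical example: records r3 and r4 were wrongly linked,
# while the true link between r4 and r5 was missed.
linked_clusters = [{"r1", "r2"}, {"r3", "r4"}, {"r5"}]
true_clusters = [{"r1", "r2"}, {"r3"}, {"r4", "r5"}]

fp, fn = linkage_errors(linked_clusters, true_clusters)
print("false positives:", fp)
print("false negatives:", fn)
```

Counting errors at the pair level is only one possible convention; errors could equally be counted per record or per person.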
A number of studies have illustrated how poor linkage quality can cause bias and distort results. A study of the effect of linkage methods (two deterministic strategies and one probabilistic strategy) on mortality rate estimates showed relative differences of up to 25% between the true estimate and that found through record linkage [4]. In a study on child neglect which applied an iterative linkage approach (deterministic followed by probabilistic) to reduce false positives, errors in linkage quality were shown to bias incidence proportions by up to 43% [5]. The incidence proportions were based on the number of children under the age of 6 years from the 2009 births to Alaska residents with at least one multi-source maltreatment report. In a recent study of men with and without HIV, the probabilistic linkage method applied (with estimated sensitivity and specificity of 88.4 and 99.7 respectively) led to a finding of a significantly lower rate of hospitalisation in HIV positive men as compared to the general population (0.46, 0.37–0.58), while improvements in linkage quality revealed the opposite finding: a significantly higher rate of hospitalisation in HIV positive men (1.45, 1.33–1.59) [6].
A key question is whether errors in linkage quality are distributed evenly across a study population. In other words, do the linked records of people from certain subgroups (for instance, people with lower socioeconomic status or individuals in particular ethnic groups) contain a greater proportion of errors? If a certain subgroup was found to have lower linkage quality, this would suggest research results might be systematically biased against this group.
There are a number of plausible reasons why linkage quality may vary between subgroups. Key causes of linkage error are changing or incorrect identifiers [7]. Given the common cultural norm of women changing their surname upon marriage, women may have a higher rate of linkage errors than men. Individuals who are more mobile (that is, change address often), such as younger adults, may also be harder to link correctly, resulting in more errors. Individuals from different ethnic groups may have their name information recorded poorly [8] (for instance, due to difficulty with spelling or differing transliterations), or may use naming conventions different from Western standards (for instance, a very large proportion of Vietnamese women have the same middle name, Thi), which will make their data harder to link. Different ethnic groups may also display different rates of identifier reporting; for instance, in the United States, African American adults are less likely to report their social security number, a highly identifying attribute, making their data harder to link [9]. Recording practices may differ between hospitals, or between types of hospitals, which may service differing constituencies. Of importance to health researchers is knowing whether linkage errors differ by socioeconomic status and by geographic region (metropolitan compared to rural), as these are two key demographic factors known to influence health status [10]. Different health conditions may also be correlated with missing or invalid identifiers; for instance, medical conditions relating to newborns, who may not have recorded first names [11]. A recent study of identifier error in administrative datasets showed the level of error to vary by age, sex and ethnicity [12], indicating the potential for differences in linkage quality across these attributes.
A number of studies have explored the relationship between linkage quality and sociodemographic factors, although typically as part of a wider study. A review by Bohensky et al. [13] found differences in linkage quality by age, sex, ethnicity, geography, socio-economic status and health, although there was little consistency between studies. In general, these studies of variations in linkage quality across or within populations have used the same research design. In nearly all cases, the study method involved the linkage of two datasets, where each dataset contained only one record per person. Each record in the first dataset was expected to be contained within the second dataset. Using this method, records in the first dataset could be divided into two categories: those that matched with a record in the second dataset and those that did not. The unmatched records were then compared against the matched records to determine whether there was any difference in the social and demographic characteristics (e.g. gender, age, socioeconomic status) of these groups [14,15,16,17,18,19,20,21,22,23,24,25,26,27].
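The matched-versus-unmatched comparison described above can be sketched as follows. This is an illustrative outline only: the record structure, the field name "sex" and the function name are assumptions made for the example, and the cited studies each used their own data and statistical methods.

```python
from collections import Counter


def unmatched_rate_by_group(records, matched_ids, group_key):
    """Proportion of first-dataset records in each subgroup with no match found."""
    totals = Counter()
    unmatched = Counter()
    for rec in records:
        group = rec[group_key]
        totals[group] += 1
        if rec["id"] not in matched_ids:
            unmatched[group] += 1
    return {group: unmatched[group] / totals[group] for group in totals}


# Hypothetical first dataset (one record per person) and the IDs of the
# records for which a match was found in the second dataset.
records = [
    {"id": 1, "sex": "F"},
    {"id": 2, "sex": "F"},
    {"id": 3, "sex": "M"},
    {"id": 4, "sex": "M"},
]
matched_ids = {1, 3, 4}

rates = unmatched_rate_by_group(records, matched_ids, "sex")
print(rates)
```

A comparison of this kind observes only whether a match was found for each record; it carries no information about whether the matches that were found are correct.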
Two key issues arise when using this research design to explore bias caused by poor linkage quality. Firstly, the approach focusses only on false negative errors (records which incorrectly did not find a match) and does not consider the issue of false positive errors, an equally important type of linkage error. By only reporting on one of the two error types, we cannot gain an accurate understanding of the relationship between linkage quality and sociodemographic factors.
A second and more fundamental problem in comparing sociodemographic subgroups using this methodology is that we often cannot distinguish whether an individual is unmatched due to linkage error, or because a matching record does not exist. In many of these cases, apparent differences in linkage bias arise from differences in subgroup data capture. Such differences can be expected to be highly dataset dependent. For example, in a recent US study linking a clinical surgical registry to US Medicare inpatient claims, the unmatched individuals in the clinical surgical registry were more likely to be those who did not have coverage under Medicare, rather than individuals whose correct match could not be found [25]. Consequently, differences between matched and unmatched individuals likely reflect the demographics of those who are and are not covered under Medicare. Similarly, in a linkage of a survey cohort to a maternity registry, unmatched records were typically the result of individuals moving outside of the maternity registry catchment area, rather than the result of errors in the linkage process [22]. As such, any bias found is more likely to reflect the different demographics of individuals who emigrate than to reflect linkage quality issues.
In this paper, we attempt to address these methodological shortcomings through the use of an alternate research design. Four large, real-world Australian administrative datasets were de-duplicated, with the results compared to an available ‘truth-set’, allowing comparison of both false positive and false negative errors.