Empirical aspects of record linkage across multiple data sets using statistical linkage keys: the experience of the PIAC cohort study

Karmel, Rosemary; Anderson, Phil; Gibson, Diane; Peut, Ann; Duckett, Stephen; Wells, Yvonne

doi:10.1186/1472-6963-10-41

Research article
Open access
Published: 18 February 2010

Empirical aspects of record linkage across multiple data sets using statistical linkage keys: the experience of the PIAC cohort study

Rosemary Karmel^1,2,
Phil Anderson^1,2,
Diane Gibson¹,
Ann Peut^1,2,
Stephen Duckett³ &
…
Yvonne Wells⁴

BMC Health Services Research volume 10, Article number: 41 (2010) Cite this article

5738 Accesses
30 Citations
Metrics details

Abstract

Background

In Australia, many community service program data collections developed over the last decade, including several for aged care programs, contain a statistical linkage key (SLK) to enable derivation of client-level data. In addition, a common SLK is now used in many collections to facilitate the statistical examination of cross-program use. In 2005, the Pathways in Aged Care (PIAC) cohort study was funded to create a linked aged care database using the common SLK to enable analysis of pathways through aged care services.

Linkage using an SLK is commonly deterministic. The purpose of this paper is to describe an extended deterministic record linkage strategy for situations where there is a general person identifier (e.g. an SLK) and several additional variables suitable for data linkage. This approach can allow for variation in client information recorded on different databases.

Methods

A stepwise deterministic record linkage algorithm was developed to link datasets using an SLK and several other variables. Three measures of likely match accuracy were used: the discriminating power of match key values, an estimated false match rate, and an estimated step-specific trade-off between true and false matches. The method was validated through examining link properties and clerical review of three samples of links.

Results

The deterministic algorithm resulted in up to an 11% increase in links compared with simple deterministic matching using an SLK. The links identified are of high quality: validation samples showed that less than 0.5% of links were false positives, and very few matches were made using non-unique match information (0.01%). There was a high degree of consistency in the characteristics of linked events.

Conclusions

The linkage strategy described in this paper has allowed the linking of multiple large aged care service datasets using a statistical linkage key while allowing for variation in its reporting. More widely, our deterministic algorithm, based on statistical properties of match keys, is a useful addition to the linker's toolkit. In particular, it may prove attractive when insufficient data are available for clerical review or follow-up, and the researcher has fewer options in relation to probabilistic linkage.

Peer Review reports

Background

Since the 1980s the Australian Government has implemented numerous reforms to expand the focus of provision of aged care provision from residential care to include a wide range of community care services. Computerised person-level data are collected for administrative purposes for all residential care, but for only some community care programs. To fill this data gap, over the last 10 years several client-level national minimum datasets have been developed to provide regular information on government community service programs [1–3]. However, aged care programs generally have many distinct service providers, both government and non-government, and, for many programs, clients do not have a unique identifier within the program dataset that can be used to identify readily all of a person's program use.

To enable the derivation of client-level data, many of the community service program data collections contain a statistical linkage key (SLK) based on the concatenation of selected letters of name, date of birth and sex. The purpose of the SLK is to enable data linkage for statistical and research purposes while protecting client privacy, and not for client identification for administrative use or case management. In this context, "the process of linking client records does not need to be 100% accurate. Rather, statistical record linkage need only be sufficiently accurate to enable the drawing of statistically valid conclusions" [4].

In the Australian community services datasets the purpose of the SLK is then primarily to allow client-level analysis within particular government programs. However, when developing the datasets it was realised that the use of a common linkage key would greatly facilitate the statistical examination of cross-program use and care pathways. To allow for this possibility, a common SLK is now used in a number of data collections on government community services programs, including those relating to aged care and disability [1–3]. Linking the datasets is not routine, and requires approval from a properly constituted ethics committee (see [5]) and permission from all relevant data custodians.

Data linkage is a powerful tool both for identifying multiple appearances of individuals within a dataset and for integrating client information across datasets. As the information recorded for an individual may vary from dataset to dataset - due to either differences in reporting (e.g. in first name) or errors - a robust linkage process should allow for some discrepancy in reported characteristics. There are two main types of data linkage: probabilistic record linkage in which the linkage of records in two (or more) files is based on the probabilities of agreement and disagreement between a range of match variables, and deterministic record linkage in which the linkage of records is based on exact agreement of match variables.

Probabilistic matching allows for variation in reported characteristics by deriving a measure of similarity across variables used to identify matches, called the match weight. This is then used to decide whether a particular pair-wise comparison between records on two datasets is accepted (high weight) or rejected (low weight) as a match, or link [6, 7]. Clerical review of possible record matches is often used to decide both the total weight above which record pairs are acceptable as a match and to determine whether matches with weights near this boundary should be considered to be valid [8–11]. Less commonly, a cut-off point for accepting matches is estimated using statistical models of probabilistic linkage, circumventing the need for clerical review [12]. In this approach detailed limited clerical follow-up may also be desirable to validate the process.

Simple deterministic linkage cannot allow for variation in reporting. However, deterministic algorithms can be constructed which can, and "[A]n intricate deterministic algorithm can be as successful - or more successful - than probabilistic algorithms in identifying valid links" [8].

Irrespective of method, when linking any pair of records four outcomes are possible: a true match (true positive), no match (true negative), a mis-match (false positive) and a missed match (false negative). In any linkage study, false negatives are caused by inconsistent reporting (or non-reporting) of data items across different datasets (i.e. data quality/consistency issues), while false positives are caused by different people, either rightly or wrongly, having common linkage data. False negatives are more likely to occur in simple deterministic matching than in probabilistic matching because client information may change depending on who provides the data and when the information is collected [13].

When linking using an SLK, additional data may be available that could assist record matching, with the auxiliary information providing a platform from which variation in the reported SLK information can be considered. The purpose of this paper is to describe a stepwise deterministic record linkage strategy for situations where there is a general person identifier (e.g. an SLK) and several additional variables suitable for data linkage. The auxiliary data may be available for nearly all clients or for a particular subset. This approach can allow for variation in client information across large databases in health and community care, and is demonstrated for the Pathways in Aged Care (PIAC) cohort study.

Methods

Current context: Pathways in Aged Care (PIAC) cohort study

Over a period people may access several community care programs and/or residential aged care. Coordination of these aged care services is important both to provide services cost-effectively and to provide the appropriate care for people at the appropriate time [14]. Between 2001-02 and 2005-06, four key programs in Australia accounted for around 85% of government expenditure on programs delivering community aged care (excluding assessment services) [15]. Datasets for these key community care programs, along with those for aged care assessments, residential aged care and deaths, are included in the PIAC study (see Table 1 for a brief description).

Table 1 Data in the PIAC project

Full size table

Client level data became available nationally for all the main aged care and assessment programs with the implementation in April 2003 of the person-level Aged Care Assessment Program (ACAP) national minimum dataset Version 2 [2]. However, the data for the various programs are held on different datasets so that analyses have continued primarily to be program-specific [16–19].

The aged care datasets do not share a common unique person identifier, and most do not contain full name data. Nevertheless, all either explicitly contain or have sufficient information to derive a common statistical linkage key termed SLK-581. This key, first proposed for the dataset for the Home and Community Care program [20], is the concatenation of five selected letters of name, eight digit date of birth and sex. Analyses have shown that SLK-581 distinguishes well between individuals in aged care datasets [20–22]. Consequently, the datasets can be linked using this SLK. In addition, depending on the datasets being matched, there are common data items - such as area of usual residence and event data - that could be used to aid the record linkage.

In 2005, a research team centred at the Australian Institute of Health and Welfare (AIHW) successfully applied for a National Health and Medical Research Council Strategic Award to undertake the PIAC cohort study. The purpose of this project is to create a linked dataset using data from the main aged care and assessment programs and to then undertake analyses of pathways in aged care over 24 months from the time of aged care assessment in 2003-04.

For the PIAC study, the cohort of interest is people who had a completed aged care assessment reported on the 2003-04 ACAP national dataset Version 2 (just over 105,000 people). The study required linking 10 datasets covering six aged care programs and deaths (Table 1). Deaths data were included to establish whether and when cohort members died within the study period. To be able to identify program use both before and after the reference year (2003-04), service use data from 2002-03 to 2005-06 were included. Aged care reassessments for 2004-05 were also identified.

Before data linkage was undertaken, ethics approval and permission to use the required data were obtained from all relevant bodies. In addition, to protect the privacy of individuals, all linkage was carried out within the AIHW using the Institute's data linkage protocol [23].

Basic strategy

Data linkage for the PIAC cohort study was undertaken using multiple deterministic match passes in conjunction with an algorithm for identifying suitable match keys and the order in which they should be used. The deterministic match keys were based on (but not limited to) components of the common SLK. This approach was chosen because it does not rely on clerical review using name and/or address data - information which is not available on many of the datasets in the PIAC project.

To avoid unnecessary matching processes, a staged approach was employed which progressively linked the datasets two at a time (Figure 1). The order of linking the datasets was based on the availability of additional data for linkage and the quality of the linkage data. For PIAC, there were seven linkage stages in all, with match rates estimated to range from 3% to over 60%, depending on the stage (based on one-step deterministic matching using SLK-581, Table 2).

Table 2 Linkage stages in the PIAC project

Full size table

The stepwise deterministic linkage strategy for each stage consisted of four phases. These are outlined below using stage 3 of the PIAC linkage to illustrate the processes (Figure 1). Stage 3 matches integrated residential care and community care packages (RCCP) data to the PIAC cohort as identified through the 2003-04 ACAP dataset. Simple deterministic matching using SLK-581 showed that at least 64% of the PIAC cohort would link to the RCCP dataset (Table 2).

Phase 1: Client identification within the two datasets

Individual clients are identified differently on the various datasets, depending on whether or not the dataset contains an administrative program client identifier. The RCCP dataset was derived in stage 1 from datasets with program client identifiers, and these were used to identify individuals. A small number of duplicate client records in the administrative datasets were identified and removed [21].

The ACAP dataset used to define the PIAC cohort does not contain either a unique program client identifier or name information, and so clients were defined using SLK-581 in conjunction with broad region of usual residence (first digit of postcode, which denotes Australian state or territory).

Phase 2: Identifying data for matching the specific dataset pair

The data to be used for linking are context-dependent. For stage 3, three additional variables were identified for matching in addition to SLK-581: postcode of residence, aged care assessment date and assessment team identifier.

The RCCP data can contain several postcodes for the same client over a year. When linking RCCP to the PIAC cohort two postcodes were allowed for: one community postcode and one relating to a residential care facility that the client had been in for permanent care during 2003-04.

Phase 3: Identification of keys to use in matching

Match keys to be used for linking were defined in terms of the SLK-581 (divided into five components) and additional available linkage data - postcode, aged care assessment date and assessment team identifier for stage 3. Many combinations of these eight variables could be used to define match keys (see, for example, Table 3 just using SLK-581 and region). To ensure that any employed match keys were based on combinations which both discriminated well between individuals and would not introduce too many false positives, a key was identified as suitable for matching using three criteria:

Discriminating power: 97.5% clients within each dataset had to have a unique value for the match key.

Likelihood of introducing false matches: The estimated theoretical false match rate for links established using the match key could be no more than 0.5%.

Trade-off between additional true and additional false matches: The estimated theoretical trade-off between additional true and additional false matches made with the key given matches already identified had to be at least 2 to 1.

Table 3 128 keys based on components of SLK-581 and region

Full size table

The first of these criteria limits the testing of suitability to those keys that distinguish between individuals with reasonably high probability, while the second and third criteria ensure that an employed key adds few false matches given any matches which have been made in earlier passes. Appendix A provides details of the construction of measures used to implement these criteria, and presents estimates for stage 3.

Any combination of SLK-581 elements and additional match variables (i.e. potential match keys) that met all of the above criteria was used for matching the two datasets. Overall, 115 match keys were selected to match the PIAC cohort to the RCCP dataset, including 19 without ACAP assessment data.

Phase 4: Stepwise matching using selected match keys

Using the selected match keys, stepwise linkage was then carried out, with order of use determined by the discriminating power of the keys (going from high to low). Variation in match key elements identified through previous stages was also incorporated into match steps where relevant, resulting in the use of 215 versions of the 115 selected match keys. All links identified by the selected match keys were accepted as valid, with the exception of duplicate matches. In this case, a match was selected at random.

Validation

Three methods were used to validate the stepwise deterministic linkage strategy. First, the quality of the data used to identify links was examined. Second, we analysed the distribution of identified links across the selected match keys and the consistency between client characteristics for linked pairs. Finally, clerical review of three random samples of links was undertaken to estimate the level of false matches.

Results

For stage 3 of the PIAC linkage 72.6% of the PIAC cohort dataset linked to the RCCP dataset, corresponding to 18.4% of the RCCP dataset (Table 2). Comparing the achieved match rates with those based on simple matching on SLK-581 indicates that using extra information in the match keys (i.e. non SLK-581 data) allowed for variation in reported SLK-581 when matching. The quality of these links depends both on the quality of the data used to establish matches and on the ability of these data to distinguish between individuals. In the discussion below, the quality of the data used in the linkage is examined before analysing the results from the linkage process. The validity of established links is then discussed.

Quality of match data

The presence of missing data reduces the likelihood of identifying true matches. The number of missed matches will also be relatively high if there are unreliable data on either of the datasets. For both datasets in stage 3 more than 97.8% of clients had complete SLK-581 data. However, the PIAC cohort dataset was more likely to have missing SLK-581 elements than RCCP data (2.1% versus 1.1%). In both datasets over 99.9% of client records with complete SLK-581 data had a unique SLK-581.

The availability of additional data (i.e. other than SLK-581) necessarily varies depending on the datasets being matched. For stage 3, 98.2% of the PIAC cohort dataset had valid postcode data. In the RCCP dataset, 99.0% of clients had at least one community postcode and 45.7% had a postcode relating to a residential care facility that the client had been in for permanent care during 2003-04.

Aged care assessment date and team data were available for all clients in the PIAC cohort dataset. On the other hand, many RCCP clients did not have an aged care assessment in 2003-04, and so these data were available for just 27% of all RCCP clients.

Data linkage

Not all match keys identify links. Links between the PIAC cohort and RCCP data were made by 160 match keys out of the 215 used (including versions using previously identified SLK-581 and region variations).

The main purpose of a match key is to establish links. However, a close look at the list of keys used for stage 3 (see Table 4 and Table 5) suggests that many of the links would have been established using less discriminating - but still suitable - keys. Applying all selected keys in order has two advantages. First, using highly discriminating keys obviates the need to choose between records for clients with non-unique linkage information, thereby improving match quality. For stage 3, only 0.01% of matches were made between clients with non-unique match information for the identifying match key on either the PIAC cohort or RCCP datasets.

Table 4 Criteria for selecting match keys for PIAC stage 3 (examples for 60 keys)

Full size table

Table 5 Match results from stepwise matching for first 50 steps for PIAC stage 3

Full size table

Second, the range of elements included in keys can be used to show the relative strength of identified matches. Nearly 60% of stage 3 matches were made with the most detailed (i.e. first) match key. Also, while assessment event data were available for just over one-quarter of the RCCP clients, match keys incorporating these data identified 89% of the matches; however, only 1 in 50 matches needed this data to make the match. Furthermore, all but 0.4% of matches were made using keys for which the proportion of clients with a unique value was over 99.9%. These results indicate that the matches are highly reliable.

Using auxiliary information in the linkage increased the match rate by 11% for stage 3, with just over 90% of links between the PIAC cohort and RCCP datasets matching exactly on SLK-581. Just over 2% of matches were made using SLK-581 and region variations identified in stages 1 and 2. Also, 1.7% of links were for PIAC cohort members with incomplete SLK-581 information on the source ACAP dataset.

Data on region of residence was critical in identifying the additional matches, reflecting the high availability of this information, the strong discriminating power of residence in small regions and the relatively low availability of event data. Overall, 98% of the PIAC-RCCP matches could have been made by match keys which only used components of SLK-581 and region of residence.

Linkage validation

Identified matches were consistent with other client information. Nearly 98% of the PIAC cohort whose assessment on the ACAP 2003-04 dataset was reported as being in residential care linked to the RCCP dataset. Also, there were few inconsistent links between the PIAC cohort, RCCP and deaths data: only 0.16% of the PIAC cohort had a link to a death record with a date of death before either an aged care assessment or program use identified by linking to RCCP data.

As permitted under project ethics approvals, link quality was explicitly investigated using the name information available on the deaths and RCCP datasets to review identified links clerically. Lack of full name on the PIAC cohort dataset prevented the RCCP to cohort linkage from being used for the comparisons.

Three samples of around 1000 RCCP-deaths links (from stage 2) were randomly selected and the full name, date of birth, sex, postcode and event data from the two data sources were manually compared to identify any false positives. Very few false matches were identified in these samples: 0, 1 and 5. This last was for the sample selected from among the 16128 links in which the SLK-581s of matched records were different. Because of the SLK differences, these links were considered to be more likely to be of low quality; however, even in this sample just 0.5% were identified as false positives.

Discussion

Variation in linkage data when identifying matches deterministically was made possible by employing a set of three criteria for judging whether a combination of variables could be used to identify matches without resulting in too many false matches. A measure of a key's discriminating power was used to select a set of match keys to be investigated for use for linking two datasets. An estimated false match rate for each of these keys was then calculated and only match keys with an estimate under 0.5% were retained. Sometimes keys were further excluded on the basis that the estimated number of additional false matches was too high relative to the estimated number of additional true matches when allowing for matches made in earlier passes. All links identified by the selected match keys were accepted as valid, with the exception of the very small number of duplicate matches.

The occurrence of false negatives and false positives when matching between datasets is an issue faced by all studies using data linkage. Under-identification of matches is an expected limitation of linking deterministically using a single SLK, particularly when elements of the key may be missing or set to a default value when unknown. For the PIAC project, the occurrence of missed links was reduced through the use of methodical stepwise linkage using additional data items in combination with the SLK. This facilitated matches when the SLK had been reported inconsistently for individuals, including people with partially missing data in some datasets. In addition, by using the most discriminating match keys first, the algorithm aided correct matching in the small number of cases where different people had identical SLKs. These issues are particularly important for studies covering several years and multiple datasets, as people can report their name or date of birth differently on different occasions [13]. Data on the commonly available variable region of residence was critical in both situations.

The match key selection process and stepwise deterministic matching decreased the chances of making false matches by limiting match keys to those highly likely to lead to true matches, and by reducing the likelihood of duplicate matches. That this approach was successful is indicated by the high level of consistency observed in identified links and the very low false match rates seen in the clerical review of test samples.

The importance of allowing for variation in reported personal information (i.e. in the SLK) was demonstrated by the number of matches made using either identified SLK variations or reduced SLK data in conjunction with other information. Ten per cent of matches made in PIAC linkage stage 3 used a subset of the components of the SLK in combination with other variables, and 2.1% used SLK and region variations identified in earlier linkage stages. This shows that there is considerable variation in how people report their name and date of birth, with both the setting and reason for reporting personal information affecting how it is recorded on each occasion.

The amount of variation in the SLK allowed when linking various datasets depended on the availability of additional linkage information. However, additional data may not be available for all clients. The absence of additional linkage information means that less variation in the SLK can be allowed when identifying matches. Consequently, links for clients without data for additional match variables are more likely to be missed than those for other clients. However, in the PIAC linkage, additional information which was commonly unavailable for clients (i.e. event data) was necessary to identify only a small proportion of links (2% in stage 3), so that the overall effect on analyses is expected to be small.

Conclusions

Simple deterministic matching is usually used to link datasets when matching using an SLK. However, such an approach leads to missing matches due to inconsistent reporting of SLK components. The linkage process for the PIAC cohort study has demonstrated that a stepwise deterministic matching algorithm with appropriate stopping rules can be used to allow for variation in reported personal information.

Three criteria were developed for judging whether a specific combination of variables could be used as a match key to identify matches without resulting in too many false matches. All three measures are readily derivable and provide a means of gauging the likelihood of making false matches and the trade-off between additional true and false matches when reducing the number of variables used to identify a match.

The importance of allowing for variation in reported personal information (i.e. in the SLK) was demonstrated in the current application by the number of matches made using reduced SLK data in conjunction with other information. The very low false match rates seen in sample comparisons using full name data demonstrated that the criteria employed to choose usable match keys restricted the number of false matches. This was confirmed by the high degree of consistency found both in the individual links and in the implied care pathways.

For the PIAC study, the linkage process described in this paper has allowed the compilation of a linked database that will enable a broad range of issues relating to Australian aged care services to be examined for the first time. In particular, it will provide information on the care actually accessed by clients following an assessment for service use and it will allow investigation into factors which influence entry to community or residential care. More widely, the deterministic matching algorithm used to link the various datasets in the PIAC cohort study relied on identifying suitable match keys using a set of criteria concerning their statistical properties. Similar criteria could be used in other linkage projects to allow for variation in client information across large databases in health and community care. Using a deterministic matching algorithm may be attractive in those circumstances when sufficient data are not available for the clerical review or follow-up that a researcher may like to include when using probabilistic linkage.

Appendix A: Identification of match keys for stepwise deterministic matching in PIAC

Successive matches were made using different match keys at each pass, each match key being defined in terms of selected components of SLK-581 and additional available linkage data. Within a PIAC linkage stage, match keys were identified for linking by evaluating a large range of keys based on:

three letters of surname (s3) in SLK-581 (always together)
two letters of given name (g2) in SLK-581 (always together)
day and month of birth (always together)
year of birth
sex
region (using state, full four-digit postcode, or first two digits of postcode)
additional event date for matching
other additional data for matching.

The specific data used for linking at each stage were context-dependent.

There are 128 possible keys based on the presence or absence of the first six elements above, using the three regional groupings separately (Table 3).

Many of the keys listed in Table 3 are too broad to be considered for deterministic linking: on face value alone keys 100 to 128 would not be contemplated. To select keys to be considered for matching, only keys that were estimated to have fewer than four times as many people with non-unique match keys as SLK-581 were investigated. Keys 1 to 20 in Table 3 met this criterion. Using additional variables increases the number of theoretically possible distinct keys and hence the number of potential match keys that meet this criterion. The least specific keys investigated for use when aged care assessment date, assessment team identifier or date of death (in conjunction with region) were available were keys 64, 46 and 74, respectively, in Table 3.

In the absence of clerical review, the decision on whether a particular match key combination could be used for matching two particular datasets was based on three measures:

Measure A

A measure of discriminating power (termed the joint unique key rate, and expressed as a percentage). This is the product of the unique key rates for the two datasets being linked, where the unique key rate is the proportion of records within a dataset that have a unique value for the key in question. A discriminating power of at least 95.0% was required for the key to be considered for matching, equating to a unique key rate of at least 97.5% within each dataset. This measure assumes that rare combinations of variables within a dataset are less likely than more common combinations to result in false matches between datasets.

Measure B

An estimated false match rate (FMR) for links established using the match key. This had to be less than 0.5% for a key to be considered for the matching process.

Measure C

Estimated trade-off between additional true and additional false matches for links established using the match key when compared with matches made by a slightly more precise key. The ratio of additional true to additional false matches had to be at least 2:1 for the key to be used for matching.

All three criteria had to be met by a potential match key for it to be included in the stepwise linking.

The above three measures were calculated prior to full linkage, with the latter two derived by applying the approximation methods outlined in Karmel and Gibson [24]. In the current context, when matching data for program 1 and program 2, FMR was approximated by

where

P is the size of the population (in 1000 s)

r is the usage rate of program 1 as indicated by its dataset (per 1000 people in population P)

α is the proportion of people using program 2 matching to those using program 1 when using a specific match key combination to match

β reflects the number of comparison cells specified by the components of the particular match key being used, allowing for uneven client spread across cells.

The linkage rate α for a specific match key was gauged by using only that key for simple deterministic matching. FMR was then derived (measure B), which also allowed estimation of the trade-off between true and false matches for a specific match key combination when compared with the results from a more exact key (measure C).

Using match keys that met the above criteria, stepwise linkage was then carried out, with order of use determined by the discriminating power of the keys (measure A, going from high to low). Missing information on common elements in a key could not lead to a match as missing data were set to different values in each dataset.

The key selection and stepwise matching processes are illustrated in Tables 4 and 5 for stage 3, which matches RCCP data to the PIAC cohort as identified through the 2003-04 ACAP dataset. For comparative purposes, an estimated 'worst case' false match rate is also presented in Table 4. Overall, 76289 matches were identified in stage 3: 60780 using keys incorporating assessment date, a further 7255 using keys with assessment team identifier, and 8254 using keys which did not include assessment data (Table 5). Under 600 matches (0.8%) were made using keys with an estimated 'worst case' FMR of more than 10% (just 15 matches with an estimated 'worst case' FMR of more than 20%).

Note that measure A is related to the 'global u probability' used in probabilistic matching to give a measure of how likely it is that two values will agree by chance [10]. The global u probability assumes that each value of a variable has the same probability and so is defined as

For the current application, within a dataset the global u for a specific match key is estimated by

where S is the size of the dataset (and assuming no more than 2 duplicates per key). For a key with a uniqueness rate of 97.5% within the dataset, then u ≈ 1/{0.9875*S}.

Note also that the proportions of matches made using keys incorporating a subset of elements of SLK-581 provide an indication of the likelihood of inconsistent reporting of name, sex and date of birth data; that is, they indicate the size of the relevant m probabilities used in probabilistic linkage, where the m probability is the probability that a data element is recorded identically for the same client on two occasions. Matches made using different levels of region data also provide a measure of consistency in reported region of residence.

Abbreviations

AIHW:: Australian Institute of Health and Welfare
ACAP:: Aged Care Assessment Program
FMR:: False match rate
PIAC:: Pathways in Aged Care
RAC:: residential aged care
RCCP:: residential care and community care packages
SLK:: statistical linkage key
SLK-581:: statistical linkage key for a person, being the concatenation of five letters of name (the 2^nd, 3^rd and 5^th letters of the family name, and the 2^nd and 3^rd letters of the given name, substituting '2' for short names), date of birth (ddmmyyyy) and sex (1 for male or 2 for female)

References

Home and Community Care Program minimum data set, 2002-03 annual bulletin. Australian Government Department of Health and Ageing. [http://www.health.gov.au/internet/main/publishing.nsf/Content/28B09A156480B595CA256F1900108686/$File/mds_sb0203.pdf]
Aged Care Assessment Program National Data Repository: Aged Care Assessment Program National Data Repository: minimum data set report, Annual Report 2003-04. 2005, Melbourne: La Trobe University
Google Scholar
Australian Institute of Health and Welfare (AIHW): Disability support services provided under the Commonwealth/State Disability Agreement, national data 1999. Canberra. 2000
Google Scholar
Ryan T, Holmes B, Gibson D: A National Minimum Data Set for Home and Community Care. 1999, Canberra: AIHW, 76.
Google Scholar
National Health and Medical Research Council (NHMRC), Australian Research Council, Australian Vice-Chancellors' Committee: National Statement on Ethical Conduct in Human Research. 2007, Canberra: NHMRC
Google Scholar
Fellegi IP, Sunter AB: A Theory for Record Linkage. Journal of the American Statistical Association. 1969, 1183-1210. 10.2307/2286061. 64
Jaro M: Probabilistic linkage of large public health data files. Statistics in Medicine. 1995, 14: 112-121. 10.1002/sim.4780140510.
Article Google Scholar
Campbell KM: Rule Your Data with The Link King^© (a SAS/AF^® application for record linkage and unduplication). SUGI 30: 2005; Paper 020-030. 2005, Philadelphia: SAS, [http://www2.sas.com/proceedings/sugi30/020-30.pdf]
Google Scholar
Ascential Software Corporation: WebSphere® QualityStage Version 8: User guide. 2006, IBM
Google Scholar
Data Integration Manual. Statistics New Zealand. [http://www.stats.govt.nz/about_us/policies-and-guidelines/data-integration/further-technical-information.aspx]
Herzog T, Scheuren FJ, Winkler WE: Data quality and record linkage techniques. 2007, New York: Springer
Google Scholar
Meray N, Reitsma JB, Ravelli ACJ, Bonsel GJ: Probabilistic record linkage is a valid and transparent tool to combine databases without a patient identification number. Journal of Clinical Epidemiology. 2007, 60: 883-891. 10.1016/j.jclinepi.2006.11.021.
Article PubMed Google Scholar
National Community Services Information Management Group: Statistical data linkage in community services data collections: a report prepared by the Statistical Linkage Key Working Group. 2004, Canberra: AIHW
Google Scholar
Gray L: Two year review of aged care reforms. 2001, Canberra: DHAC
Google Scholar
Hales C, Peut A: Ageing and aged care. Australia's welfare 2007. Edited by: Mathur S, Gibson D. 2007, Canberra: AIHW, table 3.23.
Google Scholar
AIHW: Residential aged care in Australia 2004-05: a statistical overview. 2006, Canberra: AIHW
Google Scholar
AIHW: Community Aged Care Packages in Australia 2004-05. 2006, Canberra: AIHW
Google Scholar
Aged Care Assessment Program National Data Repository: Aged Care Assessment Program National Data Repository: minimum data set report, Annual Report 2004-05. 2006, Melbourne: La Trobe University
Google Scholar
Home and Community Care Program minimum data set, 2003-04 annual bulletin. Australian Government Department of Health and Ageing. [http://www.health.gov.au/internet/main/publishing.nsf/Content/28B09A156480B595CA256F1900108686/$File/mds_annual_04.pdf]
Ryan T, Holmes B, Gibson D: A National Minimum Data Set for Home and Community Care. 1999, Canberra: AIHW
Google Scholar
Karmel R: Data linkage protocols using a statistical linkage key. 2005, Canberra: AIHW
Google Scholar
Karmel R: Transitions between aged care services. 2005, Canberra: AIHW
Google Scholar
Data linkage and protecting privacy: a protocol for linking between two or more data sets held within the Australian Institute of Health and Welfare. Australia. [http://www.aihw.gov.au/dataonline/aihw_privacy_protection_protocols_data_linkage.pdf]
Karmel R, Gibson D: Event-based record linkage in health and aged care services data: a methodological innovation. BMC Health Services Research. 2007, 7: 154-10.1186/1472-6963-7-154.
Article PubMed PubMed Central Google Scholar
Gibson D, Liu Z: Aged care. Australia's Welfare 1993: services and assistance. Edited by: Choi C, Foard G, Gibson D, Madden R, Vaughan G. 1993, Canberra: AIHW, 200-265.
Google Scholar
Gibson D, Holmes B, Liu Z: Aged care. Australia's Welfare 1999: services and assistance. Edited by: Choi C, Gibson D, Goss J, Griffin J, Madden R, Madden R, Maples J, Moyle H, Wilson D. 1999, Canberra: AIHW, 165-213.
Google Scholar
Gibson D, Bowler E, Angus P, Braun P, Mason F: Aged care. Australia's Welfare 2001. Edited by: Madden R, Gibson D, Choi C, Maples J, Madden R. 2001, Canberra: AIHW, 199-257.
Google Scholar
Karmel R, Jenkins A, Angus P, Bowler E, Braun P: Ageing and aged care. Australia's Welfare 2003. Edited by: Gibson D, Madden R, Stuer A. 2003, Canberra: AIHW, 275-329.
Google Scholar
Karmel R, Peut A, Bennett S, Hogan R: Ageing and aged care. Australia's Welfare 2005. Edited by: Gibson D, Abello R. 2005, Canberra: AIHW, 134-201.
Google Scholar
Hales C, Peut A: Ageing and Aged Care. Australia's Welfare 2007. Edited by: Mathur S, Gibson D. 2007, Canberra: AIHW, 77-152.
Google Scholar

Pre-publication history

The pre-publication history for this paper can be accessed here:http://www.biomedcentral.com/1472-6963/10/41/prepub

Download references

Acknowledgements

This study is funded by a National Health and Medical Research Council Health Services Research Grant Round 2 2005 (number 41002). The authors thank the government departments for permission to use their data in this project: the Australian Department of Health and Ageing for permission to use data from the Aged and Community Care Management Information System and Aged Care Assessment Program data; the Australian Department of Veterans' Affairs for permission to use and for preparing Veterans' Home Care data; Home and Community Care Officials (from various state and territory departments and the Department of Health and Ageing) for permission to use Home and Community Care program data; and the Australian Institute of Health and Welfare for permission to use the National Death Index. Thanks also go to Rachel Aalders and Ken Tallis who provided very useful comments on drafts of the paper.

Author information

Authors and Affiliations

Faculty of Health, University of Canberra, Canberra, Australian Capital Territory, Australia
Rosemary Karmel, Phil Anderson, Diane Gibson & Ann Peut
Australian Institute of Health and Welfare, Canberra, Australian Capital Territory, Australia
Rosemary Karmel, Phil Anderson & Ann Peut
School of Population Health, University of Queensland, Brisbane, Queensland, Australia
Stephen Duckett
Lincoln Centre for Research on Ageing, Australian Institute for Primary Care, La Trobe University, Melbourne, Victoria, Australia
Yvonne Wells

Authors

Rosemary Karmel
View author publications
You can also search for this author in PubMed Google Scholar
Phil Anderson
View author publications
You can also search for this author in PubMed Google Scholar
Diane Gibson
View author publications
You can also search for this author in PubMed Google Scholar
Ann Peut
View author publications
You can also search for this author in PubMed Google Scholar
Stephen Duckett
View author publications
You can also search for this author in PubMed Google Scholar
Yvonne Wells
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Rosemary Karmel.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

RK was the principal developer of the linkage strategy, undertook the data linkage and analysis and drafted the manuscript. PA provided statistical advice on developing the linkage strategy. DG and AP designed the study. SD provided advice on research design, particularly in relation to maximizing policy relevance. YW provided advice on the interpretation and use of the ACAP national dataset. All authors viewed and approved the manuscript for publication.

Diane Gibson, Ann Peut contributed equally to this work.

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Karmel, R., Anderson, P., Gibson, D. et al. Empirical aspects of record linkage across multiple data sets using statistical linkage keys: the experience of the PIAC cohort study. BMC Health Serv Res 10, 41 (2010). https://doi.org/10.1186/1472-6963-10-41

Download citation

Received: 22 December 2008
Accepted: 18 February 2010
Published: 18 February 2010
DOI: https://doi.org/10.1186/1472-6963-10-41

Empirical aspects of record linkage across multiple data sets using statistical linkage keys: the experience of the PIAC cohort study

Abstract

Background

Methods

Results

Conclusions

Background

Methods

Current context: Pathways in Aged Care (PIAC) cohort study

Basic strategy

Phase 1: Client identification within the two datasets

Phase 2: Identifying data for matching the specific dataset pair

Phase 3: Identification of keys to use in matching

Phase 4: Stepwise matching using selected match keys

Validation

Results

Quality of match data

Data linkage

Linkage validation

Discussion

Conclusions

Appendix A: Identification of match keys for stepwise deterministic matching in PIAC

Measure A

Measure B

Measure C

Abbreviations

References

Pre-publication history

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Competing interests

Authors' contributions

Authors’ original submitted files for images

Authors’ original file for figure 1

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Health Services Research

Contact us