Since the 1980s the Australian Government has implemented numerous reforms to expand the focus of aged care provision from residential care to include a wide range of community care services. Computerised person-level data are collected for administrative purposes for all residential care, but for only some community care programs. To fill this data gap, over the last 10 years several client-level national minimum datasets have been developed to provide regular information on government community service programs [1–3]. However, aged care programs generally have many distinct service providers, both government and non-government, and, for many programs, clients do not have a unique identifier within the program dataset that can be used to readily identify all of a person's program use.
To enable the derivation of client-level data, many of the community service program data collections contain a statistical linkage key (SLK) based on the concatenation of selected letters of name, date of birth and sex. The purpose of the SLK is to enable data linkage for statistical and research purposes while protecting client privacy, and not for client identification for administrative use or case management. In this context, "the process of linking client records does not need to be 100% accurate. Rather, statistical record linkage need only be sufficiently accurate to enable the drawing of statistically valid conclusions" .
In the Australian community services datasets the purpose of the SLK is then primarily to allow client-level analysis within particular government programs. However, when developing the datasets it was realised that the use of a common linkage key would greatly facilitate the statistical examination of cross-program use and care pathways. To allow for this possibility, a common SLK is now used in a number of data collections on government community services programs, including those relating to aged care and disability [1–3]. Linking the datasets is not routine, and requires approval from a properly constituted ethics committee (see ) and permission from all relevant data custodians.
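As an illustration of how an SLK of the kind described above can be built, the following is a minimal sketch. It assumes the SLK-581 layout commonly used in Australian community services collections (2nd, 3rd and 5th letters of family name, 2nd and 3rd letters of given name, date of birth as ddmmyyyy, and a one-digit sex code, with '2' padding short names and '9' marking a missing name). The text does not specify the letter positions, so these details, and the function names `make_slk` and `_pick`, are assumptions made for illustration.

```python
from datetime import date


def _pick(name, positions):
    """Return the letters at the given 1-based positions.

    Assumed conventions: non-alphabetic characters are ignored, a position
    beyond the end of the name is padded with '2', and a missing name is
    coded as all '9'.
    """
    letters = [c for c in (name or "").upper() if c.isalpha()]
    if not letters:
        return "9" * len(positions)
    return "".join(
        letters[p - 1] if p <= len(letters) else "2" for p in positions
    )


def make_slk(family_name, given_name, dob, sex_code):
    """Concatenate selected name letters, date of birth and sex into one key."""
    return (
        _pick(family_name, (2, 3, 5))
        + _pick(given_name, (2, 3))
        + dob.strftime("%d%m%Y")
        + str(sex_code)
    )


# A fictitious client: letters M, I, H from "Smith" and O, H from "John".
print(make_slk("Smith", "John", date(1935, 3, 7), 1))
```

Because the key holds only fragments of the name, it cannot be reversed to recover the client's identity, but two different people can share a key and one person can generate different keys if their name or date of birth is reported inconsistently.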
Data linkage is a powerful tool both for identifying multiple appearances of individuals within a dataset and for integrating client information across datasets. As the information recorded for an individual may vary from dataset to dataset - due to either differences in reporting (e.g. in first name) or errors - a robust linkage process should allow for some discrepancy in reported characteristics. There are two main types of data linkage: probabilistic record linkage, in which the linkage of records in two (or more) files is based on the probabilities of agreement and disagreement between a range of match variables, and deterministic record linkage, in which the linkage of records is based on exact agreement of match variables.
Probabilistic matching allows for variation in reported characteristics by deriving a measure of similarity across the variables used to identify matches, called the match weight. This weight is then used to decide whether a particular pair-wise comparison between records on two datasets is accepted (high weight) or rejected (low weight) as a match, or link [6, 7]. Clerical review of possible record matches is often used both to decide the total weight above which record pairs are acceptable as a match and to determine whether matches with weights near this boundary should be considered valid [8–11]. Less commonly, a cut-off point for accepting matches is estimated using statistical models of probabilistic linkage, circumventing the need for clerical review . Even with this approach, limited but detailed clerical follow-up may be desirable to validate the process.
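To make the idea of a match weight concrete, the sketch below uses the standard Fellegi-Sunter formulation, which the text does not spell out: each variable contributes log2(m/u) when the two records agree and log2((1-m)/(1-u)) when they disagree, where m is the probability of agreement among true matches and u the probability of agreement among non-matches. The field names, the m and u values and the two cut-offs are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical (m, u) probabilities for each match variable.
FIELDS = {
    "surname_letters": (0.95, 0.01),
    "given_letters": (0.90, 0.05),
    "date_of_birth": (0.97, 0.001),
    "sex": (0.99, 0.5),
}


def match_weight(rec_a, rec_b, fields=FIELDS):
    """Sum the agreement and disagreement weights over all match variables."""
    total = 0.0
    for field, (m, u) in fields.items():
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)              # agreement: positive weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement: negative weight
    return total


def classify(weight, lower, upper):
    """Accept, reject, or refer a record pair according to two cut-offs."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "clerical review"
```

A pair that agrees on every variable receives a high positive weight, while a single disagreement (for example on given-name letters) lowers the weight without necessarily ruling the pair out. That tolerance is what exact-agreement matching lacks.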
Simple deterministic linkage cannot allow for variation in reporting. However, deterministic algorithms can be constructed that do allow for it: "[A]n intricate deterministic algorithm can be as successful - or more successful - than probabilistic algorithms in identifying valid links" .
Irrespective of method, when linking any pair of records four outcomes are possible: a true match (true positive), no match (true negative), a mis-match (false positive) and a missed match (false negative). In any linkage study, false negatives are caused by inconsistent reporting (or non-reporting) of data items across different datasets (i.e. data quality/consistency issues), while false positives are caused by different people, either rightly or wrongly, having common linkage data. False negatives are more likely to occur in simple deterministic matching than in probabilistic matching because client information may change depending on who provides the data and when the information is collected .
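The four outcomes can be tallied whenever the true match status of a sample of record pairs is known, for example from clerical review. The sketch below is illustrative only: the function names are hypothetical, and the two summary measures it reports (sensitivity and positive predictive value) are standard linkage quality measures rather than quantities defined in the text.

```python
from collections import Counter


def link_outcome(linked, same_person):
    """Classify one record pair by linkage decision and true status."""
    if linked and same_person:
        return "true positive"    # true match
    if linked and not same_person:
        return "false positive"   # mis-match
    if not linked and same_person:
        return "false negative"   # missed match
    return "true negative"        # correctly left unlinked


def summarise(pairs):
    """Count outcomes for an iterable of (linked, same_person) pairs."""
    counts = Counter(link_outcome(linked, same) for linked, same in pairs)
    tp = counts["true positive"]
    fp = counts["false positive"]
    fn = counts["false negative"]
    # Share of true matches that were found, and share of links that are correct.
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    return counts, sensitivity, ppv
```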
When linking using an SLK, additional data may be available that could assist record matching; this auxiliary information provides a basis for allowing for variation in the reported SLK information. The purpose of this paper is to describe a stepwise deterministic record linkage strategy for situations where there is a general person identifier (e.g. an SLK) and several additional variables suitable for data linkage. The auxiliary data may be available for nearly all clients or for a particular subset. This approach can allow for variation in client information across large databases in health and community care, and is demonstrated for the Pathways in Aged Care (PIAC) cohort study.
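The general shape of a stepwise deterministic strategy can be sketched as follows. This is not the algorithm used for PIAC: the match keys, their order, the field names (`id`, `slk`, `dob`, `sex`, `postcode`) and the rule of accepting only one-to-one matches are all assumptions made for illustration. Each step requires exact agreement on its own key, but later steps use keys that tolerate a discrepancy the earlier steps would reject.

```python
from collections import defaultdict

# Hypothetical match keys, ordered from strictest to most relaxed.
STEPS = [
    ("slk", "postcode"),         # full SLK and auxiliary variable agree
    ("slk",),                    # full SLK agrees, auxiliary data may differ
    ("dob", "sex", "postcode"),  # name letters may differ, auxiliary data agrees
]


def _index(records, key_fields, skip_ids):
    """Group the ids of not-yet-linked records by their value on the match key."""
    index = defaultdict(list)
    for rec in records:
        key = tuple(rec.get(f) for f in key_fields)
        if rec["id"] in skip_ids or any(v is None for v in key):
            continue  # already linked, or key incomplete for this step
        index[key].append(rec["id"])
    return index


def stepwise_link(file_a, file_b, steps=STEPS):
    """Return {id in A: (id in B, step number)} for links made at each step."""
    links = {}
    used_b = set()
    for step_no, key_fields in enumerate(steps, start=1):
        index_a = _index(file_a, key_fields, links.keys())
        index_b = _index(file_b, key_fields, used_b)
        for key, ids_a in index_a.items():
            ids_b = index_b.get(key, [])
            # Accept only unambiguous one-to-one matches at this step.
            if len(ids_a) == 1 and len(ids_b) == 1:
                links[ids_a[0]] = (ids_b[0], step_no)
                used_b.add(ids_b[0])
    return links
```

Recording the step at which each link was made lets the quality of links from the more relaxed steps be reviewed separately from those made on the strictest key.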