Python and PySpark implementation of Goldstein et al.'s Scalelink method of data linkage.

ONSdigital/scalelink

scalelink

An implementation of the probabilistic data linkage method Scalelink, described in Goldstein et al. (2017).

This implementation has been created in response to recommendation six of the Joined Up Data in Government: The Future of Data Linking Methods cross-government review.

This work is authored by the Data Linkage team within the Methodology and Quality Directorate (MQD) of the Office for National Statistics (ONS). This team researches novel data linkage methods and applications to facilitate production of high-quality linked datasets for national statistics.

Pre-requisites

  • Python 3.8-3.10
  • Spark 3.5.1
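The version constraints above can be checked before running anything. The following is a minimal sketch only: the helper names `python_supported` and `spark_supported` are hypothetical and are not part of scalelink, and treating the Spark requirement as an exact match on 3.5.1 is an assumption.

```python
import sys

# Inclusive Python range and Spark version taken from the pre-requisites list.
SUPPORTED_PYTHON = ((3, 8), (3, 10))
REQUIRED_SPARK = "3.5.1"  # assumption: an exact match is intended


def python_supported(version=None):
    """Return True if the (major, minor) version lies in the supported range."""
    version = sys.version_info if version is None else version
    major_minor = tuple(version[:2])
    low, high = SUPPORTED_PYTHON
    return low <= major_minor <= high


def spark_supported(spark_version):
    """Return True if the given Spark version string is the listed one."""
    return spark_version == REQUIRED_SPARK


if __name__ == "__main__":
    print("Python supported:", python_supported())
```

With PySpark installed, `spark_supported(pyspark.__version__)` would perform the corresponding check against the installed PySpark version.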

Method

The Scalelink method is a novel probabilistic data linkage method that uses a scaling algorithm based on correspondence analysis. A key potential advantage is that it exploits dependence between linkage variables. In contrast, the current gold-standard probabilistic data linkage method, the Fellegi-Sunter algorithm, assumes that linkage variables are independent, an assumption that is violated to at least some extent in every linkage problem. For example, forenames are correlated with middle names, surnames, age, gender and home address.
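To make the independence assumption concrete, the sketch below computes a Fellegi-Sunter style match weight for one record pair. It illustrates the Fellegi-Sunter baseline only, not the Scalelink scaling algorithm, and it does not use the scalelink package. The function name `fellegi_sunter_weight`, the field names and the m and u probabilities are all hypothetical values chosen for illustration.

```python
import math


def fellegi_sunter_weight(agreements, m_probs, u_probs):
    """Sum per-variable log2 likelihood ratios for one record pair.

    agreements: dict mapping field name to True (agree) or False (disagree).
    m_probs: P(field agrees | records are a true match).
    u_probs: P(field agrees | records are not a match).

    Adding the per-field terms is valid only if the fields are
    conditionally independent, which is the assumption discussed above.
    """
    weight = 0.0
    for field, agrees in agreements.items():
        m, u = m_probs[field], u_probs[field]
        if agrees:
            weight += math.log2(m / u)
        else:
            weight += math.log2((1 - m) / (1 - u))
    return weight


# Illustrative (made-up) probabilities, not estimated from any dataset.
m_probs = {"forename": 0.90, "surname": 0.95, "birth_year": 0.98}
u_probs = {"forename": 0.01, "surname": 0.005, "birth_year": 0.02}

pair = {"forename": True, "surname": True, "birth_year": False}
score = fellegi_sunter_weight(pair, m_probs, u_probs)
```

Because each field contributes its own term regardless of the others, correlated fields such as forename and gender would each add evidence as if they were unrelated. This is the limitation that a method using variable dependence aims to address.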

The Scalelink method is experimental. This implementation has been open-sourced to facilitate further research regarding the suitability of the Scalelink method for real-world data linkage, particularly on Big Data. The authors do not currently endorse using the Scalelink method to produce linked datasets for research, analysis or statistics.

Contact

For questions, support or feedback about scalelink, please email [email protected].

Acknowledgements

The authors would like to acknowledge the following collaborators for their contributions to the Scalelink research project within the Office for National Statistics:

  • William Browne (University of Bristol)
  • Christopher Charlton (University of Bristol)
  • James Doidge (Intensive Care National Audit & Research Centre)
  • Harvey Goldstein (University of Bristol and University College London)
  • Katie Harron (University College London)
  • Leah Maizey (Office for National Statistics)
  • Josie Plachta (Office for National Statistics)
  • Rachel Shipsey (Office for National Statistics)
  • Paul Smith (University of Southampton)

The authors would also like to acknowledge the following individuals for their support in making this code public:

  • Dominic Bean (Office for National Statistics)
  • Diego Lara de Andres (Office for National Statistics)
  • Zoe White (Office for National Statistics)

Dedication

This project is dedicated to the memory of Harvey Goldstein (1939-2020).

License

Unless stated otherwise, the codebase is released under the MIT License. This covers both the codebase and any sample code in the documentation.

The documentation is © Crown copyright and available under the terms of the Open Government Licence v3.0.
