@adamdecaf adamdecaf commented Jul 2, 2025

The idea is to generate a list of sortable keys (the buckets each field hashes into) so that we can find similar records. You can compare against several of these keys at once and grab the rows that sort just above or below each key, which cuts down the number of detailed similarity scoring calls we need to make.

"TYPE:0230"
"NAME:0190"

// Country | Type | Identifier
"GOVID:C0173|T0190|X0146"

// Country | State | PostalCode | City | Line1 | Line2 [optional]
"ADDR:C0143|S0021|P0007|Y0023|L0201,0028,0173"
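To make the key layout above concrete, here is a minimal sketch that assembles an `ADDR` key from bucket numbers that have already been computed. How fields are assigned to buckets is not covered here, so the function names and integer inputs are hypothetical, and reading the `L` component as a comma-separated list of buckets is an assumption based on the example. Zero-padding each bucket to a fixed width is what keeps the string sort order consistent with the numeric order.

```python
def format_bucket(prefix: str, bucket: int) -> str:
    # Fixed-width, zero-padded buckets sort the same as strings and as numbers.
    return f"{prefix}{bucket:04d}"


def address_key(country: int, state: int, postal: int, city: int, lines=None) -> str:
    # Hypothetical helper: components are ordered from general to specific.
    parts = [
        format_bucket("C", country),
        format_bucket("S", state),
        format_bucket("P", postal),
        format_bucket("Y", city),
    ]
    if lines:
        # Assumption: the line component holds several buckets joined by commas.
        parts.append("L" + ",".join(f"{b:04d}" for b in lines))
    return "ADDR:" + "|".join(parts)


key = address_key(143, 21, 7, 23, [201, 28, 173])
print(key)
```

With those inputs the sketch reproduces the `ADDR` example key shown above.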

You could then compute a traditional string distance metric (edit distance, for example) over these sortable keys to rank the most similar records. Within each key, the fields move from general to more specific.
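As a rough illustration of ranking by string distance, the sketch below uses Levenshtein edit distance, which is just one possible metric and not necessarily the one this PR uses. The candidate keys are made-up sample data, and the keys are treated as plain text.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, keeping one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def rank_by_distance(target: str, candidates: list[str]) -> list[str]:
    # Closest keys first; ties are broken by the key itself for stable output.
    return sorted(candidates, key=lambda k: (levenshtein(target, k), k))


TARGET = "ADDR:C0143|S0021|P0007|Y0023|L0201,0028,0173"
CANDIDATES = [
    "ADDR:C0020|S0002|P0311|Y0107|L0099,0040,0005",
    "ADDR:C0143|S0021|P0007|Y0023|L0201,0028,0174",
    "ADDR:C0143|S0021|P0009|Y0051|L0007,0113,0020",
]

for key in rank_by_distance(TARGET, CANDIDATES):
    print(levenshtein(TARGET, key), key)
```

Because the general fields come first, keys that share a long prefix with the target tend to need fewer edits, so they rank near the top.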

With the broad fields on the left, this allows for prefix filtering in SQL. You could strip out the Line1/Line2 data and filter down to the city level, or find the rows near an exact address by grabbing those just greater and less than the target key.
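Both lookups described above can be sketched with an in-memory SQLite database. The `records` table, the `addr_key` column, the helper names, and the sample rows are all hypothetical; this only shows the shape of the queries, not the PR's actual schema.

```python
import sqlite3

ROWS = [
    (1, "ADDR:C0143|S0021|P0007|Y0023|L0201,0028,0173"),
    (2, "ADDR:C0143|S0021|P0007|Y0023|L0201,0028,0180"),
    (3, "ADDR:C0143|S0021|P0007|Y0023|L0150,0012,0044"),
    (4, "ADDR:C0143|S0021|P0009|Y0051|L0007,0113,0020"),
    (5, "ADDR:C0020|S0002|P0311|Y0107|L0099,0040,0005"),
]
TARGET = ROWS[0][1]


def build_db(rows):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, addr_key TEXT)")
    # An index on the key lets the ordered range scans below stay cheap.
    conn.execute("CREATE INDEX idx_addr_key ON records (addr_key)")
    conn.executemany("INSERT INTO records VALUES (?, ?)", rows)
    return conn


def city_prefix(addr_key: str) -> str:
    # Drop the trailing Line1/Line2 component, keeping everything up to the city.
    if "|L" not in addr_key:
        return addr_key + "|"
    head, _, _ = addr_key.rpartition("|L")
    return head + "|"


def same_city(conn, addr_key: str) -> list[int]:
    # Prefix filter: every row whose key starts with the city-level prefix.
    cur = conn.execute(
        "SELECT id FROM records WHERE addr_key LIKE ? || '%' ORDER BY addr_key",
        (city_prefix(addr_key),),
    )
    return [row[0] for row in cur]


def neighbors(conn, addr_key: str, n: int) -> list[int]:
    # Up to n rows sorting just below the target and n rows just above it.
    below = conn.execute(
        "SELECT id FROM records WHERE addr_key < ? ORDER BY addr_key DESC LIMIT ?",
        (addr_key, n),
    ).fetchall()
    above = conn.execute(
        "SELECT id FROM records WHERE addr_key > ? ORDER BY addr_key ASC LIMIT ?",
        (addr_key, n),
    ).fetchall()
    return [r[0] for r in reversed(below)] + [r[0] for r in above]


conn = build_db(ROWS)
print(same_city(conn, TARGET))
print(neighbors(conn, TARGET, 1))
```

The neighbor query returns a small candidate set around the target key, which could then be handed to the detailed similarity scoring.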

@adamdecaf adamdecaf force-pushed the feat-add-record-linkage branch from c342518 to d1a2f71 Compare November 11, 2025 22:43