Description
I am implementing a Python version of the library for my own use case: https://github.com/anirudhgangwal/ukpostcodes. The library mimics the functionality available here, including lookup in the ONS database (but I don't use a DB or an API call to postcode.io; I just have a set of ~1.8M postcodes).
We parse postcodes from OCR output, and "O" and "I" misreads account for almost all of our errors. The fix implemented here helped reduce our error rate significantly. However, I want to understand whether there was a reason not to expand this auto-correct further.
Let's take the example of a 3-character outcode. This can take the following forms:
A9A 9AA
A99 9AA
AA9 9AA
Since the second and third characters can each be either a letter or a number, this library currently only coerces for "L??", i.e. it only corrects the first character.
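To make the pattern notation concrete, here is a minimal sketch of what pattern-based coercion could look like. The `coerce` helper and the two lookup tables below are assumptions written for illustration, not the library's actual code: "L" forces a letter, "N" forces a digit, and "?" leaves the character untouched. It assumes the pattern and the text have the same length.

```python
# Hypothetical lookup tables for the two OCR confusions discussed here.
TO_LETTER = {"0": "O", "1": "I"}
TO_NUMBER = {"O": "0", "I": "1"}


def coerce(pattern: str, text: str) -> str:
    """Coerce each character of `text` according to `pattern`.

    'L' forces a letter, 'N' forces a digit, anything else (e.g. '?')
    leaves the character as it is. Assumes len(pattern) == len(text).
    """
    out = []
    for kind, char in zip(pattern, text):
        if kind == "L":
            out.append(TO_LETTER.get(char, char))
        elif kind == "N":
            out.append(TO_NUMBER.get(char, char))
        else:
            out.append(char)
    return "".join(out)


# "L??" only touches the first character, so a misread in position
# two or three survives.
print(coerce("L??", "0O1"))
# A fully specified pattern corrects every position.
print(coerce("LNN", "OOO"))
```

With "L??" the ambiguity in the second and third positions is left unresolved, which is why trying each concrete pattern separately can recover more postcodes.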
I think it would be possible to add a new function, or a parameter to the existing function, which returns a list. E.g.
fix("OOO 4SS") => ["O00 4SS", "OO0 4SS", "O0O 4SS"] # try LNN, LLN, and LNL
A quick Python implementation looked like this:
    import re
    from typing import List

    # FIXABLE_REGEX, coerce, coerce_outcode, coerce_incode and is_valid_outcode
    # are assumed to be defined elsewhere in the library.


    def fix_with_options(s: str) -> List[str]:
        """Attempts to fix a given postcode, covering all options.

        Args:
            s (str): The postcode to fix

        Returns:
            List[str]: The candidate fixed postcodes
        """
        if not FIXABLE_REGEX.match(s):
            return [s]
        # str.replace does not accept a regex, so use re.sub to drop whitespace
        s = re.sub(r"\s+", "", s.upper().strip())
        inward = s[-3:].strip()
        outward = s[:-3].strip()
        outcode_options = coerce_outcode_with_options(outward)
        return [
            f"{coerce_outcode(option)} {coerce_incode(inward)}"
            for option in outcode_options
        ]


    def coerce_outcode_with_options(i: str) -> List[str]:
        """Coerce outcode, but cover all possibilities"""
        if len(i) == 2:
            return [coerce("LN", i)]
        elif len(i) == 3:
            outcodes = []
            if is_valid_outcode(outcode := coerce("LNL", i)):
                outcodes.append(outcode)
            if is_valid_outcode(outcode := coerce("LNN", i)):
                outcodes.append(outcode)
            if is_valid_outcode(outcode := coerce("LLN", i)):
                outcodes.append(outcode)
            return list(set(outcodes))
        elif len(i) == 4:
            outcodes = []
            if is_valid_outcode(outcode := coerce("LLNL", i)):
                outcodes.append(outcode)
            if is_valid_outcode(outcode := coerce("LLNN", i)):
                outcodes.append(outcode)
            return list(set(outcodes))
        else:
            return [i]

This reduced our error rate further, and significantly so, since most of our errors came from misreading 0. For our use case this made sense: after checking the candidates against the ONS directory, there were negligible false positives.
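The directory check mentioned above could be sketched roughly as follows. This is only an illustration under assumptions: `keep_known` is a hypothetical helper, the known postcodes are assumed to be held in a plain Python set with spaces removed, and the values in the example are made up rather than taken from any real directory.

```python
from typing import Iterable, List, Set


def keep_known(candidates: Iterable[str], known: Set[str]) -> List[str]:
    """Return the candidates found in `known`, in order, without duplicates.

    `known` is assumed to hold upper-case postcodes with spaces removed.
    """
    seen = set()
    kept = []
    for candidate in candidates:
        key = candidate.replace(" ", "").upper()
        if key in known and key not in seen:
            seen.add(key)
            kept.append(candidate)
    return kept


# Made-up values purely for illustration; not checked against real data.
known_postcodes = {"O004SS"}
candidates = ["O00 4SS", "OO0 4SS", "O0O 4SS"]
print(keep_known(candidates, known_postcodes))
```

If exactly one candidate survives the lookup it can be accepted with some confidence; zero or several survivors would still need manual review or another signal.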