Enhancing fix() #488

@anirudhgangwal

I am implementing a Python version of this library for my own use case: https://github.com/anirudhgangwal/ukpostcodes. It mimics the functionality available here, including lookup against the ONS database (I don't use a database or the postcodes.io API, just an in-memory set of ~1.8M postcodes).

We parse postcodes from OCR output, and confusions between "O" and "0" and between "I" and "1" account for almost all of our errors. The fix implemented here helped reduce our error rate significantly. However, I want to understand whether there was a reason not to expand this auto-correct further.

Let's take the example of a three-character outcode. It can take the following forms:
A9A 9AA
A99 9AA
AA9 9AA

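Since the second and third characters can each be either a letter or a number, this library currently coerces a three-character outcode only against "L??": the first character is forced to a letter and the other two are left as they are.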

I think it would be possible to add a new function, or a parameter to the existing function, that returns a list of candidates. E.g.

fix("OOO 4SS") => ["OO0 4SS", "O00 4SS", "O0O 4SS"] # try LLN, LNN, and LNL

A quick Python implementation looked like this:

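import re
from typing import List

# FIXABLE_REGEX, coerce, coerce_outcode, coerce_incode and is_valid_outcode
# come from the rest of the library and are not shown here.

def fix_with_options(s: str) -> List[str]:
    """Attempts to fix a given postcode, covering all options.

    Args:
        s (str): The postcode to fix
    Returns:
        List[str]: The candidate fixed postcodes
    """
    if not FIXABLE_REGEX.match(s):
        return [s]  # not fixable: return the input unchanged, still as a list
    # str.replace does not take a regex, so use re.sub to drop all whitespace
    s = re.sub(r"\s+", "", s.upper())
    inward = s[-3:]
    outward = s[:-3]
    outcode_options = coerce_outcode_with_options(outward)
    return [
        f"{coerce_outcode(option)} {coerce_incode(inward)}"
        for option in outcode_options
    ]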

def coerce_outcode_with_options(i: str) -> List[str]:
    """Coerce outcode, but cover all possibilities"""
    if len(i) == 2:
        return [coerce("LN", i)]
    if len(i) == 3:
        patterns = ["LNL", "LNN", "LLN"]
    elif len(i) == 4:
        patterns = ["LLNL", "LLNN"]
    else:
        return [i]
    # Try every pattern and keep only the valid outcodes. dict.fromkeys
    # de-duplicates while preserving pattern order (list(set(...)) would
    # return the candidates in arbitrary order).
    candidates = [coerce(pattern, i) for pattern in patterns]
    return list(dict.fromkeys(c for c in candidates if is_valid_outcode(c)))
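
The snippet above relies on a `coerce(pattern, text)` helper that is not shown. As a rough, self-contained sketch of how such a helper could work, the version below assumes only the O/0 and I/1 swaps discussed in this issue. The names `TO_LETTER`, `TO_NUMBER` and `three_char_candidates` are made up for illustration and are not taken from either library, and no validity check is applied here.

```python
from typing import List

# Assumed confusion tables: only the O/0 and I/1 swaps discussed in this issue.
TO_LETTER = {"0": "O", "1": "I"}
TO_NUMBER = {"O": "0", "I": "1"}


def coerce(pattern: str, text: str) -> str:
    """Coerce each character of `text` to the class named in `pattern`.

    "L" forces a letter, "N" forces a number, and any other pattern
    character (e.g. "?") leaves the input character untouched.
    """
    out = []
    for kind, char in zip(pattern, text):
        if kind == "L":
            out.append(TO_LETTER.get(char, char))
        elif kind == "N":
            out.append(TO_NUMBER.get(char, char))
        else:
            out.append(char)
    return "".join(out)


def three_char_candidates(outcode: str) -> List[str]:
    """Return de-duplicated readings for LLN, LNN and LNL, in that order."""
    candidates = [coerce(p, outcode) for p in ("LLN", "LNN", "LNL")]
    return list(dict.fromkeys(candidates))


print(three_char_candidates("OOO"))  # ['OO0', 'O00', 'O0O']
```

Characters outside the two tables pass through unchanged, so an outcode that contains no ambiguous characters yields a single candidate.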

Returning every valid candidate reduced our error rate further, and significantly so, since most of our remaining errors came from misread 0s. Note that this made sense for our use case because, after checking the candidates against the ONS directory, there were negligible false positives.
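
As a minimal sketch of that last step, the candidates can be filtered against a set of known postcodes. The function name `keep_known` and the tiny `known` set are placeholders for illustration only. The set stands in for the ~1.8M-entry set mentioned above and does not contain real ONS data.

```python
from typing import Iterable, List, Set


def keep_known(candidates: Iterable[str], known: Set[str]) -> List[str]:
    """Keep only the candidates present in `known`, preserving their order."""
    return [candidate for candidate in candidates if candidate in known]


# Placeholder set for illustration; a real one would be loaded from the
# ONS postcode directory.
known = {"OO0 4SS"}
print(keep_known(["OO0 4SS", "O00 4SS", "O0O 4SS"], known))  # ['OO0 4SS']
```

If more than one candidate survives the lookup, the input is genuinely ambiguous and the caller has to decide how to handle it.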
