
Bugfix for scrubber sample code which fails when scrubbing "two" #232

@0dB

Description


The code in the "Scrubber" section of https://derwen.ai/docs/ptr/sample/ has a small bug: when a token in the prefix list also occurs as a standalone term in the text, like "two", the while loop consumes the whole span, and span[0] then fails with an IndexError. Easy fix:

In this code (using my tokens instead of the ones on the page):

from spacy.tokens import Span

def prefix_scrubber():
    def scrubber_func(span: Span) -> str:
        while span[0].text in ("every", "other", "the", "two"): # ATTN: different tokens; will fail in the original code
            span = span[1:]
        return span.text
    return scrubber_func

just add a len(span) > 1 guard, i.e. replace

while span[0].text in ("every", "other", "the", "two"):

with

while len(span) > 1 and span[0].text in ("every", "other", "the", "two"):

to get

def prefix_scrubber():
    def scrubber_func(span: Span) -> str:
        while len(span) > 1 and span[0].text in ("every", "other", "the", "two"):
            span = span[1:]
        return span.text
    return scrubber_func

Now, for the sample used on that page, I get

0.13134098, 05, sentences, [sentences, the two sentences, sentences, two sentences, the sentences]
0.07117996, 02, sentence, [every sentence, every other sentence]

and the line for "two" is still fine:

0.00000000, 02, two, [two, two]

You are welcome to use the token list I used, ("every", "other", "the", "two"); it produces even more merged results than the example on the page.
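The effect of the guard can be demonstrated without loading spaCy at all, using a small stand-in class. Note that Tok and FakeSpan below are hypothetical helpers written for this sketch (not part of spaCy or pytextrank); they mimic only the Span behavior the scrubber relies on — indexing, slicing, len(), and .text:

```python
from dataclasses import dataclass

@dataclass
class Tok:
    """Stand-in for a spaCy Token: just carries .text."""
    text: str

class FakeSpan:
    """Stand-in for spacy.tokens.Span: supports [i], [1:], len(), .text."""
    def __init__(self, words):
        self._toks = [Tok(w) for w in words]
    def __getitem__(self, key):
        if isinstance(key, slice):
            sub = FakeSpan([])
            sub._toks = self._toks[key]
            return sub
        return self._toks[key]
    def __len__(self):
        return len(self._toks)
    @property
    def text(self):
        return " ".join(t.text for t in self._toks)

PREFIXES = ("every", "other", "the", "two")

def scrub_original(span):
    # original sample code: no length guard, so a span made
    # entirely of prefix tokens is consumed and span[0] raises
    while span[0].text in PREFIXES:
        span = span[1:]
    return span.text

def scrub_fixed(span):
    # fixed version: stop before the span is fully consumed
    while len(span) > 1 and span[0].text in PREFIXES:
        span = span[1:]
    return span.text
```

Calling scrub_fixed on a span like ["the", "two", "sentences"] strips the prefixes and returns "sentences", and on the single-token span ["two"] it returns "two" unchanged, whereas scrub_original raises IndexError on that same input.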
