fix: support accented characters in word segmentation for return_word… #17201
base: main
Conversation
…_box

Fixes PaddlePaddle#17156

The word segmentation in `get_word_info()` was using a `[a-zA-Z0-9]` regex, which only matches ASCII letters and digits. This caused words with accented characters (ä, ö, ü, é, à, etc.) to be incorrectly split into separate segments.

Changed to use `\w` with the `re.UNICODE` flag, which properly matches:

- All Unicode letter characters (including accented/diacritic characters)
- Digits from all scripts
- Excludes underscore (which `\w` includes but we want to treat as a splitter)

This fix enables proper word grouping for German, French, Polish, and other languages with accented characters while maintaining backward compatibility with existing ASCII text processing. Example: `Grüßen` now stays as one word instead of `['Gr', 'üß', 'en']`.
Thanks for your contribution!
Pull request overview
This PR fixes word segmentation for accented characters in OCR text recognition by updating the character classification regex in the get_word_info() method. The change enables proper handling of German, French, Polish, and other languages with diacritic marks (ä, ö, ü, é, à, etc.) when the return_word_box parameter is enabled.
Key Changes:
- Modified character classification from the ASCII-only pattern `[a-zA-Z0-9]` to the Unicode-aware `\w` pattern with explicit underscore exclusion
- Added explanatory comments documenting the change and underscore handling
- Maintains backward compatibility with existing ASCII text processing
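The before/after behavior of that classification can be sketched with a standalone check (the helper names below are hypothetical; in the PR the logic lives inline inside `get_word_info()`):

```python
import re

def is_word_char_old(ch: str) -> bool:
    # Old ASCII-only classification from the original code
    return bool(re.search(r"[a-zA-Z0-9]", ch))

def is_word_char_new(ch: str) -> bool:
    # New Unicode-aware classification with explicit underscore exclusion
    return bool(re.search(r"[\w]", ch, re.UNICODE)) and ch != "_"

# Accented characters are now treated as word characters,
# while underscore remains a splitter under both classifications.
for ch in "Grüß_9":
    print(ch, is_word_char_old(ch), is_word_char_new(ch))
```

For ASCII letters and digits the two checks agree, which is why the change is backward compatible.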
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
ppocr/postprocess/rec_postprocess.py (outdated)

```python
elif bool(re.search(r"[\w]", char, re.UNICODE)) and not char == "_":
    # Use \w with UNICODE flag to match letters (including accented chars like ä, ö, ü, é, etc.) and digits
    # Exclude underscore since \w includes it but we want to treat it as splitter
    c_state = "en&num"
```
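Taken on its own, the changed branch assigns the `en&num` state roughly as follows (a minimal sketch; `en&num` is the state label from the diff, while `splitter` stands in for the fallback, and the real method also distinguishes other character states before reaching this branch):

```python
import re

def classify(char: str) -> str:
    # Sketch of the changed branch only, for a single character.
    if bool(re.search(r"[\w]", char, re.UNICODE)) and not char == "_":
        return "en&num"   # letters (including accented) and digits
    return "splitter"     # spaces, punctuation, underscore

print([classify(c) for c in "Grüß_1 "])
```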
Copilot AI · Nov 26, 2025
This fix for accented character word segmentation lacks test coverage. Consider adding a test case that verifies the word segmentation works correctly with accented characters (e.g., "Grüßen", "ungewöhnlichen") and that underscores are properly treated as splitters.
Example test case structure:

```python
import numpy as np
from ppocr.postprocess.rec_postprocess import BaseRecLabelDecode

def test_get_word_info_with_accented_chars():
    decoder = BaseRecLabelDecode()
    # Test German with accented characters
    text = "Grüßen"
    selection = np.ones(len(text), dtype=bool)
    word_list, _, state_list = decoder.get_word_info(text, selection)
    assert len(word_list) == 1  # Should be one word, not split
    assert "".join(word_list[0]) == "Grüßen"
```

Co-authored-by: Copilot <[email protected]>
Please sign the CLA

Please solve the conflict
Fixes #17156
Problem
The `return_word_box` parameter was splitting words with accented/diacritic characters (ä, ö, ü, é, à, etc.) into separate segments. For example:

- `Grüßen` was split into `['Gr', 'üß', 'en']`
- `[email protected]` was also incorrectly segmented

Root Cause
The `get_word_info()` method in `ppocr/postprocess/rec_postprocess.py` was using the regex pattern `[a-zA-Z0-9]`, which only matches ASCII letters and digits, excluding accented characters used in German, French, Polish, and other languages.

Solution
Changed the character classification to use `\w` with the `re.UNICODE` flag, which properly matches:

- All Unicode letter characters (including accented/diacritic characters)
- Digits from all scripts
- Excludes underscore (which `\w` includes but we treat as a splitter)

Impact
Testing
The fix addresses the specific examples mentioned in #17156:
- `Grüßen` now stays as one word
- `ungewöhnlichen` remains intact
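The grouping behavior behind these examples can be reproduced with a minimal, self-contained sketch (the `split_words` helper is hypothetical, not the actual `get_word_info()` implementation, but it applies the same updated character classification):

```python
import re

def split_words(text: str) -> list[str]:
    # Hypothetical sketch: group consecutive word characters,
    # splitting on spaces, underscores, and punctuation, mirroring
    # the updated classification used in get_word_info().
    words, current = [], []
    for ch in text:
        if re.search(r"[\w]", ch, re.UNICODE) and ch != "_":
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words

print(split_words("Mit freundlichen Grüßen"))  # accented word stays whole
print(split_words("ein ungewöhnlichen_Test"))  # underscore acts as a splitter
```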