fix: support accented characters in word segmentation for return_word… #17201

Ghazi-raad · 2025-11-26T21:05:25Z

Problem

With return_word_box enabled, words containing accented/diacritic characters (ä, ö, ü, é, à, etc.) were split into separate segments. For example:

Grüßen was split into ['Gr', 'üß', 'en']
Email addresses like [email protected] were also incorrectly segmented

Root Cause

The get_word_info() method in ppocr/postprocess/rec_postprocess.py was using the regex pattern [a-zA-Z0-9] which only matches ASCII letters and digits, excluding accented characters used in German, French, Polish, and other languages.
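A minimal sketch of why the ASCII-only class produces the split reported above. It is a simplified stand-in, not the actual get_word_info() logic: the helper name split_by_class is hypothetical, and it simply groups consecutive characters by whether the pattern matches them.

```python
import re
from itertools import groupby

# The ASCII-only character class named in the root cause.
ASCII_WORD = re.compile(r"[a-zA-Z0-9]")


def split_by_class(text, pattern):
    """Group consecutive characters that share the same match / no-match result.

    Hypothetical helper for illustration only; the real method tracks
    more states than this two-way split.
    """
    return [
        "".join(run)
        for _, run in groupby(text, key=lambda ch: bool(pattern.search(ch)))
    ]


# 'ü' and 'ß' fall outside [a-zA-Z0-9], so they form their own run.
print(split_by_class("Grüßen", ASCII_WORD))  # ['Gr', 'üß', 'en']
```

Because the class boundary falls between "r" and "ü" and again between "ß" and "e", the word breaks into three runs.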

Solution

Changed the character classification to use \w with the re.UNICODE flag, combined with an explicit check that excludes the underscore. The new classification matches:

All Unicode letter characters (including accented/diacritic characters)
Digits from all scripts

The underscore is excluded: \w matches it, but it is treated as a splitter.
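The classification described here can be sketched as a standalone predicate. The function name is_word_char is hypothetical; the condition mirrors the one in this PR's diff. Note that in Python 3 str patterns are Unicode-aware by default, so passing re.UNICODE is redundant but harmless.

```python
import re


def is_word_char(char: str) -> bool:
    """Return True for Unicode letters and digits, False for '_' and punctuation.

    Hypothetical helper: \\w alone would also accept the underscore,
    so it is rejected explicitly.
    """
    return bool(re.search(r"\w", char, re.UNICODE)) and char != "_"


for ch in ["a", "ü", "ß", "é", "7", "_", "@", "-", "."]:
    print(repr(ch), is_word_char(ch))
```

One thing to keep in mind: \w also matches CJK ideographs and other scripts, so any script that needs its own state has to be tested before this branch.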

Impact

Enables proper word grouping for German, French, Polish, and other languages with accented characters
Maintains backward compatibility with existing ASCII text processing
No breaking changes to the API or behavior for ASCII-only text

Testing

The fix addresses the specific examples mentioned in #17156:

German: Grüßen now stays as one word
German: ungewöhnlichen remains intact
Email addresses still need the additional email-specific handling discussed in the issue
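A rough sketch of the expected behavior for these cases, using a simplified grouping rather than the real get_word_info(). The helpers is_word_char and word_runs are hypothetical, and user@example.com is a made-up address used only to show that "@" and "." still act as splitters.

```python
import re
from itertools import groupby


def is_word_char(ch: str) -> bool:
    # Unicode-aware word test with the underscore treated as a splitter.
    return bool(re.search(r"\w", ch)) and ch != "_"


def word_runs(text: str) -> list:
    """Return the runs of word characters, dropping splitter runs.

    Simplified stand-in for illustration; it ignores the special cases
    the real method may apply to other characters.
    """
    return [
        "".join(run)
        for is_word, run in groupby(text, key=is_word_char)
        if is_word
    ]


print(word_runs("Grüßen"))            # ['Grüßen']
print(word_runs("ungewöhnlichen"))    # ['ungewöhnlichen']
print(word_runs("user@example.com"))  # ['user', 'example', 'com']
```

The last line shows why email addresses need separate handling: changing the letter class alone does not join segments across "@" or ".".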

…_box

Fixes PaddlePaddle#17156

The word segmentation in get_word_info() was using [a-zA-Z0-9] regex which only matched ASCII letters and digits. This caused words with accented characters (ä, ö, ü, é, à, etc.) to be incorrectly split into separate segments.

Changed to use \w with re.UNICODE flag which properly matches:
- All Unicode letter characters (including accented/diacritic characters)
- Digits from all scripts
- Excludes underscore (which \w includes but we want as splitter)

This fix enables proper word grouping for German, French, Polish, and other languages with accented characters while maintaining backward compatibility with existing ASCII text processing.

Example: 'Grüßen' now stays as one word instead of ['Gr', 'üß', 'en']

paddle-bot · 2025-11-26T21:05:30Z

Thanks for your contribution!

CLAassistant · 2025-11-26T21:05:33Z

All committers have signed the CLA.

Copilot

Pull request overview

This PR fixes word segmentation for accented characters in OCR text recognition by updating the character classification regex in the get_word_info() method. The change enables proper handling of German, French, Polish, and other languages with diacritic marks (ä, ö, ü, é, à, etc.) when the return_word_box parameter is enabled.

Key Changes:

Modified character classification from ASCII-only pattern [a-zA-Z0-9] to Unicode-aware \w pattern with explicit underscore exclusion
Added explanatory comments documenting the change and underscore handling
Maintains backward compatibility with existing ASCII text processing

ppocr/postprocess/rec_postprocess.py

Copilot · 2025-11-26T21:08:10Z

ppocr/postprocess/rec_postprocess.py

+            elif bool(re.search(r"[\w]", char, re.UNICODE)) and not char == "_":
+                # Use \w with UNICODE flag to match letters (including accented chars like ä, ö, ü, é, etc.) and digits
+                # Exclude underscore since \w includes it but we want to treat it as splitter
                c_state = "en&num"


This fix for accented character word segmentation lacks test coverage. Consider adding a test case that verifies the word segmentation works correctly with accented characters (e.g., "Grüßen", "ungewöhnlichen") and that underscores are properly treated as splitters.

Example test case structure:

    # Assumes the PaddleOCR repository root is on the import path.
    import numpy as np
    from ppocr.postprocess.rec_postprocess import BaseRecLabelDecode

    def test_get_word_info_with_accented_chars():
        decoder = BaseRecLabelDecode()
        # Test German with accented characters
        text = "Grüßen"
        selection = np.ones(len(text), dtype=bool)
        word_list, _, state_list = decoder.get_word_info(text, selection)
        assert len(word_list) == 1  # Should be one word, not split
        assert ''.join(word_list[0]) == "Grüßen"

Co-authored-by: Copilot <[email protected]>

Bobholamovic · 2025-12-02T03:54:32Z

Please sign the CLA

luotao1 · 2025-12-05T03:23:03Z

Please solve the conflict

Copilot AI review requested due to automatic review settings November 26, 2025 21:05

paddle-bot bot added the contributor label Nov 26, 2025

Copilot started reviewing on behalf of Ghazi-raad November 26, 2025 21:05

Copilot finished reviewing on behalf of Ghazi-raad November 26, 2025 21:07

Copilot AI reviewed Nov 26, 2025

Update ppocr/postprocess/rec_postprocess.py

aec0b2c

Co-authored-by: Copilot <[email protected]>

Merge branch 'main' into fix/langchain-docstore-import-17186

437e3e7

Bobholamovic previously approved these changes Dec 3, 2025

Merge branch 'main' into fix/langchain-docstore-import-17186

1021bfe

Ghazi-raad dismissed Bobholamovic’s stale review via 1021bfe December 5, 2025 03:36

Bobholamovic approved these changes Dec 5, 2025

4 participants
