@Ghazi-raad

Fixes #17156

Problem

The return_word_box parameter was splitting words with accented/diacritic characters (ä, ö, ü, é, à, etc.) into separate segments. For example:

  • Grüßen was split into ['Gr', 'üß', 'en']
  • Email addresses like [email protected] were also incorrectly segmented

Root Cause

The get_word_info() method in ppocr/postprocess/rec_postprocess.py was using the regex pattern [a-zA-Z0-9] which only matches ASCII letters and digits, excluding accented characters used in German, French, Polish, and other languages.
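The effect of an ASCII-only character class can be reproduced in isolation. The sketch below is a simplified stand-in, not the actual get_word_info() implementation: the helper name split_by_ascii_class is hypothetical, and it only groups consecutive characters by whether they match [a-zA-Z0-9].

```python
import re
from itertools import groupby

# The ASCII-only character class described above.
ASCII_ALNUM = re.compile(r"[a-zA-Z0-9]")


def split_by_ascii_class(text):
    """Group consecutive characters that share the same classification.

    Hypothetical helper for illustration only; the real method also
    tracks column positions and other character states.
    """
    return [
        "".join(group)
        for _, group in groupby(text, key=lambda ch: bool(ASCII_ALNUM.match(ch)))
    ]


word = "Gr\u00fc\u00dfen"  # "Grüßen", written with escapes to pin the code points
print(split_by_ascii_class(word))  # ['Gr', 'üß', 'en']
```

Because ü and ß fall outside the ASCII ranges, they form their own group in the middle of the word, which matches the split reported in the issue.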

Solution

Changed the character classification to use \w with the re.UNICODE flag, combined with an explicit check that excludes the underscore. The new classification:

  • Matches all Unicode letter characters (including accented/diacritic characters)
  • Matches digits from all scripts
  • Excludes the underscore (which \w matches, but which is treated as a splitter)
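A minimal sketch of this classification rule on its own, assuming a hypothetical helper named is_word_char that is not part of the PR. In Python 3, \w is already Unicode-aware for str patterns, so passing re.UNICODE is redundant but harmless; it is kept here to mirror the description.

```python
import re


def is_word_char(char):
    """Return True for Unicode letters and digits, False for '_' and the rest.

    Hypothetical helper for illustration; it mirrors the condition
    described above, not the full state machine in get_word_info().
    """
    return bool(re.search(r"\w", char, re.UNICODE)) and char != "_"


word = "Gr\u00fc\u00dfen"  # "Grüßen"
print(all(is_word_char(ch) for ch in word))  # True
print(is_word_char("_"))  # False
print(is_word_char("@"))  # False
```

Every character of the example word falls into the same class, so no split point arises inside it, while the underscore keeps acting as a separator.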

Impact

  • Enables proper word grouping for German, French, Polish, and other languages with accented characters
  • Maintains backward compatibility with existing ASCII text processing
  • No breaking changes to the API or behavior for ASCII-only text

Testing

The fix addresses the specific examples mentioned in #17156:

  • German: Grüßen now stays as one word
  • German: ungewöhnlichen remains intact
  • Email addresses still need the additional email-specific handling discussed in the issue
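The email caveat can be illustrated with a simplified segmenter. This is a sketch under stated assumptions: segment and is_word_char are hypothetical helpers, the address is an invented example, and the real get_word_info() applies additional rules that are not modeled here.

```python
import re
from itertools import groupby


def is_word_char(char):
    """Unicode letter or digit, with the underscore excluded (hypothetical helper)."""
    return bool(re.match(r"\w", char)) and char != "_"


def segment(text):
    """Group consecutive characters by whether they are word characters."""
    return ["".join(group) for _, group in groupby(text, key=is_word_char)]


print(segment("user@example.com"))  # ['user', '@', 'example', '.', 'com']
```

The characters @ and . are not word characters under \w, so a Unicode-aware class alone does not keep an address together.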

…_box

Fixes PaddlePaddle#17156

The word segmentation in get_word_info() was using [a-zA-Z0-9] regex which
only matched ASCII letters and digits. This caused words with accented
characters (ä, ö, ü, é, à, etc.) to be incorrectly split into separate
segments.

Changed to use \w with re.UNICODE flag which properly matches:
- All Unicode letter characters (including accented/diacritic characters)
- Digits from all scripts
- Excludes underscore (which \w includes but we want as splitter)

This fix enables proper word grouping for German, French, Polish, and
other languages with accented characters while maintaining backward
compatibility with existing ASCII text processing.

Example: 'Grüßen' now stays as one word instead of ['Gr', 'üß', 'en']
Copilot AI review requested due to automatic review settings November 26, 2025 21:05
@paddle-bot

paddle-bot bot commented Nov 26, 2025

Thanks for your contribution!

@CLAassistant

CLAassistant commented Nov 26, 2025

CLA assistant check
All committers have signed the CLA.

Copilot finished reviewing on behalf of Ghazi-raad November 26, 2025 21:07

Copilot AI left a comment

Pull request overview

This PR fixes word segmentation for accented characters in OCR text recognition by updating the character classification regex in the get_word_info() method. The change enables proper handling of German, French, Polish, and other languages with diacritic marks (ä, ö, ü, é, à, etc.) when the return_word_box parameter is enabled.

Key Changes:

  • Modified character classification from ASCII-only pattern [a-zA-Z0-9] to Unicode-aware \w pattern with explicit underscore exclusion
  • Added explanatory comments documenting the change and underscore handling
  • Maintains backward compatibility with existing ASCII text processing

Comment on lines 98 to 101
elif bool(re.search(r"[\w]", char, re.UNICODE)) and not char == "_":
    # Use \w with UNICODE flag to match letters (including accented chars like ä, ö, ü, é, etc.) and digits
    # Exclude underscore since \w includes it but we want to treat it as splitter
    c_state = "en&num"

Copilot AI Nov 26, 2025

This fix for accented character word segmentation lacks test coverage. Consider adding a test case that verifies the word segmentation works correctly with accented characters (e.g., "Grüßen", "ungewöhnlichen") and that underscores are properly treated as splitters.

Example test case structure:

import numpy as np

from ppocr.postprocess.rec_postprocess import BaseRecLabelDecode


def test_get_word_info_with_accented_chars():
    decoder = BaseRecLabelDecode()
    # Test German with accented characters
    text = "Grüßen"
    # Mark every character position as selected
    selection = np.ones(len(text), dtype=bool)
    word_list, _, state_list = decoder.get_word_info(text, selection)
    assert len(word_list) == 1  # Should be one word, not split
    assert "".join(word_list[0]) == "Grüßen"

@Bobholamovic
Member

Please sign the CLA

Bobholamovic previously approved these changes Dec 3, 2025
@luotao1
Collaborator

luotao1 commented Dec 5, 2025

Please solve the conflict

Development

Successfully merging this pull request may close these issues.

return_word_box parameter with unexpected behavior

4 participants