Skip to content

Conversation

@Ihebdhouibi
Copy link
Contributor

Pull Request: Fix Auto-Splitting of French Accented Words in Text Recognition

📋 Summary

This PR fixes a bug where French words containing accented characters (é, è, à, ç, etc.) and contractions (n'êtes, l'été) were incorrectly split at each accented character during OCR text recognition word grouping.

🐛 Problem Description

Issue

The BaseRecLabelDecode.get_word_info() method in ppocr/postprocess/rec_postprocess.py only recognized basic ASCII letters (a-z, A-Z) as word characters. Accented characters used in French and other Latin-based languages were incorrectly classified as "splitters", causing words to be broken apart.

Example of the Bug

Before the fix:

  • Input: "été" (summer)

  • Output: 3 separate words: ["é", "t", "é"]

  • Input: "français" (French)

  • Output: 3 separate words: ["fran", "ç", "ais"]

  • Input: "n'êtes" (you are)

  • Output: 3 separate words: ["n", "'", "êtes"]

After the fix:

  • Input: "été" → Output: 1 word: ["été"]
  • Input: "français" → Output: 1 word: ["français"]
  • Input: "n'êtes" → Output: 1 word: ["n'êtes"]

✨ Solution

Changes Made

  1. Added unicodedata import for Unicode character category detection
  2. Implemented is_latin_char() helper function that properly identifies Latin letters with diacritics
  3. Modified get_word_info() method to include accented characters in word grouping logic
  4. Added apostrophe handling for French contractions

Technical Details

The fix uses Python's unicodedata module to check if a character belongs to the Letter category (L*) and has a Latin or French-based Unicode name. This ensures that characters like:

  • é (LATIN SMALL LETTER E WITH ACUTE)
  • è (LATIN SMALL LETTER E WITH GRAVE)
  • à (LATIN SMALL LETTER A WITH GRAVE)
  • ç (LATIN SMALL LETTER C WITH CEDILLA)

...are correctly recognized as word characters.

📁 Files Modified

Core Changes

  • ppocr/postprocess/rec_postprocess.py
    • Added unicodedata import
    • Added is_latin_char() function
    • Modified BaseRecLabelDecode.get_word_info() method

Test Files

  • test_french_accents.py (new)
    • Comprehensive test suite for French accented character handling
    • Tests various scenarios: simple accents, contractions, mixed text

🧪 Testing

Test Coverage

The included test script validates:

  • Simple accented words: été, élève
  • Words with ç: français
  • Contractions with apostrophes: n'êtes, C'était
  • Words with à: à demain
  • Complex sentences with multiple accents

Running Tests

python test_french_accents.py

🔄 Backward Compatibility

Fully backward compatible

This fix:

  • Only adds new functionality (recognition of accented characters)
  • Does not change behavior for existing ASCII text
  • Does not modify the API or function signatures
  • Uses standard library (unicodedata) - no new dependencies

All existing functionality remains unchanged. Code that worked before will continue to work exactly as before, with the added benefit of proper French (and other Latin-based language) support.

🌍 Impact

Languages Benefited

This fix improves OCR text recognition for all Latin-based languages that use diacritics, including:

  • French: é, è, ê, à, â, ù, û, ç, ï, etc.
  • Spanish: á, é, í, ó, ú, ñ, ü
  • Portuguese: ã, õ, á, é, í, ó, ú, â, ê, ô, ç
  • German: ä, ö, ü, ß
  • Italian: à, è, é, ì, ò, ù
  • And many more...

Use Cases

  • Document digitization in French-speaking regions
  • Multilingual OCR applications
  • Legal and administrative document processing
  • Educational material processing
  • International business document handling

📊 Performance Impact

Negligible performance impact:

  • The is_latin_char() function is only called for non-ASCII characters
  • Uses efficient unicodedata standard library functions
  • No additional loops or complex operations
  • Same time complexity as the original implementation

🔍 Code Quality

✅ Passes all pre-commit hooks:

  • black (code formatting)
  • flake8 (linting)
  • trailing whitespace check
  • line ending normalization

📝 Related Issues

This fix addresses the issue where French and other Latin-based language texts are incorrectly segmented during OCR post-processing, improving the accuracy and usability of PaddleOCR for international users.

✅ Checklist

  • Code follows project style guidelines
  • Self-review completed
  • Comments added for complex logic
  • No breaking changes
  • Test script included
  • Documentation updated (this PR doc)
  • All pre-commit hooks pass

🙏 Acknowledgments

This fix was developed and tested on real-world French OCR scenarios, ensuring practical applicability and effectiveness.


Ready for review and merge! 🚀

@CLAassistant
Copy link

CLAassistant commented Nov 6, 2025

CLA assistant check
All committers have signed the CLA.

@paddle-bot
Copy link

paddle-bot bot commented Nov 6, 2025

Thanks for your contribution!

@paddle-bot paddle-bot bot added the contrib/contributor Contributor-related discussion or task. label Nov 6, 2025
@Ihebdhouibi Ihebdhouibi force-pushed the fix-auto-split-french-words branch from 2113bca to c37b052 Compare November 7, 2025 08:23
@Ihebdhouibi
Copy link
Contributor Author

The test failure appears to be unrelated to this PR. The error is: ModuleNotFoundError: No module named 'langchain.docstore'

This is occurring in PaddleX's retriever module (paddlex/inference/pipelines/components/retriever/base.py:25), which is trying to import:

from langchain.docstore.document import Document

This import path was deprecated in langchain and moved to:

from langchain_core.documents import Document

Refer to API reference : https://reference.langchain.com/python/integrations/langchain_google_community/?h=document#langchain_google_community.DocumentAIWarehouseRetriever

@Ihebdhouibi
Copy link
Contributor Author

@luotao1 The issue is that the test-pr-gpu is failing because of wrong import at paddlex/inference/pipelines/components/retriever/base.py

It should be
from langchain_core.documents import Document

Instead of
from langchain.docstore.document import Document

Should I fix that as well ?

@liuhongen1234567
Copy link
Collaborator

LGTM

@Ihebdhouibi
Copy link
Contributor Author

LGTM

Yeah, issue is in langchain import as I mentioned

Should i launch a seperate PR for it ?

Added support for Latin characters with diacritics (é, è, à, ç, etc.) and French contractions (n'êtes) in word grouping logic of BaseRecLabelDecode.get_word_info().

This fix ensures that French words are no longer split at accented characters during OCR text recognition.
@Ihebdhouibi Ihebdhouibi force-pushed the fix-auto-split-french-words branch from c37b052 to 5fd1f25 Compare December 2, 2025 15:20
@luotao1
Copy link
Collaborator

luotao1 commented Dec 3, 2025

Should i launch a seperate PR for it ?

Yes, you could launch a seperate PR

- Moved test_french_accents.py to tests/ directory following project structure
- Removed invalid 'FRENCH' prefix from Unicode name check
- Unicode standard only uses 'LATIN' prefix for all Latin-based characters
- All French accented characters (é, è, à, ç, etc.) are correctly matched
- Verified with comprehensive character set including uppercase/lowercase variants
@Ihebdhouibi
Copy link
Contributor Author

@GreatV Thank you for the review! I've addressed both issues:

Changes Made:

  1. Moved test file: test_french_accents.pytests/test_french_accents.py (following project structure)
  2. Fixed Unicode check: Removed invalid "FRENCH" prefix from Unicode name validation

Examples of actual Unicode names:

  • é = LATIN SMALL LETTER E WITH ACUTE
  • à = LATIN SMALL LETTER A WITH GRAVE
  • ç = LATIN SMALL LETTER C WITH CEDILLA
  • œ = LATIN SMALL LIGATURE OE

Verification:

Tested with comprehensive French character set (32 characters including uppercase/lowercase variants):

éèêëàâäçùûüïîôöœÉÈÊËÀÂÄÇÙÛÜÏÎÔÖŒ

Copy link
Collaborator

@GreatV GreatV left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current test file contains too many emojis, which is inconsistent with our project's style. I suggest removing them.

@Ihebdhouibi
Copy link
Contributor Author

The current test file contains too many emojis, which is inconsistent with our project's style. I suggest removing them.

Done

Copy link
Collaborator

@GreatV GreatV left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@GreatV GreatV merged commit 7ec94e7 into PaddlePaddle:main Dec 4, 2025
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

contrib/contributor Contributor-related discussion or task. contributor

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants