Fix: Prevent auto-splitting of French accented words in text recognition #16994
Thanks for your contribution!
2113bca to c37b052
The test failure appears to be unrelated to this PR. The error is: ModuleNotFoundError: No module named 'langchain.docstore'. It occurs in PaddleX's retriever module (from langchain.docstore.document import Document). This import path was deprecated in langchain and has since moved. Refer to the API reference: https://reference.langchain.com/python/integrations/langchain_google_community/?h=document#langchain_google_community.DocumentAIWarehouseRetriever
@luotao1 The test-pr-gpu check is failing because of a wrong import in paddlex/inference/pipelines/components/retriever/base.py. The import there needs to be replaced. Should I fix that as well?
LGTM
Yeah, the issue is in the langchain import, as I mentioned. Should I launch a separate PR for it?
Added support for Latin characters with diacritics (é, è, à, ç, etc.) and French contractions (n'êtes) in word grouping logic of BaseRecLabelDecode.get_word_info(). This fix ensures that French words are no longer split at accented characters during OCR text recognition.
c37b052 to 5fd1f25
Yes, you could launch a separate PR.
- Moved test_french_accents.py to tests/ directory following project structure
- Removed invalid 'FRENCH' prefix from Unicode name check
- Unicode standard only uses 'LATIN' prefix for all Latin-based characters
- All French accented characters (é, è, à, ç, etc.) are correctly matched
- Verified with comprehensive character set including uppercase/lowercase variants
@GreatV Thank you for the review! I've addressed both issues: Changes Made:
Examples of actual Unicode names:
Verification: Tested with a comprehensive French character set (32 characters including uppercase/lowercase variants): éèêëàâäçùûüïîôöœÉÈÊËÀÂÄÇÙÛÜÏÎÔÖŒ
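The Unicode names in question can be inspected with Python's standard unicodedata module. The snippet below is only an illustrative check, not the PR's test file; the variable names are made up for the example. It looks up the name of each character in the 32-character set above and confirms that every name starts with LATIN.

```python
import unicodedata

# The 32-character French set quoted in the comment above.
FRENCH_CHARS = "éèêëàâäçùûüïîôöœÉÈÊËÀÂÄÇÙÛÜÏÎÔÖŒ"

# Map each character to its official Unicode name.
names = {ch: unicodedata.name(ch) for ch in FRENCH_CHARS}

for ch in "éçœÀ":
    print(ch, unicodedata.category(ch), names[ch])
# é Ll LATIN SMALL LETTER E WITH ACUTE
# ç Ll LATIN SMALL LETTER C WITH CEDILLA
# œ Ll LATIN SMALL LIGATURE OE
# À Lu LATIN CAPITAL LETTER A WITH GRAVE

# True when every character in the set carries a LATIN-prefixed name.
all_latin = all(name.startswith("LATIN") for name in names.values())
print(all_latin)
```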
GreatV
left a comment
The current test file contains too many emojis, which is inconsistent with our project's style. I suggest removing them.
Done
GreatV
left a comment
LGTM
Pull Request: Fix Auto-Splitting of French Accented Words in Text Recognition
📋 Summary
This PR fixes a bug where French words containing accented characters (é, è, à, ç, etc.) and contractions (n'êtes, l'été) were incorrectly split at each accented character during OCR text recognition word grouping.
🐛 Problem Description
Issue
The BaseRecLabelDecode.get_word_info() method in ppocr/postprocess/rec_postprocess.py only recognized basic ASCII letters (a-z, A-Z) as word characters. Accented characters used in French and other Latin-based languages were incorrectly classified as "splitters", causing words to be broken apart.
Example of the Bug
Before the fix:
Input: "été" (summer)
Output: 3 separate words: ["é", "t", "é"] ❌
Input: "français" (French)
Output: 3 separate words: ["fran", "ç", "ais"] ❌
Input: "n'êtes" (you are)
Output: 3 separate words: ["n", "'", "êtes"] ❌
After the fix:
"été" → Output: 1 word: ["été"] ✅
"français" → Output: 1 word: ["français"] ✅
"n'êtes" → Output: 1 word: ["n'êtes"] ✅
✨ Solution
Changes Made
- unicodedata import for Unicode character category detection
- is_latin_char() helper function that properly identifies Latin letters with diacritics
- get_word_info() method updated to include accented characters in word grouping logic
Technical Details
The fix uses Python's unicodedata module to check whether a character belongs to the Letter category (L*) and has a Latin or French-based Unicode name. This ensures that characters like: ... are correctly recognized as word characters.
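As a rough illustration of the idea described here, the sketch below shows one way such a check and a simplified word grouping could look. It is not the code from rec_postprocess.py: is_word_letter() and group_words() are hypothetical helpers invented for this example, the real get_word_info() tracks more state, and the name check uses only the LATIN prefix, in line with the follow-up commit noted in the conversation. The sketch also assumes a plain ASCII apostrophe.

```python
import unicodedata
from typing import List


def is_latin_char(ch: str) -> bool:
    """Return True for a single Latin letter, with or without diacritics."""
    if len(ch) != 1:
        return False
    # Letter categories are Lu, Ll, Lt, Lm and Lo; all start with "L".
    if not unicodedata.category(ch).startswith("L"):
        return False
    # Latin letters have names starting with "LATIN"; unnamed code points give "".
    return unicodedata.name(ch, "").startswith("LATIN")


def is_word_letter(ch: str) -> bool:
    """Hypothetical helper: ASCII fast path, unicodedata lookup otherwise."""
    if ch.isascii():
        return ch.isalpha()
    return is_latin_char(ch)


def group_words(text: str) -> List[str]:
    """Hypothetical, simplified grouping: split on non-word characters."""
    words: List[str] = []
    current = ""
    for i, ch in enumerate(text):
        if ch == "'":
            # Keep an apostrophe only when it sits between two letters (n'êtes).
            keep = (
                0 < i < len(text) - 1
                and is_word_letter(text[i - 1])
                and is_word_letter(text[i + 1])
            )
        else:
            keep = is_word_letter(ch) or ch.isdigit()
        if keep:
            current += ch
        elif current:
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


print(group_words("été"))       # ['été']
print(group_words("n'êtes"))    # ["n'êtes"]
print(group_words("à demain"))  # ['à', 'demain']
```

One limitation of this sketch: it expects precomposed characters. A decomposed "é" (the letter e followed by a combining acute accent) would still be split, because the combining mark is not in a Letter category; normalizing the input with unicodedata.normalize("NFC", text) beforehand would avoid that.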
📁 Files Modified
Core Changes
ppocr/postprocess/rec_postprocess.py
- unicodedata import
- is_latin_char() function
- BaseRecLabelDecode.get_word_info() method
Test Files
test_french_accents.py (new)
🧪 Testing
Test Coverage
The included test script validates:
- été, élève
- français
- n'êtes, C'était
- à demain
Running Tests
🔄 Backward Compatibility
✅ Fully backward compatible
This fix:
- Uses only the Python standard library (unicodedata), so there are no new dependencies
All existing functionality remains unchanged. Code that worked before will continue to work as before, with the added benefit of proper French (and other Latin-based language) support.
🌍 Impact
Languages Benefited
This fix improves OCR text recognition for all Latin-based languages that use diacritics, including:
Use Cases
📊 Performance Impact
Negligible performance impact:
- is_latin_char() function is only called for non-ASCII characters
- Uses unicodedata standard library functions
🔍 Code Quality
✅ Passes all pre-commit hooks:
📝 Related Issues
This fix addresses the issue where French and other Latin-based language texts are incorrectly segmented during OCR post-processing, improving the accuracy and usability of PaddleOCR for international users.
✅ Checklist
🙏 Acknowledgments
This fix was developed and tested on real-world French OCR scenarios, ensuring practical applicability and effectiveness.
Ready for review and merge! 🚀