Fix: Prevent auto-splitting of French accented words in text recognition #16994

Ihebdhouibi · 2025-11-06T10:26:08Z

Pull Request: Fix Auto-Splitting of French Accented Words in Text Recognition

📋 Summary

This PR fixes a bug where French words containing accented characters (é, è, à, ç, etc.) and contractions (n'êtes, l'été) were incorrectly split at each accented character during OCR text recognition word grouping.

🐛 Problem Description

Issue

The BaseRecLabelDecode.get_word_info() method in ppocr/postprocess/rec_postprocess.py only recognized basic ASCII letters (a-z, A-Z) as word characters. Accented characters used in French and other Latin-based languages were incorrectly classified as "splitters", causing words to be broken apart.

Example of the Bug

Before the fix:

Input: "été" (summer)
Output: 3 separate words: ["é", "t", "é"] ❌
Input: "français" (French)
Output: 3 separate words: ["fran", "ç", "ais"] ❌
Input: "n'êtes" (you are)
Output: 3 separate words: ["n", "'", "êtes"] ❌

After the fix:

Input: "été" → Output: 1 word: ["été"] ✅
Input: "français" → Output: 1 word: ["français"] ✅
Input: "n'êtes" → Output: 1 word: ["n'êtes"] ✅

✨ Solution

Changes Made

Added unicodedata import for Unicode character category detection
Implemented is_latin_char() helper function that properly identifies Latin letters with diacritics
Modified get_word_info() method to include accented characters in word grouping logic
Added apostrophe handling for French contractions
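
The grouping change listed above can be sketched in isolation. This is a simplified stand-in, not the actual get_word_info() code: group_words(), is_word_char(), and the APOSTROPHES tuple are hypothetical names, and the sketch ignores the column positions and CJK handling that the real method deals with. It assumes an apostrophe joins a word only when it sits between two word characters.

```python
import unicodedata

# Assumption: both the ASCII and the typographic apostrophe join contractions.
APOSTROPHES = ("'", "\u2019")


def is_word_char(ch: str) -> bool:
    """ASCII letters/digits, or any letter whose Unicode name starts with LATIN."""
    if ch.isascii():
        return ch.isalnum()
    is_letter = unicodedata.category(ch).startswith("L")
    return is_letter and unicodedata.name(ch, "").startswith("LATIN")


def group_words(text: str) -> list:
    """Group characters into words; everything else acts as a splitter."""
    words = []
    current = []
    for i, ch in enumerate(text):
        next_is_word = i + 1 < len(text) and is_word_char(text[i + 1])
        if is_word_char(ch):
            current.append(ch)
        elif ch in APOSTROPHES and current and next_is_word:
            # Keep the apostrophe only when it sits inside a word (n'êtes).
            current.append(ch)
        elif current:
            # Any other character closes the word in progress.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words
```

With this sketch, group_words("C'était l'été") yields ["C'était", "l'été"], while a leading or trailing quote mark is still treated as a splitter.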

Technical Details

The fix uses Python's unicodedata module to check if a character belongs to the Letter category (L*) and has a Latin or French-based Unicode name. This ensures that characters like:

é (LATIN SMALL LETTER E WITH ACUTE)
è (LATIN SMALL LETTER E WITH GRAVE)
à (LATIN SMALL LETTER A WITH GRAVE)
ç (LATIN SMALL LETTER C WITH CEDILLA)

...are correctly recognized as word characters.
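
A minimal sketch of such a helper, under assumptions: the function name matches the one in this PR, but the body below is an illustration rather than the merged code. It checks only the LATIN name prefix, and the ASCII fast path is an assumption drawn from the performance notes.

```python
import unicodedata


def is_latin_char(ch: str) -> bool:
    """Return True for a single Latin letter, with or without diacritics."""
    if len(ch) != 1:
        return False
    if ch.isascii():
        # Fast path: plain a-z / A-Z need no Unicode database lookup.
        return ch.isalpha()
    # The general category must be a Letter (Lu, Ll, Lt, Lm, Lo).
    if not unicodedata.category(ch).startswith("L"):
        return False
    # unicodedata.name() raises ValueError for unnamed code points unless
    # a default is given, so pass an empty string as the fallback.
    return unicodedata.name(ch, "").startswith("LATIN")
```

Checking the category first keeps digits and punctuation out, and the name check keeps letters from other scripts (Greek, Cyrillic, CJK) out of the Latin word group.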

📁 Files Modified

Core Changes

ppocr/postprocess/rec_postprocess.py
- Added unicodedata import
- Added is_latin_char() function
- Modified BaseRecLabelDecode.get_word_info() method

Test Files

test_french_accents.py (new)
- Comprehensive test suite for French accented character handling
- Tests various scenarios: simple accents, contractions, mixed text

🧪 Testing

Test Coverage

The included test script validates:

Simple accented words: été, élève
Words with ç: français
Contractions with apostrophes: n'êtes, C'était
Words with à: à demain
Complex sentences with multiple accents

Running Tests

python test_french_accents.py

🔄 Backward Compatibility

✅ Fully backward compatible

This fix:

Only adds new functionality (recognition of accented characters)
Does not change behavior for existing ASCII text
Does not modify the API or function signatures
Uses standard library (unicodedata) - no new dependencies

All existing functionality remains unchanged. Code that worked before will continue to work exactly as before, with the added benefit of proper French (and other Latin-based language) support.

🌍 Impact

Languages Benefited

This fix improves OCR text recognition for all Latin-based languages that use diacritics, including:

French: é, è, ê, à, â, ù, û, ç, ï, etc.
Spanish: á, é, í, ó, ú, ñ, ü
Portuguese: ã, õ, á, é, í, ó, ú, â, ê, ô, ç
German: ä, ö, ü, ß
Italian: à, è, é, ì, ò, ù
And many more...

Use Cases

Document digitization in French-speaking regions
Multilingual OCR applications
Legal and administrative document processing
Educational material processing
International business document handling

📊 Performance Impact

Negligible performance impact:

The is_latin_char() function is only called for non-ASCII characters
Uses efficient unicodedata standard library functions
No additional loops or complex operations
Same time complexity as the original implementation

🔍 Code Quality

✅ Passes all pre-commit hooks:

black (code formatting)
flake8 (linting)
trailing whitespace check
line ending normalization

📝 Related Issues

This fix addresses the issue where French and other Latin-based language texts are incorrectly segmented during OCR post-processing, improving the accuracy and usability of PaddleOCR for international users.

✅ Checklist

🙏 Acknowledgments

This fix was developed and tested on real-world French OCR scenarios, ensuring practical applicability and effectiveness.

Ready for review and merge! 🚀

CLAassistant · 2025-11-06T10:26:16Z

All committers have signed the CLA.

paddle-bot · 2025-11-06T10:26:16Z

Thanks for your contribution!

Ihebdhouibi · 2025-11-25T13:00:46Z

The test failure appears to be unrelated to this PR. The error is: ModuleNotFoundError: No module named 'langchain.docstore'

This is occurring in PaddleX's retriever module (paddlex/inference/pipelines/components/retriever/base.py:25), which is trying to import:

from langchain.docstore.document import Document

This import path was deprecated in langchain and moved to:

from langchain_core.documents import Document

Refer to the API reference: https://reference.langchain.com/python/integrations/langchain_google_community/?h=document#langchain_google_community.DocumentAIWarehouseRetriever
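
One way to tolerate both import paths is a fallback lookup, sketched below. This is not the PaddleX fix itself: resolve_document_class() and CANDIDATE_MODULES are hypothetical names, and the sketch only assumes the two module paths quoted above.

```python
import importlib

# Module paths quoted in this thread, newest first.
CANDIDATE_MODULES = (
    "langchain_core.documents",
    "langchain.docstore.document",
)


def resolve_document_class():
    """Return the first importable Document class, or None if none is found."""
    for module_name in CANDIDATE_MODULES:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # ModuleNotFoundError is a subclass of ImportError.
            continue
        document_cls = getattr(module, "Document", None)
        if document_cls is not None:
            return document_cls
    return None
```

Returning None instead of raising lets the caller decide whether a missing langchain installation is fatal.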

Ihebdhouibi · 2025-12-01T13:15:30Z

@luotao1 The test-pr-gpu check is failing because of a wrong import in paddlex/inference/pipelines/components/retriever/base.py

It should be
from langchain_core.documents import Document

Instead of
from langchain.docstore.document import Document

Should I fix that as well?

liuhongen1234567 · 2025-12-02T03:02:00Z

LGTM

Ihebdhouibi · 2025-12-02T14:12:46Z

LGTM

Yes, the issue is in the langchain import, as I mentioned.

Should I launch a separate PR for it?

tests/test_french_accents.py

ppocr/postprocess/rec_postprocess.py

Added support for Latin characters with diacritics (é, è, à, ç, etc.) and French contractions (n'êtes) in word grouping logic of BaseRecLabelDecode.get_word_info(). This fix ensures that French words are no longer split at accented characters during OCR text recognition.

luotao1 · 2025-12-03T02:06:39Z

Should I launch a separate PR for it?

Yes, you could launch a separate PR.

- Moved test_french_accents.py to tests/ directory following project structure
- Removed invalid 'FRENCH' prefix from Unicode name check
- Unicode standard only uses 'LATIN' prefix for all Latin-based characters
- All French accented characters (é, è, à, ç, etc.) are correctly matched
- Verified with comprehensive character set including uppercase/lowercase variants

Ihebdhouibi · 2025-12-03T14:42:42Z

@GreatV Thank you for the review! I've addressed both issues:

Changes Made:

✅ Moved test file: test_french_accents.py → tests/test_french_accents.py (following project structure)
✅ Fixed Unicode check: Removed invalid "FRENCH" prefix from Unicode name validation

Examples of actual Unicode names:

é = LATIN SMALL LETTER E WITH ACUTE
à = LATIN SMALL LETTER A WITH GRAVE
ç = LATIN SMALL LETTER C WITH CEDILLA
œ = LATIN SMALL LIGATURE OE

Verification:

Tested with comprehensive French character set (32 characters including uppercase/lowercase variants):

éèêëàâäçùûüïîôöœÉÈÊËÀÂÄÇÙÛÜÏÎÔÖŒ
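
The verification above can be reproduced with a short check, sketched here. The helper name non_latin_names() is hypothetical, and the string is normalized to NFC on the assumption that each accented letter should be a single precomposed code point.

```python
import unicodedata

# The 32 characters from the verification list, forced to precomposed form.
FRENCH_CHARS = unicodedata.normalize(
    "NFC", "éèêëàâäçùûüïîôöœÉÈÊËÀÂÄÇÙÛÜÏÎÔÖŒ"
)


def non_latin_names(chars: str) -> list:
    """Return (char, name) pairs whose Unicode name lacks the LATIN prefix."""
    offenders = []
    for ch in chars:
        name = unicodedata.name(ch, "<unnamed>")
        if not name.startswith("LATIN"):
            offenders.append((ch, name))
    return offenders


if __name__ == "__main__":
    # An empty list means every character is matched by the LATIN prefix.
    print(len(FRENCH_CHARS), non_latin_names(FRENCH_CHARS))
```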

GreatV

The current test file contains too many emojis, which is inconsistent with our project's style. I suggest removing them.

Ihebdhouibi · 2025-12-04T09:51:10Z

The current test file contains too many emojis, which is inconsistent with our project's style. I suggest removing them.

Done

GreatV

LGTM

paddle-bot bot added the contrib/contributor label (Contributor-related discussion or task) Nov 6, 2025

Ihebdhouibi force-pushed the fix-auto-split-french-words branch from 2113bca to c37b052 Compare November 7, 2025 08:23

paddle-bot bot added the contributor label Nov 12, 2025

luotao1 assigned luotao1 and liuhongen1234567 Nov 25, 2025

liuhongen1234567 requested a review from GreatV December 2, 2025 03:01

GreatV reviewed Dec 2, 2025

tests/test_french_accents.py (resolved)

ppocr/postprocess/rec_postprocess.py (outdated, resolved)

Ihebdhouibi force-pushed the fix-auto-split-french-words branch from c37b052 to 5fd1f25 Compare December 2, 2025 15:20

Ihebdhouibi added 2 commits December 3, 2025 15:29

moved test file and fix some style errors

608e1b4

GreatV reviewed Dec 4, 2025

style: Remove emojis from test file to maintain project code style

f53dfb6

GreatV approved these changes Dec 4, 2025

GreatV merged commit 7ec94e7 into PaddlePaddle:main Dec 4, 2025
4 checks passed
