This repository was archived by the owner on Jun 30, 2025. It is now read-only.
feat(config): added the possibility to configure tesseract ocr #2
Merged
16 files renamed without changes.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,7 @@ | ||
| import type { PartialExtractorConfig } from '../../src/types'; | ||
|
|
||
| export const config: PartialExtractorConfig = { | ||
| tesseract: { | ||
| languages: ['fra'], | ||
| }, | ||
| }; |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,20 @@ | ||
| Vous savez, moi je ne crois pas qu'il y ait de bonne ou de | ||
| mauvaise situation. Moi, si je devais résumer ma vie | ||
| aujourd’hui avec vous, je dirais que c'est d'abord des | ||
| rencontres. Des gens qui m'ont tendu la main, peut-être à | ||
| un moment où je ne pouvais pas, où j'étais seul chez moi. | ||
| Et c'est assez curieux de se dire que les hasards, les | ||
| rencontres forgent une destinée... Parce que quand on a le | ||
| goût de la chose, quand on a le goût de la chose bien | ||
| faite, le beau geste, parfois on ne trouve pas | ||
| l'interlocuteur en face je dirais, le miroir qui vous aide à | ||
| avancer. Alors ça n’est pas mon cas, comme je disais là, | ||
| puisque moi au contraire, j'ai pu ; et je dis merci à la vie, | ||
| je lui dis merci, je chante la vie, je danse la vie... je ne suis | ||
| qu'amour ! Et finalement, quand des gens me disent « | ||
| Mais comment fais-tu pour avoir cette humanité ? », je | ||
| leur réponds très simplement que c'est ce goût de | ||
| l'amour, ce goût donc qui m'a poussé aujourd'hui à | ||
| entreprendre une construction mécanique... mais demain | ||
| qui sait ? Peut-être simplement à me mettre au service de | ||
| la communauté, à faire le don, le don de soi. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,20 @@ | ||
| import { describe, expect, test } from 'vitest'; | ||
| import { parseConfig } from './config'; | ||
|
|
||
| describe('config', () => { | ||
| describe('parseConfig', () => { | ||
| test('a non supported language for tesseract raises an error', () => { | ||
| expect(() => parseConfig({ rawConfig: { tesseract: { languages: ['invalid'] } } })).toThrow('Invalid languages for tesseract: invalid. Valid languages are: afr, amh, ara, asm, aze, aze_cyrl, bel, ben, bod, bos, bul, cat, ceb, ces, chi_sim, chi_tra, chr, cym, dan, deu, dzo, ell, eng, enm, epo, est, eus, fas, fin, fra, frk, frm, gle, glg, grc, guj, hat, heb, hin, hrv, hun, iku, ind, isl, ita, ita_old, jav, jpn, kan, kat, kat_old, kaz, khm, kir, kor, kur, lao, lat, lav, lit, mal, mar, mkd, mlt, msa, mya, nep, nld, nor, ori, pan, pol, por, pus, ron, rus, san, sin, slk, slv, spa, spa_old, sqi, srp, srp_latn, swa, swe, syr, tam, tel, tgk, tgl, tha, tir, tur, uig, ukr, urd, uzb, uzb_cyrl, vie, yid'); | ||
| }); | ||
|
|
||
| test('when the ocr language is not specified, undefined or empty array, the default `eng` is used', () => { | ||
| const { config } = parseConfig({ rawConfig: { tesseract: { languages: [] } } }); | ||
| expect(config.tesseract.languages).to.eql(['eng']); | ||
| }); | ||
|
|
||
| test('the ocr language can be a single language', () => { | ||
| const { config } = parseConfig({ rawConfig: { tesseract: { languages: ['fra'] } } }); | ||
| expect(config.tesseract.languages).to.eql(['fra']); | ||
| }); | ||
| }); | ||
| }); |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,21 @@ | ||
| import type { ExtractorConfig, PartialExtractorConfig } from './types'; | ||
| import { languages as tesseractLanguages } from 'tesseract.js'; | ||
|
|
||
| const languages = Object.values(tesseractLanguages); | ||
|
|
||
| export function parseConfig({ rawConfig = {} }: { rawConfig?: PartialExtractorConfig } = {}): { config: ExtractorConfig } { | ||
| const ocrLanguages = rawConfig.tesseract?.languages ?? []; | ||
| const invalidLanguages = ocrLanguages.filter(language => !languages.includes(language)); | ||
|
|
||
| if (invalidLanguages.length > 0) { | ||
| throw new Error(`Invalid languages for tesseract: ${invalidLanguages.join(', ')}. Valid languages are: ${languages.join(', ')}`); | ||
| } | ||
|
|
||
| return { | ||
| config: { | ||
| tesseract: { | ||
| languages: ocrLanguages.length > 0 ? ocrLanguages : ['eng'], | ||
| }, | ||
| }, | ||
| }; | ||
| } |
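The validation and defaulting behaviour of `parseConfig` can be tried without `tesseract.js` installed. The snippet below is a minimal, self-contained sketch: `KNOWN_LANGUAGES` is a hypothetical three-entry stand-in for `Object.values(languages)`, and `parseConfigSketch`, `SketchConfig` and `PartialSketchConfig` are illustrative names that are not part of this PR.

```typescript
// Hypothetical stand-in for Object.values(languages) from tesseract.js.
const KNOWN_LANGUAGES: readonly string[] = ['eng', 'fra', 'deu'];

type SketchConfig = { tesseract: { languages: string[] } };
type PartialSketchConfig = undefined | { tesseract?: { languages?: string[] } };

// Mirrors the shape of parseConfig from the diff: validate first, then default.
function parseConfigSketch({ rawConfig = {} }: { rawConfig?: PartialSketchConfig } = {}): { config: SketchConfig } {
  const ocrLanguages = rawConfig.tesseract?.languages ?? [];
  const invalidLanguages = ocrLanguages.filter(language => !KNOWN_LANGUAGES.includes(language));

  if (invalidLanguages.length > 0) {
    throw new Error(`Invalid languages for tesseract: ${invalidLanguages.join(', ')}. Valid languages are: ${KNOWN_LANGUAGES.join(', ')}`);
  }

  return { config: { tesseract: { languages: ocrLanguages.length > 0 ? ocrLanguages : ['eng'] } } };
}

// An empty list falls back to English; a valid list is passed through unchanged.
const defaulted = parseConfigSketch({ rawConfig: { tesseract: { languages: [] } } }).config;
const french = parseConfigSketch({ rawConfig: { tesseract: { languages: ['fra'] } } }).config;
console.log(defaulted.tesseract.languages, french.tesseract.languages);
```

Because validation runs before the default is applied, an unknown code is reported even when it appears next to valid ones.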
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,9 +1,11 @@ | ||
| import type { ExtractorConfig } from './types'; | ||
|
|
||
| export type ExtractorDefinition = ReturnType<typeof defineTextExtractor>; | ||
|
|
||
| export function defineTextExtractor(args: { | ||
| name: string; | ||
| mimeTypes: string[]; | ||
| - extract: (args: { arrayBuffer: ArrayBuffer }) => Promise<{ content: string }>; | ||
| + extract: (args: { arrayBuffer: ArrayBuffer; config: ExtractorConfig }) => Promise<{ content: string }>; | ||
| }) { | ||
| return args; | ||
| } |
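To illustrate the widened `extract` signature, here is a hedged sketch of an extractor that receives the parsed config. `demoExtractor` is hypothetical and performs no OCR; it only shows that `config.tesseract.languages` is now available inside `extract`. The `ExtractorConfig` type and `defineTextExtractor` helper are copied locally so the snippet stands alone.

```typescript
type ExtractorConfig = { tesseract: { languages: string[] } };

// Same identity helper as in the diff: it only exists to type-check its argument.
function defineTextExtractor(args: {
  name: string;
  mimeTypes: string[];
  extract: (args: { arrayBuffer: ArrayBuffer; config: ExtractorConfig }) => Promise<{ content: string }>;
}) {
  return args;
}

// Hypothetical extractor: no real OCR happens here. It decodes bytes as
// single-byte characters and prefixes the languages it was configured with.
const demoExtractor = defineTextExtractor({
  name: 'demo',
  mimeTypes: ['text/plain'],
  extract: async ({ arrayBuffer, config }) => {
    const text = Array.from(new Uint8Array(arrayBuffer), byte => String.fromCharCode(byte)).join('');
    return { content: `[${config.tesseract.languages.join('+')}] ${text}` };
  },
});

// Build a two-byte buffer containing "Hi" and run the extractor with a French config.
const demoBuffer = new ArrayBuffer(2);
new Uint8Array(demoBuffer).set([72, 105]);
const demoResult = demoExtractor.extract({ arrayBuffer: demoBuffer, config: { tesseract: { languages: ['fra'] } } });
demoResult.then(({ content }) => console.log(content));
```

Passing the config as a field of the existing argument object, rather than as a second positional parameter, means extractors that do not need it can simply not destructure it.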
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,9 @@ | ||
| import type { DeepPartial } from '@corentinth/chisels'; | ||
|
|
||
| export type ExtractorConfig = { | ||
| tesseract: { | ||
| languages: string[]; | ||
| }; | ||
| }; | ||
|
|
||
| export type PartialExtractorConfig = undefined | DeepPartial<ExtractorConfig>; |
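The following sketch shows which raw values a type like `PartialExtractorConfig` admits. `DeepPartialSketch` is a hypothetical recursive mapped type written for this example; the actual `DeepPartial` exported by `@corentinth/chisels` may be defined differently. `readLanguages` is likewise an illustrative helper, not part of the PR.

```typescript
// Hypothetical recursive partial type; the real DeepPartial from
// @corentinth/chisels may be defined differently.
type DeepPartialSketch<T> = T extends object ? { [K in keyof T]?: DeepPartialSketch<T[K]> } : T;

type SketchExtractorConfig = { tesseract: { languages: string[] } };
type SketchPartialExtractorConfig = undefined | DeepPartialSketch<SketchExtractorConfig>;

// Every one of these values type-checks as a SketchPartialExtractorConfig.
const accepted: SketchPartialExtractorConfig[] = [
  undefined,
  {},
  { tesseract: {} },
  { tesseract: { languages: ['fra'] } },
];

// Hypothetical helper: read the languages, tolerating any missing level.
function readLanguages(raw: SketchPartialExtractorConfig): string[] {
  const candidates: unknown[] = raw?.tesseract?.languages ?? [];
  return candidates.filter((language): language is string => typeof language === 'string');
}

console.log(accepted.map(readLanguages));
```

Allowing `undefined` at the top level lets callers omit the configuration entirely, which matches the `rawConfig = {}` default in `parseConfig`.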
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,7 @@ | ||
| export * from 'tesseract.js'; | ||
|
|
||
| declare module 'tesseract.js' { | ||
| type LanguageKey = 'AFR' | 'AMH' | 'ARA' | 'ASM' | 'AZE' | 'AZE_CYRL' | 'BEL' | 'BEN' | 'BOD' | 'BOS' | 'BUL' | 'CAT' | 'CEB' | 'CES' | 'CHI_SIM' | 'CHI_TRA' | 'CHR' | 'CYM' | 'DAN' | 'DEU' | 'DZO' | 'ELL' | 'ENG' | 'ENM' | 'EPO' | 'EST' | 'EUS' | 'FAS' | 'FIN' | 'FRA' | 'FRK' | 'FRM' | 'GLE' | 'GLG' | 'GRC' | 'GUJ' | 'HAT' | 'HEB' | 'HIN' | 'HRV' | 'HUN' | 'IKU' | 'IND' | 'ISL' | 'ITA' | 'ITA_OLD' | 'JAV' | 'JPN' | 'KAN' | 'KAT' | 'KAT_OLD' | 'KAZ' | 'KHM' | 'KIR' | 'KOR' | 'KUR' | 'LAO' | 'LAT' | 'LAV' | 'LIT' | 'MAL' | 'MAR' | 'MKD' | 'MLT' | 'MSA' | 'MYA' | 'NEP' | 'NLD' | 'NOR' | 'ORI' | 'PAN' | 'POL' | 'POR' | 'PUS' | 'RON' | 'RUS' | 'SAN' | 'SIN' | 'SLK' | 'SLV' | 'SPA' | 'SPA_OLD' | 'SQI' | 'SRP' | 'SRP_LATN' | 'SWA' | 'SWE' | 'SYR' | 'TAM' | 'TEL' | 'TGK' | 'TGL' | 'THA' | 'TIR' | 'TUR' | 'UIG' | 'UKR' | 'URD' | 'UZB' | 'UZB_CYRL' | 'VIE' | 'YID'; | ||
|
|
||
| export const languages: Record<LanguageKey, string>; | ||
| } |
[nitpick] Consider using the globally imported 'expect' instead of destructuring it from the test context in the test.concurrent callback, as this pattern may be non-standard for Vitest.