This repository was archived by the owner on Jun 30, 2025. It is now read-only.
Merged
12 files renamed without changes.
7 changes: 7 additions & 0 deletions fixtures/009-png-with-french-text/009.config.ts
@@ -0,0 +1,7 @@
import type { PartialExtractorConfig } from '../../src/types';

export const config: PartialExtractorConfig = {
tesseract: {
languages: ['fra'],
},
};
20 changes: 20 additions & 0 deletions fixtures/009-png-with-french-text/009.expected.txt
@@ -0,0 +1,20 @@
Vous savez, moi je ne crois pas qu'il y ait de bonne ou de
mauvaise situation. Moi, si je devais résumer ma vie
aujourd’hui avec vous, je dirais que c'est d'abord des
rencontres. Des gens qui m'ont tendu la main, peut-être à
un moment où je ne pouvais pas, où j'étais seul chez moi.
Et c'est assez curieux de se dire que les hasards, les
rencontres forgent une destinée... Parce que quand on a le
goût de la chose, quand on a le goût de la chose bien
faite, le beau geste, parfois on ne trouve pas
l'interlocuteur en face je dirais, le miroir qui vous aide à
avancer. Alors ça n’est pas mon cas, comme je disais là,
puisque moi au contraire, j'ai pu ; et je dis merci à la vie,
je lui dis merci, je chante la vie, je danse la vie... je ne suis
qu'amour ! Et finalement, quand des gens me disent «
Mais comment fais-tu pour avoir cette humanité ? », je
leur réponds très simplement que c'est ce goût de
l'amour, ce goût donc qui m'a poussé aujourd'hui à
entreprendre une construction mécanique... mais demain
qui sait ? Peut-être simplement à me mettre au service de
la communauté, à faire le don, le don de soi.
Binary file added fixtures/009-png-with-french-text/009.input.png
1 change: 1 addition & 0 deletions package.json
@@ -46,6 +46,7 @@
"release": "bumpp --commit --tag --push"
},
"dependencies": {
+ "@corentinth/chisels": "^1.3.1",
"tesseract.js": "^6.0.0",
"unpdf": "^0.12.1"
},
8 changes: 8 additions & 0 deletions pnpm-lock.yaml

Some generated files are not rendered by default.

20 changes: 20 additions & 0 deletions src/config.test.ts
@@ -0,0 +1,20 @@
import { describe, expect, test } from 'vitest';
import { parseConfig } from './config';

describe('config', () => {
describe('parseConfig', () => {
test('a non supported language for tesseract raises an error', () => {
expect(() => parseConfig({ rawConfig: { tesseract: { languages: ['invalid'] } } })).toThrow('Invalid languages for tesseract: invalid. Valid languages are: afr, amh, ara, asm, aze, aze_cyrl, bel, ben, bod, bos, bul, cat, ceb, ces, chi_sim, chi_tra, chr, cym, dan, deu, dzo, ell, eng, enm, epo, est, eus, fas, fin, fra, frk, frm, gle, glg, grc, guj, hat, heb, hin, hrv, hun, iku, ind, isl, ita, ita_old, jav, jpn, kan, kat, kat_old, kaz, khm, kir, kor, kur, lao, lat, lav, lit, mal, mar, mkd, mlt, msa, mya, nep, nld, nor, ori, pan, pol, por, pus, ron, rus, san, sin, slk, slv, spa, spa_old, sqi, srp, srp_latn, swa, swe, syr, tam, tel, tgk, tgl, tha, tir, tur, uig, ukr, urd, uzb, uzb_cyrl, vie, yid');
});

test('when the ocr language is not specified, undefined or empty array, the default `eng` is used', () => {
const { config } = parseConfig({ rawConfig: { tesseract: { languages: [] } } });
expect(config.tesseract.languages).to.eql(['eng']);
});

test('the ocr language can be a single language', () => {
const { config } = parseConfig({ rawConfig: { tesseract: { languages: ['fra'] } } });
expect(config.tesseract.languages).to.eql(['fra']);
});
});
});
21 changes: 21 additions & 0 deletions src/config.ts
@@ -0,0 +1,21 @@
import type { ExtractorConfig, PartialExtractorConfig } from './types';
import { languages as tesseractLanguages } from 'tesseract.js';

const languages = Object.values(tesseractLanguages);

export function parseConfig({ rawConfig = {} }: { rawConfig?: PartialExtractorConfig } = {}): { config: ExtractorConfig } {
const ocrLanguages = rawConfig.tesseract?.languages ?? [];
const invalidLanguages = ocrLanguages.filter(language => !languages.includes(language));

if (invalidLanguages.length > 0) {
throw new Error(`Invalid languages for tesseract: ${invalidLanguages.join(', ')}. Valid languages are: ${languages.join(', ')}`);
}

return {
config: {
tesseract: {
languages: ocrLanguages.length > 0 ? ocrLanguages : ['eng'],
},
},
};
}
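
Reviewer note: a minimal, self-contained sketch of the validation and default behaviour that `parseConfig` implements above. The `SUPPORTED_LANGUAGES` list is a small illustrative subset (the PR reads the full list from `tesseract.js`), and `parseConfigSketch`, `ConfigSketch` and `RawConfigSketch` are hypothetical names that are not part of the PR.

```typescript
// Illustrative subset only; the PR derives the full list from the tesseract.js `languages` export.
const SUPPORTED_LANGUAGES: string[] = ['deu', 'eng', 'fra', 'spa'];

type ConfigSketch = { tesseract: { languages: string[] } };
type RawConfigSketch = undefined | { tesseract?: { languages?: string[] } };

function parseConfigSketch({ rawConfig }: { rawConfig?: RawConfigSketch } = {}): { config: ConfigSketch } {
  // A missing config, a missing `tesseract` key and a missing `languages` key all collapse to [].
  const requested = rawConfig?.tesseract?.languages ?? [];
  const invalid = requested.filter(language => !SUPPORTED_LANGUAGES.includes(language));

  if (invalid.length > 0) {
    throw new Error(`Invalid languages for tesseract: ${invalid.join(', ')}. Valid languages are: ${SUPPORTED_LANGUAGES.join(', ')}`);
  }

  // An empty request falls back to English, mirroring the default in the PR.
  return { config: { tesseract: { languages: requested.length > 0 ? requested : ['eng'] } } };
}

// No config at all: the languages fall back to ['eng'].
console.log(parseConfigSketch().config.tesseract.languages);

// Several valid languages are passed through unchanged.
console.log(parseConfigSketch({ rawConfig: { tesseract: { languages: ['fra', 'deu'] } } }).config.tesseract.languages);
```

Validating up front means an unsupported code fails with a readable message before any OCR worker is created, rather than surfacing later as a failure inside the extractor.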
4 changes: 3 additions & 1 deletion src/extractors.models.ts
@@ -1,9 +1,11 @@
+ import type { ExtractorConfig } from './types';
+
export type ExtractorDefinition = ReturnType<typeof defineTextExtractor>;

export function defineTextExtractor(args: {
name: string;
mimeTypes: string[];
- extract: (args: { arrayBuffer: ArrayBuffer }) => Promise<{ content: string }>;
+ extract: (args: { arrayBuffer: ArrayBuffer; config: ExtractorConfig }) => Promise<{ content: string }>;
}) {
return args;
}
29 changes: 19 additions & 10 deletions src/extractors.usecases.test.ts
@@ -4,7 +4,7 @@ import { glob } from 'tinyglobby';
import { describe, expect, test } from 'vitest';
import { extractText, extractTextFromBlob, extractTextFromFile } from './extractors.usecases';

- const fixtures = await glob(['fixtures/*', '!fixtures/*.expected']);
+ const fixturesDir = await glob(['fixtures/*'], { onlyDirectories: true });

describe('extractors usecases', () => {
describe('extractText', () => {
@@ -32,22 +32,31 @@ describe('extractors usecases', () => {
});

describe('text is extracted from fixtures files', async () => {
- test('at least one fixture file is found', () => {
- expect(fixtures.length).to.be.greaterThan(0);
+ test('at least one fixture file is present', () => {
+ expect(fixturesDir.length).to.be.greaterThan(0);
});

- for (const fixture of fixtures) {
- test(`fixture ${fixture}`, async () => {
- const arrayBuffer = (await fs.readFile(fixture)).buffer as ArrayBuffer;
- const mimeType = mime.getType(fixture);
+ for (const fixture of fixturesDir) {
+ // use test.concurrent to run the tests in parallel -> need to use the provided expect
+ test.concurrent(`fixture ${fixture}`, async ({ expect }) => {

Copilot AI Jun 29, 2025

[nitpick] Consider using the globally imported 'expect' instead of destructuring it from the test context in the test.concurrent callback, as this pattern may be non-standard for Vitest.

Suggested change
- test.concurrent(`fixture ${fixture}`, async ({ expect }) => {
+ test.concurrent(`fixture ${fixture}`, async () => {

+ const fixtureFilesPaths = await glob([`${fixture}/*`]);
+ const inputFilePath = fixtureFilesPaths.find(name => name.match(/\/\d{3}\.input\.\w+$/));
+ const configFilePath = fixtureFilesPaths.find(name => name.match(/\/\d{3}\.config\.ts$/));

- const { textContent, error, extractorName } = await extractText({ arrayBuffer, mimeType });
+ const config = configFilePath ? (await import(configFilePath)).config : undefined;
+
+ const arrayBuffer = (await fs.readFile(inputFilePath)).buffer as ArrayBuffer;
+ const mimeType = mime.getType(inputFilePath);
+
+ const { textContent, error, extractorName } = await extractText({ arrayBuffer, mimeType, config });

expect(error).to.eql(undefined);
expect(extractorName).to.not.eql(undefined);

- const snapshotFilename = fixture.split('/').pop().replace(/\..*$/, '.expected');
- await expect(textContent).toMatchFileSnapshot(`../fixtures/${snapshotFilename}`, 'Fixture does not match snapshot');
+ const fixtureNumber = fixture.split('/').filter(Boolean).pop().slice(0, 3);
+ const expectedFilePath = `../${fixture}/${fixtureNumber}.expected.txt`;
+
+ await expect(textContent).toMatchFileSnapshot(expectedFilePath, 'Fixture does not match snapshot');
});
}
});
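
Reviewer note: the new fixture layout (one directory per fixture, holding `NNN.input.*`, an optional `NNN.config.ts` and `NNN.expected.txt`) can be sketched as a pure function, which may make the path handling in the test above easier to follow. `resolveFixturePaths` and `FixturePaths` are hypothetical names, not part of the PR, and stripping the trailing slash is an assumption made here so that both directory spellings behave the same.

```typescript
type FixturePaths = {
  inputFilePath: string | undefined;
  configFilePath: string | undefined;
  expectedFilePath: string;
};

function resolveFixturePaths(fixtureDir: string, filePaths: string[]): FixturePaths {
  // Drop any trailing slash so 'fixtures/009-x' and 'fixtures/009-x/' give the same result.
  const dir = fixtureDir.replace(/\/+$/, '');

  // Same patterns as the test: a three-digit prefix followed by the role of the file.
  const inputFilePath = filePaths.find(path => /\/\d{3}\.input\.\w+$/.test(path));
  const configFilePath = filePaths.find(path => /\/\d{3}\.config\.ts$/.test(path));

  // The fixture number is the first three characters of the directory name.
  const fixtureNumber = (dir.split('/').pop() ?? '').slice(0, 3);

  // The snapshot path is relative to the test file, which lives in src/.
  const expectedFilePath = `../${dir}/${fixtureNumber}.expected.txt`;

  return { inputFilePath, configFilePath, expectedFilePath };
}

const paths = resolveFixturePaths('fixtures/009-png-with-french-text/', [
  'fixtures/009-png-with-french-text/009.config.ts',
  'fixtures/009-png-with-french-text/009.expected.txt',
  'fixtures/009-png-with-french-text/009.input.png',
]);

console.log(paths);
```

Returning `undefined` for a missing input or config file leaves the caller to decide whether that is an error; a config file is optional in this layout, while a missing input file would presumably fail the fixture.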
7 changes: 5 additions & 2 deletions src/extractors.usecases.ts
@@ -1,10 +1,13 @@
+ import type { ExtractorConfig } from './types';
+ import { parseConfig } from './config';
import { getExtractor } from './extractors.registry';

- export async function extractText({ arrayBuffer, mimeType }: { arrayBuffer: ArrayBuffer; mimeType: string }): Promise<{
+ export async function extractText({ arrayBuffer, mimeType, config: rawConfig }: { arrayBuffer: ArrayBuffer; mimeType: string; config?: ExtractorConfig }): Promise<{
extractorName: string | undefined;
textContent: string | undefined;
error?: Error;
}> {
+ const { config } = parseConfig({ rawConfig });
const { extractor } = getExtractor({ mimeType });

if (!extractor) {
@@ -15,7 +18,7 @@ export async function extractText({ arrayBuffer, mimeType }: { arrayBuffer: Arra
}

try {
- const { content } = await extractor.extract({ arrayBuffer });
+ const { content } = await extractor.extract({ arrayBuffer, config });

return {
extractorName: extractor.name,
6 changes: 4 additions & 2 deletions src/extractors/img.extractor.ts
@@ -10,10 +10,12 @@ export const imageExtractorDefinition = defineTextExtractor({
'image/webp',
'image/gif',
],
- extract: async ({ arrayBuffer }) => {
+ extract: async ({ arrayBuffer, config }) => {
+ const { languages } = config.tesseract;
+
const buffer = Buffer.from(arrayBuffer);

- const worker = await createWorker();
+ const worker = await createWorker(languages);

const { data: { text } } = await worker.recognize(buffer);
await worker.terminate();
9 changes: 9 additions & 0 deletions src/types.ts
@@ -0,0 +1,9 @@
import type { DeepPartial } from '@corentinth/chisels';

export type ExtractorConfig = {
tesseract: {
languages: string[];
};
};

export type PartialExtractorConfig = undefined | DeepPartial<ExtractorConfig>;
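
Reviewer note: a rough sketch of which shapes `PartialExtractorConfig` is meant to accept. `DeepPartialSketch` is a hypothetical stand-in written for this note; the actual `DeepPartial` exported by `@corentinth/chisels` may be defined differently. The other `*Sketch` names and `requestedLanguages` are likewise illustrative only.

```typescript
// Hypothetical stand-in for DeepPartial: every key becomes optional at every level.
// Arrays are kept as-is so that `languages` stays a plain string[].
type DeepPartialSketch<T> = {
  [K in keyof T]?: T[K] extends readonly unknown[]
    ? T[K]
    : T[K] extends object
      ? DeepPartialSketch<T[K]>
      : T[K];
};

type ExtractorConfigSketch = { tesseract: { languages: string[] } };
type PartialExtractorConfigSketch = undefined | DeepPartialSketch<ExtractorConfigSketch>;

// Each of these values is accepted by the partial type.
const accepted: PartialExtractorConfigSketch[] = [
  undefined,
  {},
  { tesseract: {} },
  { tesseract: { languages: ['fra'] } },
];

// Reads the requested languages while tolerating a gap at any level.
function requestedLanguages(raw: PartialExtractorConfigSketch): string[] {
  return raw?.tesseract?.languages ?? [];
}

for (const raw of accepted) {
  console.log(requestedLanguages(raw));
}
```

Keeping the fully resolved `ExtractorConfig` separate from the partial input type lets extractors read `config.tesseract.languages` without optional chaining, since the gaps are filled once at parse time.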
7 changes: 7 additions & 0 deletions src/types/tesseract.d.ts
@@ -0,0 +1,7 @@
export * from 'tesseract.js';

declare module 'tesseract.js' {
type LanguageKey = 'AFR' | 'AMH' | 'ARA' | 'ASM' | 'AZE' | 'AZE_CYRL' | 'BEL' | 'BEN' | 'BOD' | 'BOS' | 'BUL' | 'CAT' | 'CEB' | 'CES' | 'CHI_SIM' | 'CHI_TRA' | 'CHR' | 'CYM' | 'DAN' | 'DEU' | 'DZO' | 'ELL' | 'ENG' | 'ENM' | 'EPO' | 'EST' | 'EUS' | 'FAS' | 'FIN' | 'FRA' | 'FRK' | 'FRM' | 'GLE' | 'GLG' | 'GRC' | 'GUJ' | 'HAT' | 'HEB' | 'HIN' | 'HRV' | 'HUN' | 'IKU' | 'IND' | 'ISL' | 'ITA' | 'ITA_OLD' | 'JAV' | 'JPN' | 'KAN' | 'KAT' | 'KAT_OLD' | 'KAZ' | 'KHM' | 'KIR' | 'KOR' | 'KUR' | 'LAO' | 'LAT' | 'LAV' | 'LIT' | 'MAL' | 'MAR' | 'MKD' | 'MLT' | 'MSA' | 'MYA' | 'NEP' | 'NLD' | 'NOR' | 'ORI' | 'PAN' | 'POL' | 'POR' | 'PUS' | 'RON' | 'RUS' | 'SAN' | 'SIN' | 'SLK' | 'SLV' | 'SPA' | 'SPA_OLD' | 'SQI' | 'SRP' | 'SRP_LATN' | 'SWA' | 'SWE' | 'SYR' | 'TAM' | 'TEL' | 'TGK' | 'TGL' | 'THA' | 'TIR' | 'TUR' | 'UIG' | 'UKR' | 'URD' | 'UZB' | 'UZB_CYRL' | 'VIE' | 'YID';

export const languages: Record<LanguageKey, string>;
}