Production-ready pipeline to transform large PDFs into structured data at scale.
- True streaming (no full-document loads)
- Parallel page batches with bounded memory
- Idempotent (SHA-256 hashing) with DB UPSERT
- Observability: Prometheus metrics, structured logs, health checks
- Resilience: retries, circuit breaker, rate limiting
- Security: env-based secrets, basic input validation
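
The idempotency bullet above can be illustrated with a minimal sketch. It is not taken from run_all_v2.py: the `documents` table, its columns, and the helper names `sha256_of_file`, `open_db`, and `upsert_document` are hypothetical, and SQLite (3.24 or newer, for `ON CONFLICT ... DO UPDATE`) stands in for whatever database the pipeline actually uses. The file is hashed in chunks, so the PDF is never fully loaded into memory.

```python
import hashlib
import sqlite3


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so it is never fully loaded."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def open_db(db_path=":memory:"):
    """Open the database and create the (hypothetical) documents table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "sha256 TEXT PRIMARY KEY, path TEXT NOT NULL, status TEXT NOT NULL)"
    )
    return conn


def upsert_document(conn, path, status):
    """Insert or update one row keyed by content hash; safe to re-run."""
    doc_hash = sha256_of_file(path)
    conn.execute(
        """
        INSERT INTO documents (sha256, path, status)
        VALUES (?, ?, ?)
        ON CONFLICT(sha256) DO UPDATE SET
            path = excluded.path,
            status = excluded.status
        """,
        (doc_hash, str(path), status),
    )
    conn.commit()
    return doc_hash
```

Calling `upsert_document` twice on the same file leaves a single row whose status reflects the latest call, which is what makes re-running the pipeline on the same input harmless in this sketch.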
python -m venv .venv && source .venv/bin/activate
pip install -e .
# Put a PDF at data/input/report.pdf
cp your.pdf data/input/report.pdf
# Run end-to-end
python run_all_v2.py
# Health check
make health
# Optional: one-shot run via the marker target
make marker PDF="data/input/report.pdf"
Outputs:
- outputs/json/report.json: normalized JSON
- outputs/markdown/report.md: Markdown “book”
- outputs/csv/table_pX_Y.csv: extracted tables
See run_all_v2.py for the full pipeline and src/ for the individual modules.