zbkilla/pdf-pipeline

PDF → Structured Data Playbook v2.1

Production-ready pipeline to transform large PDFs into structured data at scale.

  • True streaming (no full-document loads)
  • Parallel page batches with bounded memory
  • Idempotent reruns (SHA-256 content hashing) with DB UPSERT
  • Observability: Prometheus metrics, structured logs, health checks
  • Resilience: retries, circuit breaker, rate limiting
  • Security: env-based secrets, basic input validation
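As an illustration of the "parallel page batches with bounded memory" item above, here is a minimal sketch using only the standard library. The names `batched`, `process_in_batches`, `batch_size`, and `max_workers` are hypothetical and are not taken from this repository; the actual pipeline in run_all_v2.py may batch pages differently.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batched(iterable, size):
    """Yield lists of at most `size` items, pulling lazily from `iterable`."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def process_in_batches(pages, worker, batch_size=8, max_workers=4):
    """Apply `worker` to each page, one batch at a time, in page order.

    Only one batch of pages is pulled from `pages` at a time, so memory use
    is bounded by `batch_size` rather than by the size of the document.
    (Hypothetical helper, not the repository's API.)
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for batch in batched(pages, batch_size):
            # pool.map preserves input order within the batch.
            yield from pool.map(worker, batch)
```

Because `process_in_batches` is a generator, a caller can stream results to disk as they arrive instead of collecting them in a list first.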

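The idempotency item above (SHA-256 hashing plus a DB UPSERT) could look roughly like the following. This is a sketch under stated assumptions: it uses SQLite (3.24 or newer, for `ON CONFLICT ... DO UPDATE`) purely for illustration, and the `documents` table, its columns, and the function names are hypothetical, not the repository's schema.

```python
import hashlib
import sqlite3


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so it is never loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def init_db(conn):
    """Create the (hypothetical) documents table, keyed by content hash."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "doc_hash TEXT PRIMARY KEY, name TEXT, status TEXT)"
    )


def upsert_document(conn, doc_hash, name, status):
    """Insert a row for this hash, or update the existing row in place."""
    conn.execute(
        """
        INSERT INTO documents (doc_hash, name, status)
        VALUES (?, ?, ?)
        ON CONFLICT(doc_hash) DO UPDATE SET
            name = excluded.name,
            status = excluded.status
        """,
        (doc_hash, name, status),
    )
    conn.commit()
```

Keying rows by the content hash means that rerunning the pipeline on the same PDF updates one row rather than creating duplicates.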
Quickstart

python -m venv .venv && source .venv/bin/activate
pip install -e .

# Put a PDF at data/input/report.pdf
cp your.pdf data/input/report.pdf

# Run end-to-end
python run_all_v2.py

# Health check
make health

# Optional: one-shot conversion with marker
make marker PDF="data/input/report.pdf"

Outputs

  • outputs/json/report.json — normalized JSON
  • outputs/markdown/report.md — Markdown “book”
  • outputs/csv/table_pX_Y.csv — extracted tables
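To make the table naming pattern above concrete, here is a small sketch of writing one extracted table to disk. It assumes that X in `table_pX_Y.csv` is the page number and Y is the index of the table on that page; the README does not confirm this, and `table_csv_path` and `write_table` are hypothetical helpers rather than functions from src/.

```python
import csv
from pathlib import Path


def table_csv_path(out_dir, page, index):
    """Build a path like outputs/csv/table_p3_1.csv.

    Assumption: X = page number, Y = table index on that page.
    """
    return Path(out_dir) / f"table_p{page}_{index}.csv"


def write_table(out_dir, page, index, rows):
    """Write `rows` (an iterable of row sequences) to the table's CSV file."""
    path = table_csv_path(out_dir, page, index)
    path.parent.mkdir(parents=True, exist_ok=True)
    # newline="" lets the csv module control line endings on every platform.
    with path.open("w", newline="", encoding="utf-8") as fh:
        csv.writer(fh).writerows(rows)
    return path
```

For example, `write_table("outputs/csv", 3, 1, rows)` would write to `outputs/csv/table_p3_1.csv` under this assumed convention.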

See run_all_v2.py for the full pipeline and src/ for the individual modules.
