@@ -54,18 +54,32 @@ cd klaudbiusz
 # Evaluate all apps
 uv run cli/evaluate_all.py
 
+# Parallel evaluation (faster for large batches)
+uv run cli/evaluate_all.py -j 4  # Run 4 evaluations in parallel
+uv run cli/evaluate_all.py -j 0  # Auto-detect CPU count
+uv run cli/evaluate_all.py --parallel 8  # Long form
+
 # Partial evaluation (filter apps)
 uv run cli/evaluate_all.py --limit 5  # First 5 apps
 uv run cli/evaluate_all.py --apps app1 app2  # Specific apps
 uv run cli/evaluate_all.py --pattern "customer*"  # Pattern matching
 uv run cli/evaluate_all.py --skip 10 --limit 5  # Skip first 10, evaluate next 5
+uv run cli/evaluate_all.py --start-from app5  # Start from specific app
+
+# Custom directory
+uv run cli/evaluate_all.py --dir /path/to/apps  # Evaluate apps in custom directory
+
+# Staging environment (for testing)
+uv run cli/evaluate_all.py --staging  # Log to staging MLflow experiment
 
 # Evaluate single app
 uv run cli/evaluate_app.py ../app/customer-churn-analysis
 ```
 
 **Results are automatically logged to MLflow:** Navigate to `ML → Experiments → /Shared/klaudbiusz-evaluations` in Databricks UI / Googfooding.
 
+**Performance:** Parallel evaluation with `-j` can provide 3-4x speedup for large batches (e.g., 20 apps in 5 min vs 15+ min sequential).
+
 ## Evaluation Framework
 
 We use **9 objective metrics** to measure autonomous deployability:
@@ -143,7 +157,7 @@ klaudbiusz/
 
 1. Write natural language prompt
 2. Generate: `uv run cli/single_run.py "your prompt"` or `uv run cli/bulk_run.py`
-3. Evaluate: `uv run cli/evaluate_all.py`
+3. Evaluate: `uv run cli/evaluate_all.py -j 0` (parallel, auto-detect CPUs)
 4. Review: `cat EVALUATION_REPORT.md`
 5. Deploy apps that pass checks
 
@@ -169,9 +183,9 @@ shasum -a 256 -c klaudbiusz_evaluation_*.tar.gz.sha256
 
 ## Requirements
 
-- Python 3.11+
+- Python 3.12+
 - uv (Python package manager)
-- Docker (for builds and runtime checks)
+- Docker (for Dagger containerized evaluations)
 - Node.js 18+ (for generated apps)
 - Databricks workspace with access token
 
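The diff above documents `-j N` / `--parallel N`, with `-j 0` meaning "auto-detect CPU count". As a rough illustration of that behavior, here is a minimal Python sketch. It is not taken from `cli/evaluate_all.py`: the names `resolve_workers`, `evaluate_app`, and `evaluate_all` are hypothetical, and the use of a thread pool is an assumption about how such a flag could be wired up.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def resolve_workers(jobs: int) -> int:
    """Map a -j value to a worker count; 0 means 'use every detected CPU'."""
    if jobs < 0:
        raise ValueError("jobs must be >= 0")
    if jobs == 0:
        # os.cpu_count() may return None, so fall back to one worker.
        return os.cpu_count() or 1
    return jobs


def evaluate_app(app_dir: str) -> dict:
    """Hypothetical stand-in for a single app evaluation."""
    return {"app": app_dir, "ok": True}


def evaluate_all(app_dirs: list[str], jobs: int = 1) -> list[dict]:
    """Evaluate apps concurrently; results come back in input order."""
    workers = resolve_workers(jobs)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the order of the inputs.
        return list(pool.map(evaluate_app, app_dirs))


if __name__ == "__main__":
    for result in evaluate_all(["app1", "app2", "app3"], jobs=0):
        print(result["app"], result["ok"])
```

Threads suit this sketch if each evaluation mostly waits on subprocesses or containers. A CPU-bound evaluator would more plausibly use a process pool.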