marklondon999 commented on Jul 30, 2025

Title: feat: Add configuration-driven evaluation runner script

Background

As part of breaking the functionality of the original SA-160 PR into smaller parts, this adds the final step: running the evaluation, driven by a configuration file, against the output file created upstream in the data pipeline.

Description

This pull request introduces a new, flexible evaluation runner script (metrics_runner.py) for the Survey Assist project. The script runs multiple, distinct evaluation scenarios defined in an external TOML configuration file, making performance analysis more efficient, repeatable, and easier to modify.

The runner iterates through each case defined in the config, dynamically setting up the comparison between model predictions and clerical coder labels (e.g., top 1 vs. top 1, top 1 vs. any of top 5, etc.). It then calculates and prints a clear, formatted report for each scenario, including full match accuracy, 2-digit accuracy, and the Jaccard similarity index.
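To make these metrics concrete, here is a minimal, self-contained sketch of what a full match, a 2-digit match, and the Jaccard similarity index mean for a single record. The function names, variable names, and toy codes below are invented for illustration; this is not the LabelAccuracy implementation.

    # Illustrative only: toy definitions of the three reported metrics for one
    # record, comparing a set of model predictions against clerical coder labels.
    def full_match(model_codes: set[str], clerical_codes: set[str]) -> bool:
        # Full match: at least one model code is identical to a clerical code.
        return bool(model_codes & clerical_codes)

    def two_digit_match(model_codes: set[str], clerical_codes: set[str]) -> bool:
        # 2-digit match: the two sides agree on the first two digits of a code.
        return bool({c[:2] for c in model_codes} & {c[:2] for c in clerical_codes})

    def jaccard(model_codes: set[str], clerical_codes: set[str]) -> float:
        # Jaccard similarity: |intersection| / |union| of the two code sets.
        union = model_codes | clerical_codes
        return len(model_codes & clerical_codes) / len(union) if union else 0.0

    # Example: top-5 model predictions vs. two clerical codes for one record.
    preds = {"86101", "86102", "87100", "86900", "88990"}
    labels = {"86101", "86210"}
    print(full_match(preds, labels))         # True (86101 appears in both sets)
    print(two_digit_match(preds, labels))    # True (both contain codes starting "86")
    print(round(jaccard(preds, labels), 4))  # 0.1667 (1 shared code / 6 in the union)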

This approach replaces hard-coded evaluation logic with a configuration-driven system for ongoing model assessment and experimentation.


Key Changes

  • ✨ New Script (metrics_runner.py): A command-line tool that orchestrates the entire evaluation process. It handles loading data, parsing the configuration, and printing results.
  • ⚙️ TOML Configuration (evaluation_config.toml): Introduces a new configuration file where users can easily define or modify evaluation cases. Each case can specify:
    • The number of model (LLM) predictions to consider.
    • The number of clerical codes (CCs) to compare against.
    • Whether to filter for only Unambiguous records.
  • 🧠 Dynamic Evaluation Logic: The script dynamically creates a ColumnConfig for each case based on the TOML file, ensuring the LabelAccuracy analyzer receives the correct parameters for its calculations (see the sketch after this list).
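To make the shape of the configuration concrete, the sketch below shows what a case entry might look like and how the runner could turn it into per-case parameters. The TOML field names and the printed summary are assumptions for illustration; the actual schema in evaluation_config.toml and the ColumnConfig/LabelAccuracy interfaces may differ.

    # Sketch only: the TOML schema shown here is an assumption, not the real one.
    import tomllib  # standard library in Python 3.11+

    EXAMPLE_CONFIG = """
    [[cases]]
    name = "Top CC vs top SA, unambiguous only"
    num_model_predictions = 1   # how many model (LLM) predictions to consider
    num_clerical_codes = 1      # how many clerical codes to compare against
    unambiguous_only = true     # filter to Unambiguous records
    """

    config = tomllib.loads(EXAMPLE_CONFIG)
    for case in config["cases"]:
        # The real runner would build a ColumnConfig here and hand it to the
        # LabelAccuracy analyzer; this only shows the mapping from TOML fields
        # to per-case parameters.
        print(
            f"Running {case['name']}: "
            f"{case['num_clerical_codes']} CC vs {case['num_model_predictions']} SA, "
            f"unambiguous_only={case['unambiguous_only']}"
        )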

How to Test

  1. Use evaluation_config.toml to define the metrics you want. An example is provided in scripts/.

  2. Ensure you have a processed data file (e.g., in data/).

  3. Run the script from your terminal with the paths to your data and config files:

    python scripts/metrics_runner.py data/final_processed_output.csv scripts/evaluation_config.toml
  4. Verify the Output:

    • Check that a formatted report is printed for each of the evaluation cases defined in the TOML file.
    • Confirm that the logs correctly state the parameters for each case (e.g., "Running Case 1: Match of top CC vs top SA..."); an optional scripted check is sketched below.
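If you would rather script the check in step 4 than eyeball it, one optional approach (not part of this PR, and assuming the report is printed to stdout) is to capture the runner's output and assert that the per-case log lines are present:

    # Optional, illustrative check only: run the script and confirm the
    # per-case log lines appear in its captured output.
    import subprocess

    result = subprocess.run(
        [
            "python",
            "scripts/metrics_runner.py",
            "data/final_processed_output.csv",
            "scripts/evaluation_config.toml",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    assert "Running Case 1" in result.stdout, "expected per-case log line not found"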

✅ Checklist

Please confirm you've completed these checks before requesting a review.

  • [✅] Code is formatted using Black
  • [✅] Imports are sorted using isort
  • [✅] Code passes linting with Ruff, Pylint, and Mypy
  • [✅] Security checks pass using Bandit
  • [✅] API and Unit tests are written and pass using pytest
  • [ ] Terraform files (if applicable) follow best practices and have been validated (terraform fmt & terraform validate)
  • [✅] DocStrings follow Google-style and are added as per Pylint recommendations
  • [ ] Documentation has been updated if needed

🔍 Example Run and Expected Output

python scripts/metrics_runner.py data/final_processed_output.csv scripts/evaluation_config.toml

Expected output:

▶️  Running Case 1: Match of top CC vs top SA on unambiguously Codable

Full Match Details:
🎯 Full Match Accuracy: 55.20% (706/1279)

2-Digit Match Details:
🔢 2-Digit Match Accuracy: 70.80% (906/1279)

--- 📊 Overall Summary ---
Total Records Analyzed:     1291
Jaccard Similarity Index:   0.5469
Overall Accuracy (Full):    55.20%
------------------------------
