SA-160_Runner Code for evaluation #21
Title: feat: Add configuration-driven evaluation runner script
Background
As part of breaking the original SA-160 PR down into smaller parts, this change adds the final step: running the evaluation, as defined in a configuration file, against the output file produced upstream in the data pipeline.
Description
This pull request introduces a new, flexible evaluation runner script (`metrics_runner.py`) for the Survey Assist project. The script runs multiple, distinct evaluation scenarios defined in an external TOML configuration file, making performance analysis more efficient, repeatable, and easier to modify.

The runner iterates through each case defined in the config, dynamically setting up the comparison between model predictions and clerical coder labels (e.g. top 1 vs. top 1, top 1 vs. any of top 5). It then calculates and prints a clear, formatted report for each scenario, including full match accuracy, 2-digit accuracy, and the Jaccard similarity index.
This approach replaces hard-coded evaluation logic with a configuration-driven system for ongoing model assessment and experimentation.
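To make that concrete, below is a minimal, self-contained sketch of what a config-driven evaluation loop can look like. It is not the actual `metrics_runner.py`: the column names, TOML keys, and the single metric computed are assumptions for illustration, and the real script delegates to `ColumnConfig` and the `LabelAccuracy` analyzer rather than the inline pandas logic shown here.

```python
# Minimal sketch of a config-driven evaluation loop. Column names, TOML
# keys, and file paths are invented for illustration; the real
# metrics_runner.py uses ColumnConfig and LabelAccuracy instead.
import tomllib  # Python 3.11+ standard library

import pandas as pd


def run_cases(data_path: str, config_path: str) -> None:
    """Run every evaluation case defined in the TOML config."""
    df = pd.read_csv(data_path)
    with open(config_path, "rb") as file:
        config = tomllib.load(file)

    for name, case in config["cases"].items():
        # Columns to compare for this case, e.g. top 1 vs. any of top 5.
        model_cols = [f"model_label_{i + 1}" for i in range(case["num_model_labels"])]
        clerical_cols = [f"clerical_label_{i + 1}" for i in range(case["num_clerical_labels"])]

        # Optionally restrict the evaluation to unambiguous records.
        subset = df[df["is_unambiguous"]] if case.get("unambiguous_only") else df

        # Full match: any selected model prediction appears among the
        # selected clerical coder labels for the same record.
        full_match = subset.apply(
            lambda row: bool(set(row[model_cols]) & set(row[clerical_cols])),
            axis=1,
        )
        print(f"{name}: full match accuracy {full_match.mean():.2%} "
              f"over {len(subset)} records")


if __name__ == "__main__":
    # Hypothetical paths; substitute your own data and config files.
    run_cases("data/processed_output.csv", "scripts/evaluation_config.toml")
```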
Key Changes
- New evaluation runner (`metrics_runner.py`): a command-line tool that orchestrates the entire evaluation process. It handles loading data, parsing the configuration, and printing results.
- New configuration file (`evaluation_config.toml`): introduces a configuration file where users can easily define or modify evaluation cases (an illustrative sketch follows this list). Each case can specify:
  - how many model predictions (LLMs) to consider;
  - how many clerical coder labels (CCs) to compare against;
  - whether to restrict the evaluation to Unambiguous records.
- Dynamic column configuration: the script builds a `ColumnConfig` for each case based on the TOML file, ensuring the `LabelAccuracy` analyzer receives the correct parameters for its calculations.
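For orientation, a case definition might look like the following. The table and key names here are placeholders, not necessarily those expected by `metrics_runner.py`; the example `evaluation_config.toml` in `scripts/` is the authoritative reference.

```toml
# Illustrative only: key names are assumptions.

[cases.top1_vs_top1]
num_model_labels = 1      # model (LLM) predictions to consider
num_clerical_labels = 1   # clerical coder (CC) labels to compare against
unambiguous_only = false

[cases.top1_vs_any_of_top5]
num_model_labels = 1
num_clerical_labels = 5
unambiguous_only = true   # restrict to Unambiguous records
```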
How to Test
1. Use `evaluation_config.toml` to define the metrics wanted. An example is in `scripts/`.
2. Ensure you have a processed data file (e.g., in `data/`).
3. Run the script from your terminal with the paths to your data and config files (see the example command after these steps).
4. Verify the output: each case should print a formatted report including full match accuracy, 2-digit accuracy, and the Jaccard similarity index.
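As a placeholder for step 3, an invocation might look like this; the flag names and file names are assumptions, so check the script's own help text for the real interface:

```bash
# Hypothetical invocation: argument and file names are placeholders.
python scripts/metrics_runner.py \
    --data data/processed_output.csv \
    --config scripts/evaluation_config.toml
```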
✅ Checklist
- [ ] Terraform changes formatted and validated (`terraform fmt` & `terraform validate`)