SA-160_Runner Code for evaluation #21
Title: feat: Add configuration-driven evaluation runner script
Background
As part of breaking the original SA-160 PR down into smaller parts, this change adds the final step: running the evaluation, as defined in a configuration file, against the output file produced upstream in the data pipeline.
Description
This pull request introduces a new, flexible evaluation runner script (`metrics_runner.py`) for the Survey Assist project. The script runs multiple, distinct evaluation scenarios defined in an external TOML configuration file, making performance analysis more efficient, repeatable, and easier to modify.

The runner iterates through each case defined in the config, dynamically setting up the comparison between model predictions and clerical coder labels (e.g. top 1 vs. top 1, top 1 vs. any of top 5). It then calculates and prints a clear, formatted report for each scenario, including full match accuracy, 2-digit accuracy, and the Jaccard similarity index.
This approach replaces hard-coded evaluation logic with a configuration-driven system for ongoing model assessment and experimentation.
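To make that concrete, below is a minimal, self-contained sketch of what a config-driven evaluation loop can look like. It is not the actual `metrics_runner.py`: the column names, TOML keys, and the single metric computed are assumptions for illustration, and the real script delegates to `ColumnConfig` and the `LabelAccuracy` analyzer rather than the inline pandas logic shown here.

```python
# Minimal sketch of a config-driven evaluation loop. Column names, TOML
# keys, and file paths are invented for illustration; the real
# metrics_runner.py uses ColumnConfig and LabelAccuracy instead.
import tomllib  # Python 3.11+ standard library

import pandas as pd


def run_cases(data_path: str, config_path: str) -> None:
    """Run every evaluation case defined in the TOML config."""
    df = pd.read_csv(data_path)
    with open(config_path, "rb") as file:
        config = tomllib.load(file)

    for name, case in config["cases"].items():
        # Columns to compare for this case, e.g. top 1 vs. any of top 5.
        model_cols = [f"model_label_{i + 1}" for i in range(case["num_model_labels"])]
        clerical_cols = [f"clerical_label_{i + 1}" for i in range(case["num_clerical_labels"])]

        # Optionally restrict the evaluation to unambiguous records.
        subset = df[df["is_unambiguous"]] if case.get("unambiguous_only") else df

        # Full match: any selected model prediction appears among the
        # selected clerical coder labels for the same record.
        full_match = subset.apply(
            lambda row: bool(set(row[model_cols]) & set(row[clerical_cols])),
            axis=1,
        )
        print(f"{name}: full match accuracy {full_match.mean():.2%} "
              f"over {len(subset)} records")


if __name__ == "__main__":
    # Hypothetical paths; substitute your own data and config files.
    run_cases("data/processed_output.csv", "scripts/evaluation_config.toml")
```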
Key Changes
- New evaluation runner (`metrics_runner.py`): a command-line tool that orchestrates the entire evaluation process. It handles loading data, parsing the configuration, and printing results.
- New configuration file (`evaluation_config.toml`): introduces a configuration file where users can easily define or modify evaluation cases (an illustrative sketch follows this list). Each case can specify:
  - how many model predictions (LLMs) to consider;
  - how many clerical coder labels (CCs) to compare against;
  - whether to restrict the evaluation to Unambiguous records.
- Dynamic column configuration: the script builds a `ColumnConfig` for each case based on the TOML file, ensuring the `LabelAccuracy` analyzer receives the correct parameters for its calculations.
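For orientation, a case definition might look like the following. The table and key names here are placeholders, not necessarily those expected by `metrics_runner.py`; the example `evaluation_config.toml` in `scripts/` is the authoritative reference.

```toml
# Illustrative only: key names are assumptions.

[cases.top1_vs_top1]
num_model_labels = 1      # model (LLM) predictions to consider
num_clerical_labels = 1   # clerical coder (CC) labels to compare against
unambiguous_only = false

[cases.top1_vs_any_of_top5]
num_model_labels = 1
num_clerical_labels = 5
unambiguous_only = true   # restrict to Unambiguous records
```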
How to Test
1. Use `evaluation_config.toml` to define the metrics wanted. An example is in `scripts/`.
2. Ensure you have a processed data file (e.g., in `data/`).
3. Run the script from your terminal with the paths to your data and config files (see the example command after these steps).
4. Verify the output: each case should print a formatted report including full match accuracy, 2-digit accuracy, and the Jaccard similarity index.
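As a placeholder for step 3, an invocation might look like this; the flag names and file names are assumptions, so check the script's own help text for the real interface:

```bash
# Hypothetical invocation: argument and file names are placeholders.
python scripts/metrics_runner.py \
    --data data/processed_output.csv \
    --config scripts/evaluation_config.toml
```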
✅ Checklist
- [ ] Terraform changes formatted and validated (`terraform fmt` & `terraform validate`)