Sa 160 output processing script #12
base: main
Conversation
…unts matched and unmatched
…iscard failed duplicates
…lean to coder_alignment.py
…a_for_analysis.py to allow full end to end run
…essing of data runs
…tebook run_json_processor
Review comment on notebooks/run_json_processor.py, line 1 ("# ---"):
Why does this need to be a notebook rather than a Python script? What are the advantages of having it as a notebook?
Review comment on notebooks/run_json_processor.py, lines 1-2 ("# ---", "# jupyter:"):
There should be a docstring explaining the purpose of the script at the top.
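For illustration, a module docstring along these lines (wording assumed, summarising the PR description below) would address this comment:

```python
"""Process Survey Assist batch-run JSON outputs into evaluation metrics.

Fetches raw JSON files from GCS, flattens them into a structured format,
merges them with the human-coded input SIC data, and produces performance
metrics, all driven by prepare_config.toml.
"""
```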
Title: Introduce JsonPreprocessor for Data Processing
Summary
This pull request introduces a configuration-driven workflow for processing raw JSON outputs from Survey Assist, preparing the data for analysis, and running evaluation metrics.
Background
A labelled set of clerical coder assignments has been run through Survey Assist, and the JSON output file(s) need to be processed.
Currently, there is no standardised method for processing the JSON files produced by batch runs of Survey Assist. This makes it difficult to evaluate the performance of different models or prompts. This change introduces a pipeline to solve this problem.
Motivation
This PR links the outputs of the Docker batch runs to the module that calculates the final metrics. The `JsonPreprocessor` class allows a developer to take the raw JSON output from a model run, flatten it into a structured format, merge it with the human-coded input SIC data, and generate a set of performance metrics, all controlled by a configuration file.
Changes Introduced
JsonPreprocessor Class:
A new class responsible for fetching and processing raw JSON files directly from Google Cloud Storage (GCS). It can operate in two modes: processing a single named file, or scanning a directory for all files created on or after a given date (see `[parameters]` below; a usage sketch follows this list).
Runner Script (run_json_processor.py):
A Jupyter notebook (stored as a Jupytext `.py` file) that uses a `.toml` configuration file to manage the entire workflow, from fetching data to producing final metrics.
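A minimal usage sketch of the intended workflow follows. The import path, constructor signature, and `process_json_files()` helper are assumptions for illustration; only `merge_eval_data()` appears verbatim in this PR description.

```python
import tomllib  # stdlib TOML parser, Python 3.11+

from json_preprocessor import JsonPreprocessor  # hypothetical module path

# Load the config that drives the whole run.
with open("prepare_config.toml", "rb") as f:
    config = tomllib.load(f)

# Assumed: the class is configured from the parsed TOML.
preprocessor = JsonPreprocessor(config)

# Assumed helper: fetch the raw JSON from GCS and flatten it into a DataFrame.
llm_processed_df = preprocessor.process_json_files()

# Shown in this PR: merge the flattened model output with the human-coded data.
full_output_df = preprocessor.merge_eval_data(llm_processed_df)
```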
Control
How the `prepare_config.toml` Works
The process is controlled by the `prepare_config.toml` file, which is broken down into two main sections:
[paths]
This section defines all the input and output locations in GCS.
- `batch_filepath`: The location of the original, human-coded input data. This is a source location.
- `gcs_bucket_name`: The name of the GCS bucket where data is stored.
- `gcs_json_dir`: The directory within the bucket where the raw JSON output files from model runs are located. This is a source location.
- `analysis_csv`: The location of the human-coded data after it has been enriched with quality flags by the `prepare_evaluation_data_for_analysis.py` script. This is a destination location.
- `named_file`: The full GCS path to a single JSON output file to be processed. This is only used if `single_file` is set to `"True"`. This is a source location.
- `merged_file_list`: A list of pre-processed CSV files that can be run through the final evaluation metrics. These are source locations.
[parameters]
This section controls the behaviour of the `JsonPreprocessor`.
- `single_file`: If set to `"True"`, the processor will only use the file specified in `named_file`. If set to `"False"` or omitted, it will scan the `gcs_json_dir` directory.
- `date_since`: When `single_file` is false, this tells the processor to only consider files in the directory created on or after this date (in `YYYYMMDD` format).
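For orientation, here is a sketch of what `prepare_config.toml` might look like, assembled from the keys described above. The values reuse the placeholder paths quoted in the test instructions below; the `gcs_json_dir` and `date_since` values are illustrative assumptions, not the committed config.

```toml
[paths]
batch_filepath = "gs://<my-bucket-name>/evaluation_data/DSC_Rep_Sample.csv"
gcs_bucket_name = "<my-bucket-name>"
gcs_json_dir = "analysis_outputs"  # assumed directory of raw JSON model outputs
analysis_csv = "gs://<my-bucket-name>/analysis_outputs/added_columns/flags_20250620_153641.csv"
named_file = "gs://<my-bucket-name>/analysis_outputs/20250620_153641_output.json"
merged_file_list = [
    "gs://<my-bucket-name>/analysis_outputs/merged_files/merged_flags_20250620_153641.csv",
]

[parameters]
single_file = "True"     # "False" or omitted: scan gcs_json_dir instead
date_since = "20250620"  # YYYYMMDD; only used when single_file is "False"
```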
Prerequisites & How to Run
- In the config, do a global replace of the placeholder `<my-bucket-name>` with the correct bucket location.
- If `run_json_processor.ipynb` is being run against the current JSON output file, the paths defined in the config are all set. If it is to be run against a different data set, the GCS location of that input file needs to be put into `batch_filepath` and the location of the JSON output put into `named_file`.
- The runner script is a Jupytext file. To run it, first convert it to a `.ipynb` notebook (see the command below), then open `run_json_processor.ipynb` in Jupyter.
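The conversion command is the same one given in the test steps below:

```bash
jupytext --to ipynb notebooks/run_json_processor.py
```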
✅ Checklist
- Terraform checks (`terraform fmt` & `terraform validate`)
🔍 How to Test
1. In `prepare_config.toml`, ensure you do a global replace of the placeholder text `<my-bucket-name>` with the actual GCS bucket name (I have deliberately changed this for security reasons).
2. Convert the runner script to a notebook: `jupytext --to ipynb notebooks/run_json_processor.py`
3. `batch_filepath = "gs://<my-bucket-name>/evaluation_data/DSC_Rep_Sample.csv"` is the source of the original data that has been run by the batch script.
4. `named_file = "gs://<my-bucket-name>/analysis_outputs/20250620_153641_output.json"` is the single JSON output file to process. Since this PR is processing these files, do not change these filenames. For a future evaluation, we will change the filenames to the appropriate locations.
5. `analysis_csv = "gs://<my-bucket-name>/analysis_outputs/added_columns/flags_20250620_153641.csv"` is the target location of the output from the script `prepare_evaluation_data_for_analysis.py`. The exact location is arbitrary.
6. `merged_file = "gs://<my-bucket-name>/analysis_outputs/merged_files/merged_flags_20250620_153641.csv"` is the file that is created by `full_output_df = preprocessor.merge_eval_data(llm_processed_df)` in the notebook. I have already written it to the bucket.
7. Running `run_json_processor.ipynb` will perform a full end-to-end test.
Unit Tests:
Run the new unit tests to verify the class logic in isolation:
```bash
pytest tests/test_coder_alignment.py -vv -s
```