
Conversation


@marklondon999 marklondon999 commented Jun 20, 2025

Title: Introduce JsonPreprocessor for Data Processing

Summary

This pull request introduces a configuration-driven workflow for processing raw JSON outputs from Survey Assist, preparing the data for analysis, and running evaluation metrics.

Background

A labeled set of clerical coder assignments has been run through Survey Assist, and the JSON output file(s) need to be processed.
Currently, there is no standardised method for processing the JSON files produced by batch runs of Survey Assist, which makes it difficult to evaluate the performance of different models or prompts. This change introduces a pipeline to solve that problem.

Motivation

This PR links the data set outputs of the Docker batch runs to the module that calculates the final metrics. The JsonPreprocessor class allows a developer to take the raw JSON output from a model run, flatten it into a structured format, merge it with the human-coded input SIC data, and generate a set of performance metrics, all controlled by a configuration file.

Changes Introduced

JsonPreprocessor Class:

A new class responsible for fetching and processing raw JSON files directly from Google Cloud Storage (GCS). It can operate in two modes:

  • Single File Mode: Processes a single, specified JSON file.
  • Directory Mode: Scans a GCS directory and processes all files created on or after a specified date, on the assumption that these are the files required. This is used when a run has been interrupted, or is too large to run in one go. A sketch of this mode selection is shown below.
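
As a rough illustration, a mode-selection routine matching this description might look like the sketch below. The google-cloud-storage calls are standard and the config keys (single_file, named_file, gcs_bucket_name, gcs_json_dir, date_since) come from this PR's description, but the function itself and its exact filtering rules are assumptions, not the actual implementation:

```python
# Illustrative sketch only: how the two modes described above could work.
# The google.cloud.storage calls are real; the function and filtering rules
# are assumptions based on this PR's description.
from datetime import datetime, timezone

from google.cloud import storage


def select_json_files(config: dict) -> list[str]:
    """Return the GCS paths of the JSON files to process, per the config."""
    params = config.get("parameters", {})
    paths = config["paths"]

    if str(params.get("single_file", "False")) == "True":
        # Single File Mode: just the one file named in the config.
        return [paths["named_file"]]

    # Directory Mode: list every blob under gcs_json_dir created on or after
    # date_since (YYYYMMDD), assuming those are the files from the run.
    cutoff = datetime.strptime(str(params["date_since"]), "%Y%m%d").replace(
        tzinfo=timezone.utc
    )
    client = storage.Client()
    blobs = client.list_blobs(paths["gcs_bucket_name"], prefix=paths["gcs_json_dir"])
    return [
        f"gs://{paths['gcs_bucket_name']}/{blob.name}"
        for blob in blobs
        if blob.name.endswith(".json") and blob.time_created >= cutoff
    ]
```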

Runner Script (run_json_processor.py):

A Jupytext notebook (a Jupyter Notebook stored in .py format) that uses a .toml configuration file to manage the entire workflow, from fetching data to producing the final metrics.

  • Note: the config also allows a list of prepared files to be processed through the LabelAccuracy class, so that different experiments can be compared (see the sketch below).
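
A hypothetical sketch of that comparison step is shown below. The LabelAccuracy class and the merged_file_list config key come from this PR's description; the import path (not shown here), constructor signature, and the way metrics are reported are assumptions made purely for illustration:

```python
# Hypothetical sketch of the comparison step driven by merged_file_list.
# LabelAccuracy and the merged_file_list key come from this PR; the
# constructor signature and reporting are illustrative assumptions.
import tomllib

import pandas as pd

with open("prepare_config.toml", "rb") as f:
    config = tomllib.load(f)

for merged_file in config["paths"]["merged_file_list"]:
    merged_df = pd.read_csv(merged_file)   # reading a gs:// path requires gcsfs to be installed
    evaluator = LabelAccuracy(merged_df)   # import not shown in this PR; constructor assumed
    print(merged_file, evaluator)          # compare the experiments' metrics side by side
```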

Control

How the prepare_config.toml Works

The process is controlled by the prepare_config.toml file, which is broken down into two main sections:

[paths]

This section defines all the input and output locations in GCS.

  • batch_filepath: The location of the original, human-coded input data. This is a source location.
  • gcs_bucket_name: The name of the GCS bucket where data is stored.
  • gcs_json_dir: The directory within the bucket where the raw JSON output files from model runs are located. This is a source location.
  • analysis_csv: The location where the human-coded data, enriched with quality flags by the prepare_evaluation_data_for_analysis.py script, is written. This is a destination location.
  • named_file: The full GCS path to a single JSON output file to be processed. This is only used if single_file is set to "True". This is a source location.
  • merged_file_list: A list of pre-processed CSV files that can be run through the final evaluation metrics. These are source locations.
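
For illustration, a [paths] section with this shape might look like the following. The bucket name and file names are placeholders modelled on the test instructions later in this description, not real locations; it is parsed here with Python's standard tomllib (3.11+):

```python
# Illustrative only: placeholder values showing the shape of the [paths]
# section described above, parsed with the standard-library tomllib.
import tomllib

example = """
[paths]
batch_filepath   = "gs://<my-bucket-name>/evaluation_data/DSC_Rep_Sample.csv"
gcs_bucket_name  = "<my-bucket-name>"
gcs_json_dir     = "analysis_outputs"
analysis_csv     = "gs://<my-bucket-name>/analysis_outputs/added_columns/flags_example.csv"
named_file       = "gs://<my-bucket-name>/analysis_outputs/example_output.json"
merged_file_list = ["gs://<my-bucket-name>/analysis_outputs/merged_files/merged_flags_example.csv"]
"""

paths = tomllib.loads(example)["paths"]
print(paths["gcs_bucket_name"])
```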

[parameters]

This section controls the behaviour of the JsonPreprocessor.

  • single_file: If set to "True", the processor will only use the file specified in named_file. If set to "False" or omitted, it will scan the gcs_json_dir directory.
  • date_since: When single_file is false, this tells the processor to only consider files in the directory created on or after this date (in YYYYMMDD format).
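
For example, a [parameters] section like the one below (the values are examples only, not the PR's settings) would put the processor into directory mode and make it consider only files created on or after 1 June 2025:

```python
# Illustrative [parameters] section; values are examples only.
import tomllib

params = tomllib.loads("""
[parameters]
single_file = "False"
date_since  = "20250601"
""")["parameters"]

scan_directory = params.get("single_file") != "True"   # True -> directory mode
print(scan_directory, params["date_since"])
```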

Prerequisites & How to Run

  1. Authenticate with Google Cloud:
    gcloud auth application-default login
    
  2. Update the Configuration: First, replace the placeholder <my-bucket-name> in the config with the correct bucket name.
  3. If the notebook run_json_processor.ipynb is run against the current JSON output file, the paths defined in the config are already set. To run it against a different data set, put the GCS location of that input file into batch_filepath and the location of the JSON output into named_file.
  4. Convert and Run the Notebook:
    The runner script is a Jupytext file. To run it, first convert it to a .ipynb notebook:
    jupytext --to ipynb notebooks/run_json_processor.py
    
    Then, open and run run_json_processor.ipynb in Jupyter.

✅ Checklist

Please confirm you've completed these checks before requesting a review.

  • [✅] Code is formatted using Black
  • [✅] Imports are sorted using isort
  • [✅] Code passes linting with Ruff, Pylint, and Mypy
  • [✅] Security checks pass using Bandit
  • [✅] API and Unit tests are written and pass using pytest
  • [NA] Terraform files (if applicable) follow best practices and have been validated (terraform fmt & terraform validate)
  • [✅] DocStrings follow Google-style and are added as per Pylint recommendations
  • [NA] Documentation has been updated if needed

🔍 How to Test

  1. In the file prepare_config.toml, do a global replace of the placeholder text <my-bucket-name> with the actual GCS bucket name (I have deliberately changed this for security reasons).
  2. Create the notebook using jupytext --to ipynb notebooks/run_json_processor.py
  3. The file batch_filepath = "gs://<my-bucket-name>/evaluation_data/DSC_Rep_Sample.csv" is the source of the original data that was run through the batch script.
  4. The batch process, when it was run, wrote the JSON output to this file: named_file = "gs://<my-bucket-name>/analysis_outputs/20250620_153641_output.json"
    Since this PR is processing these files, do not change these filenames. For a future evaluation, we will change the filenames to the appropriate locations.
  5. analysis_csv = "gs://<my-bucket-name>/analysis_outputs/added_columns/flags_20250620_153641.csv" is the target location of the output from the script prepare_evaluation_data_for_analysis.py. The exact location is arbitrary.
  6. This file has already been created by the process:
    merged_file = "gs://<my-bucket-name>/analysis_outputs/merged_files/merged_flags_20250620_153641.csv"
    This is the file that is created by full_output_df = preprocessor.merge_eval_data(llm_processed_df) in the notebook. I have already written it to the bucket.

Running run_json_processor.ipynb will perform a full end-to-end test.
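
For orientation, the core of the notebook's flow is roughly sketched below. Only JsonPreprocessor, merge_eval_data, and the merged-file output location are named in this PR; the process_json_files() name and the write step are hypothetical placeholders:

```python
# Rough outline of the notebook's end-to-end flow, for orientation only.
# merge_eval_data and the merged-file location come from this PR; the
# process_json_files() name and the to_csv write are hypothetical.
preprocessor = JsonPreprocessor(config)

llm_processed_df = preprocessor.process_json_files()             # hypothetical: flatten the raw JSON output
full_output_df = preprocessor.merge_eval_data(llm_processed_df)  # merge with the human-coded SIC data

# Writing to a gs:// path requires gcsfs; merged_file is the location quoted
# in step 6 above.
full_output_df.to_csv(merged_file, index=False)
```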

Unit Tests:

Run the new unit tests to verify the class logic in isolation:

pytest tests/test_coder_alignment.py -vv -s

…a_for_analysis.py to allow full end to end run
@marklondon999 marklondon999 marked this pull request as ready for review July 23, 2025 14:07
@marklondon999 marklondon999 requested a review from djyldyz July 24, 2025 10:14
@@ -0,0 +1,356 @@
# ---

Why does this need to be a notebook rather than a Python script? What are the advantages of having it as a notebook?

@@ -0,0 +1,356 @@
# ---
# jupyter:

There should be a docstring explaining the purpose of the script at the top.
